Re: [PATCH V4 6/7] kvm tools: Add PPC64 PCI Host Bridge

2012-01-31 Thread Alexey Kardashevskiy
On 01/02/12 14:40, David Gibson wrote:
 On Tue, Jan 31, 2012 at 05:34:41PM +1100, Matt Evans wrote:
 This provides the PCI bridge, definitions for the address layout of the
 windows, and wires in IRQs. Once PCI devices are all registered, they are
 enumerated and DT nodes are generated for each.

 Signed-off-by: Matt Evans m...@ozlabs.org
 
 For the bits derived from my qemu code:
 
 Signed-off-by: David Gibson da...@gibson.dropbear.id.au
 

For the bits derived from my qemu code:

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru


-- 
Alexey


[PATCH 0/3] adding MSI/MSIX for PCI on POWER

2012-06-13 Thread Alexey Kardashevskiy
The following patches add MSI/MSIX support for PCI on POWER.
The first aim is virtio-pci, so that is what was tested. They will also
support VFIO when it becomes publicly available.

Alexey Kardashevskiy (3):
  msi/msix: added functions to API to set up message address and data
  pseries: added allocator for a block of IRQs
  pseries pci: added MSI/MSIX support

 hw/msi.c   |   14 +++
 hw/msi.h   |1 +
 hw/msix.c  |   10 ++
 hw/msix.h  |3 +
 hw/spapr.c |   26 +-
 hw/spapr.h |1 +
 hw/spapr_pci.c |  266 +--
 hw/spapr_pci.h |   13 +++-
 trace-events   |9 ++
 9 files changed, 331 insertions(+), 12 deletions(-)

-- 
1.7.7.3



[PATCH 1/3] msi/msix: added functions to API to set up message address and data

2012-06-13 Thread Alexey Kardashevskiy

Normally QEMU expects the guest to initialize MSI/MSIX vectors.
However on POWER the guest uses the RTAS subsystem to configure MSI/MSIX and
does not write these vectors to the device's config space or MSIX BAR.

On the other hand, msi_notify()/msix_notify() write to these vectors to
signal the guest about an interrupt, so we have to write the correct vectors
to the devices in order not to change every user of MSI/MSIX.

The first aim is to support MSIX for virtio-pci on POWER. There is
another patch for POWER coming which introduces a special memory region
where MSI/MSIX vectors point to.

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
---
 hw/msi.c  |   14 ++
 hw/msi.h  |1 +
 hw/msix.c |   10 ++
 hw/msix.h |3 +++
 4 files changed, 28 insertions(+), 0 deletions(-)

diff --git a/hw/msi.c b/hw/msi.c
index 5d6ceb6..124878a 100644
--- a/hw/msi.c
+++ b/hw/msi.c
@@ -358,3 +358,17 @@ unsigned int msi_nr_vectors_allocated(const PCIDevice *dev)
     uint16_t flags = pci_get_word(dev->config + msi_flags_off(dev));
     return msi_nr_vectors(flags);
 }
+
+void msi_set_address_data(PCIDevice *dev, uint64_t address, uint16_t data)
+{
+    uint16_t flags = pci_get_word(dev->config + msi_flags_off(dev));
+    bool msi64bit = flags & PCI_MSI_FLAGS_64BIT;
+
+    if (msi64bit) {
+        pci_set_quad(dev->config + msi_address_lo_off(dev), address);
+    } else {
+        pci_set_long(dev->config + msi_address_lo_off(dev), address);
+    }
+    pci_set_word(dev->config + msi_data_off(dev, msi64bit), data);
+}
+
diff --git a/hw/msi.h b/hw/msi.h
index 3040bb0..0acf434 100644
--- a/hw/msi.h
+++ b/hw/msi.h
@@ -34,6 +34,7 @@ void msi_reset(PCIDevice *dev);
 void msi_notify(PCIDevice *dev, unsigned int vector);
 void msi_write_config(PCIDevice *dev, uint32_t addr, uint32_t val, int len);
 unsigned int msi_nr_vectors_allocated(const PCIDevice *dev);
+void msi_set_address_data(PCIDevice *dev, uint64_t address, uint16_t data);

 static inline bool msi_present(const PCIDevice *dev)
 {
diff --git a/hw/msix.c b/hw/msix.c
index 3835eaa..c57c299 100644
--- a/hw/msix.c
+++ b/hw/msix.c
@@ -414,3 +414,13 @@ void msix_unuse_all_vectors(PCIDevice *dev)
         return;
     msix_free_irq_entries(dev);
 }
+
+void msix_set_address_data(PCIDevice *dev, int vector,
+                           uint64_t address, uint32_t data)
+{
+    uint8_t *table_entry = dev->msix_table_page + vector * PCI_MSIX_ENTRY_SIZE;
+    pci_set_quad(table_entry + PCI_MSIX_ENTRY_LOWER_ADDR, address);
+    pci_set_long(table_entry + PCI_MSIX_ENTRY_DATA, data);
+    table_entry[PCI_MSIX_ENTRY_VECTOR_CTRL] &= ~PCI_MSIX_ENTRY_CTRL_MASKBIT;
+}
+
diff --git a/hw/msix.h b/hw/msix.h
index 5aba22b..e6bb696 100644
--- a/hw/msix.h
+++ b/hw/msix.h
@@ -29,4 +29,7 @@ void msix_notify(PCIDevice *dev, unsigned vector);

 void msix_reset(PCIDevice *dev);

+void msix_set_address_data(PCIDevice *dev, int vector,
+   uint64_t address, uint32_t data);
+
 #endif
-- 
1.7.7.3
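
As a usage illustration (not part of the patch), a platform-side caller that
allocated the interrupts itself might program a device like this; the helper
name and the address/data scheme are hypothetical:

/* Sketch only: program the vectors from the hypervisor side. */
static void spapr_msi_program_device(PCIDevice *pdev, bool msix, unsigned nvec,
                                     uint64_t msi_win_addr, uint16_t first_irq)
{
    unsigned i;

    if (!msix) {
        /* plain MSI has a single address/data pair for the whole device */
        msi_set_address_data(pdev, msi_win_addr, first_irq);
        return;
    }
    for (i = 0; i < nvec; ++i) {
        /* each MSI-X vector gets its own data value */
        msix_set_address_data(pdev, i, msi_win_addr, first_irq + i);
    }
}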


[PATCH 2/3] pseries: added allocator for a block of IRQs

2012-06-13 Thread Alexey Kardashevskiy

The patch adds a simple helper which allocates a block of consecutive IRQs,
calling spapr_allocate_irq for each one and checking that the allocated IRQ
numbers are consecutive.

The patch is required for upcoming support of MSI/MSIX on POWER.

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
---
 hw/spapr.c |   19 +++
 hw/spapr.h |1 +
 2 files changed, 20 insertions(+), 0 deletions(-)

diff --git a/hw/spapr.c b/hw/spapr.c
index 2e0b4b8..ef6ffcb 100644
--- a/hw/spapr.c
+++ b/hw/spapr.c
@@ -113,6 +113,25 @@ qemu_irq spapr_allocate_irq(uint32_t hint, uint32_t *irq_num,
     return qirq;
 }
 
+/* Allocate a block of consecutive IRQs; returns the number of the first one */
+int spapr_allocate_irq_block(uint32_t num, enum xics_irq_type type)
+{
+    int i, ret;
+    uint32_t irq = -1;
+
+    for (i = 0; i < num; ++i) {
+        if (!spapr_allocate_irq(0, &irq, type)) {
+            return -1;
+        }
+        if (0 == i) {
+            ret = irq;
+        } else if (ret + i != irq) {
+            return -1;
+        }
+    }
+    return ret;
+}
+
 static int spapr_set_associativity(void *fdt, sPAPREnvironment *spapr)
 {
 int ret = 0, offset;
diff --git a/hw/spapr.h b/hw/spapr.h
index 502393a..408b470 100644
--- a/hw/spapr.h
+++ b/hw/spapr.h
@@ -289,6 +289,7 @@ target_ulong spapr_hypercall(CPUPPCState *env, target_ulong opcode,

 qemu_irq spapr_allocate_irq(uint32_t hint, uint32_t *irq_num,
 enum xics_irq_type type);
+int spapr_allocate_irq_block(uint32_t num, enum xics_irq_type type);

 static inline qemu_irq spapr_allocate_msi(uint32_t hint, uint32_t *irq_num)
 {
-- 
1.7.7.3
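
A minimal usage sketch (assuming the XICS_MSI type from hw/xics.h; the
function name and error message are illustrative):

static int spapr_msi_alloc_example(void)
{
    /* allocate four consecutive IRQs for a 4-vector MSI device */
    int first_irq = spapr_allocate_irq_block(4, XICS_MSI);

    if (first_irq < 0) {
        fprintf(stderr, "spapr: failed to allocate 4 consecutive IRQs\n");
        return -1;
    }
    /* MSI vector n now maps to XICS IRQ first_irq + n */
    return first_irq;
}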


[PATCH 3/3] pseries pci: added MSI/MSIX support

2012-06-13 Thread Alexey Kardashevskiy
virtio-pci expects the guest to set up the MSI message address and data, and
to do other initialization such as vector number negotiation.
It also notifies the guest by writing an MSI message to the previously set
address.

This patch includes:

1. RTAS call "ibm,change-msi", which sets up the number of MSI vectors for
a device. Note that this call may configure and return fewer vectors
than requested.

2. RTAS call "ibm,query-interrupt-source-number", which translates an MSI
vector to an interrupt controller (XICS) IRQ number.

3. A config_space_address-to-msi_table map to provide IRQ resolution from
a config address, as the MSI RTAS calls take a PCI config space address as
a device identifier.

4. An MSIX memory region added to catch msi_notify()/msix_notify() writes
from virtio-pci and pass them to the guest via qemu_irq_pulse().

This patch depends on the "msi/msix: added functions to API to set up
message address and data" patch.

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
---
 hw/spapr.c |9 ++-
 hw/spapr_pci.c |  266 +--
 hw/spapr_pci.h |   13 +++-
 trace-events   |9 ++
 4 files changed, 284 insertions(+), 13 deletions(-)

diff --git a/hw/spapr.c b/hw/spapr.c
index ef6ffcb..35ad075 100644
--- a/hw/spapr.c
+++ b/hw/spapr.c
@@ -43,6 +43,7 @@
 #include "hw/spapr_vio.h"
 #include "hw/spapr_pci.h"
 #include "hw/xics.h"
+#include "hw/msi.h"
 
 #include "kvm.h"
 #include "kvm_ppc.h"
@@ -82,6 +83,7 @@
 #define SPAPR_PCI_MEM_WIN_ADDR  (0x100ULL + 0xA000)
 #define SPAPR_PCI_MEM_WIN_SIZE  0x2000
 #define SPAPR_PCI_IO_WIN_ADDR   (0x100ULL + 0x8000)
+#define SPAPR_PCI_MSI_WIN_ADDR  (0x100ULL + 0x9000)

 #define PHANDLE_XICP0x

@@ -116,7 +118,7 @@ qemu_irq spapr_allocate_irq(uint32_t hint, uint32_t *irq_num,
 /* Allocate a block of consecutive IRQs; returns the number of the first one */
 int spapr_allocate_irq_block(uint32_t num, enum xics_irq_type type)
 {
-    int i, ret;
+    int i, ret = 0;
     uint32_t irq = -1;
 
     for (i = 0; i < num; ++i) {
@@ -690,6 +692,8 @@ static void ppc_spapr_init(ram_addr_t ram_size,
 long pteg_shift = 17;
 char *filename;

+msi_supported = true;
+
 spapr = g_malloc0(sizeof(*spapr));
 QLIST_INIT(spapr-phbs);

@@ -804,7 +808,8 @@ static void ppc_spapr_init(ram_addr_t ram_size,
     spapr_create_phb(spapr, "pci", SPAPR_PCI_BUID,
                      SPAPR_PCI_MEM_WIN_ADDR,
                      SPAPR_PCI_MEM_WIN_SIZE,
-                     SPAPR_PCI_IO_WIN_ADDR);
+                     SPAPR_PCI_IO_WIN_ADDR,
+                     SPAPR_PCI_MSI_WIN_ADDR);

 for (i = 0; i  nb_nics; i++) {
 NICInfo *nd = nd_table[i];
diff --git a/hw/spapr_pci.c b/hw/spapr_pci.c
index 93017cd..21fbc50 100644
--- a/hw/spapr_pci.c
+++ b/hw/spapr_pci.c
@@ -24,31 +24,46 @@
  */
 #include "hw.h"
 #include "pci.h"
+#include "msix.h"
+#include "msi.h"
 #include "pci_host.h"
 #include "hw/spapr.h"
 #include "hw/spapr_pci.h"
 #include "exec-memory.h"
 #include "libfdt.h"
+#include "trace.h"
 
 #include "hw/pci_internals.h"

-static PCIDevice *find_dev(sPAPREnvironment *spapr,
-                           uint64_t buid, uint32_t config_addr)
+static sPAPRPHBState *find_phb(sPAPREnvironment *spapr, uint64_t buid)
 {
-    DeviceState *qdev;
-    int devfn = (config_addr >> 8) & 0xFF;
     sPAPRPHBState *phb;
 
     QLIST_FOREACH(phb, spapr->phbs, list) {
         if (phb->buid != buid) {
             continue;
         }
+        return phb;
+    }
 
-        QTAILQ_FOREACH(qdev, phb->host_state.bus->qbus.children, sibling) {
-            PCIDevice *dev = (PCIDevice *)qdev;
-            if (dev->devfn == devfn) {
-                return dev;
-            }
+    return NULL;
+}
+
+static PCIDevice *find_dev(sPAPREnvironment *spapr, uint64_t buid,
+                           uint32_t config_addr)
+{
+    sPAPRPHBState *phb = find_phb(spapr, buid);
+    DeviceState *qdev;
+    int devfn = (config_addr >> 8) & 0xFF;
+
+    if (!phb) {
+        return NULL;
+    }
+
+    QTAILQ_FOREACH(qdev, phb->host_state.bus->qbus.children, sibling) {
+        PCIDevice *dev = (PCIDevice *)qdev;
+        if (dev->devfn == devfn) {
+            return dev;
         }
     }

@@ -138,6 +153,220 @@ static void rtas_write_pci_config(sPAPREnvironment *spapr,
 rtas_st(rets, 0, 0);
 }

+/*
+ * Initializes req_num vectors for a device.
+ * The code assumes that MSI/MSIX is enabled in the config space
+ * as a result of msix_init() or msi_init().
+ */
+static int spapr_pci_config_msi(sPAPRPHBState *ph, int ndev,
+                                PCIDevice *pdev, bool msix, unsigned req_num)
+{
+    unsigned i;
+    int irq;
+    uint64_t msi_address;
+    uint32_t config_addr = pdev->devfn << 8;
+
+    /* Disabling - nothing to do */
+    if (0 == req_num) {
+        return 0;
+    }
+
+    /* Enabling! */
+    if (ph->msi_table[ndev].nvec && (req_num != ph->msi_table[ndev].nvec)) {
+        /* Unexpected behaviour */
+        fprintf(stderr, "Cannot reuse cached MSI config

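For reference, the config space address used by the RTAS MSI calls encodes
devfn in bits 8-15, as decoded by find_dev() above; a small sketch with
illustrative helper names:

static inline uint32_t rtas_config_addr_from_devfn(int devfn)
{
    return (devfn & 0xFF) << 8;       /* matches config_addr = pdev->devfn << 8 */
}

static inline int devfn_from_rtas_config_addr(uint32_t config_addr)
{
    return (config_addr >> 8) & 0xFF; /* matches find_dev() */
}
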
[PATCH] pseries pci: removed redundant busdev

2012-06-13 Thread Alexey Kardashevskiy
The PCIHostState struct already contains a SysBusDevice, so
the one in sPAPRPHBState has to go.

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
---
 hw/spapr_pci.c |4 ++--
 hw/spapr_pci.h |1 -
 2 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/hw/spapr_pci.c b/hw/spapr_pci.c
index 75943cf..1c0b605 100644
--- a/hw/spapr_pci.c
+++ b/hw/spapr_pci.c
@@ -215,7 +215,7 @@ static DMAContext *spapr_pci_dma_context_fn(PCIBus *bus, void *opaque,
 
 static int spapr_phb_init(SysBusDevice *s)
 {
-    sPAPRPHBState *phb = FROM_SYSBUS(sPAPRPHBState, s);
+    sPAPRPHBState *phb = DO_UPCAST(sPAPRPHBState, host_state.busdev, s);
     char *namebuf;
     int i;
     PCIBus *bus;
@@ -253,7 +253,7 @@ static int spapr_phb_init(SysBusDevice *s)
     memory_region_add_subregion(get_system_memory(), phb->io_win_addr,
                                 phb->iowindow);
 
-    bus = pci_register_bus(phb->busdev.qdev,
+    bus = pci_register_bus(phb->host_state.busdev.qdev,
                            phb->busname ? phb->busname : phb->dtbusname,
                            pci_spapr_set_irq, pci_spapr_map_irq, phb,
                            phb->memspace, phb->iospace,
diff --git a/hw/spapr_pci.h b/hw/spapr_pci.h
index d9e46e2..a141764 100644
--- a/hw/spapr_pci.h
+++ b/hw/spapr_pci.h
@@ -28,7 +28,6 @@
 #include "hw/xics.h"

 typedef struct sPAPRPHBState {
-SysBusDevice busdev;
 PCIHostState host_state;

 uint64_t buid;
-- 
1.7.7.3


[PATCH] pseries pci: spapr_populate_pci_devices renamed to spapr_populate_pci_dt

2012-06-13 Thread Alexey Kardashevskiy
spapr_populate_pci_devices() populates the device tree only with bus
properties and has nothing to do with the devices on the bus, as PCI BAR
allocation is done by the system firmware (SLOF).

New name - spapr_populate_pci_dt() - describes the functionality better.

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
---
 hw/spapr.c |2 +-
 hw/spapr_pci.c |6 +++---
 hw/spapr_pci.h |6 +++---
 3 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/hw/spapr.c b/hw/spapr.c
index 47b26ee..2e0b4b8 100644
--- a/hw/spapr.c
+++ b/hw/spapr.c
@@ -551,7 +551,7 @@ static void spapr_finalize_fdt(sPAPREnvironment *spapr,
 }

     QLIST_FOREACH(phb, spapr->phbs, list) {
-        ret = spapr_populate_pci_devices(phb, PHANDLE_XICP, fdt);
+        ret = spapr_populate_pci_dt(phb, PHANDLE_XICP, fdt);
     }
 
     if (ret < 0) {
diff --git a/hw/spapr_pci.c b/hw/spapr_pci.c
index 1c0b605..269dbbf 100644
--- a/hw/spapr_pci.c
+++ b/hw/spapr_pci.c
@@ -345,9 +345,9 @@ void spapr_create_phb(sPAPREnvironment *spapr,
 #define b_fff(x)        b_x((x), 8, 3)   /* function number */
 #define b_rrrrrrrr(x)   b_x((x), 0, 8)   /* register number */

-int spapr_populate_pci_devices(sPAPRPHBState *phb,
-   uint32_t xics_phandle,
-   void *fdt)
+int spapr_populate_pci_dt(sPAPRPHBState *phb,
+  uint32_t xics_phandle,
+  void *fdt)
 {
 int bus_off, i, j;
 char nodename[256];
diff --git a/hw/spapr_pci.h b/hw/spapr_pci.h
index a141764..dd66f4b 100644
--- a/hw/spapr_pci.h
+++ b/hw/spapr_pci.h
@@ -55,8 +55,8 @@ void spapr_create_phb(sPAPREnvironment *spapr,
   uint64_t mem_win_addr, uint64_t mem_win_size,
   uint64_t io_win_addr);

-int spapr_populate_pci_devices(sPAPRPHBState *phb,
-   uint32_t xics_phandle,
-   void *fdt);
+int spapr_populate_pci_dt(sPAPRPHBState *phb,
+  uint32_t xics_phandle,
+  void *fdt);

 #endif /* __HW_SPAPR_PCI_H__ */
-- 
1.7.7.3


[PATCH] trace: added ability to comment out events in the list

2012-06-13 Thread Alexey Kardashevskiy
It is convenient for debugging to be able to switch individual events on and
off easily. Currently the only way is to remove an event name from the file
completely and type it in again when we want it back.

The patch adds handling of the '#' symbol as a comment specifier.

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
---
 trace/control.c |3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/trace/control.c b/trace/control.c
index 4c5527d..22d5863 100644
--- a/trace/control.c
+++ b/trace/control.c
@@ -27,6 +27,9 @@ void trace_backend_init_events(const char *fname)
         size_t len = strlen(line_buf);
         if (len > 1) {              /* skip empty lines */
             line_buf[len - 1] = '\0';
+            if ('#' == line_buf[0]) { /* skip commented lines */
+                continue;
+            }
             if (!trace_event_set_state(line_buf, true)) {
                 fprintf(stderr,
                         "error: trace event '%s' does not exist\n", line_buf);
-- 
1.7.7.3
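
With this patch, an events file (passed via -trace events=<file>) can keep
disabled events around as comments; the event names below are illustrative:

qemu_vfree
# temporarily disabled:
#qemu_vmalloc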


Re: [PATCH 0/3] adding MSI/MSIX for PCI on POWER

2012-06-13 Thread Alexey Kardashevskiy
Forgot to CC: someone :)

On 14/06/12 14:29, Alexey Kardashevskiy wrote:
 The following patches add MSI/MSIX support for PCI on POWER.
 The first aim is virtio-pci, so that is what was tested. They will also
 support VFIO when it becomes publicly available.
 
 Alexey Kardashevskiy (3):
   msi/msix: added functions to API to set up message address and data
   pseries: added allocator for a block of IRQs
   pseries pci: added MSI/MSIX support
 
  hw/msi.c   |   14 +++
  hw/msi.h   |1 +
  hw/msix.c  |   10 ++
  hw/msix.h  |3 +
  hw/spapr.c |   26 +-
  hw/spapr.h |1 +
  hw/spapr_pci.c |  266 +--
  hw/spapr_pci.h |   13 +++-
  trace-events   |9 ++
  9 files changed, 331 insertions(+), 12 deletions(-)
 


-- 
Alexey


Re: [Qemu-devel] [PATCH 1/3] msi/msix: added functions to API to set up message address and data

2012-06-13 Thread Alexey Kardashevskiy
On 14/06/12 14:56, Alex Williamson wrote:
 On Thu, 2012-06-14 at 14:31 +1000, Alexey Kardashevskiy wrote:
 Normally QEMU expects the guest to initialize MSI/MSIX vectors.
 However on POWER the guest uses RTAS subsystem to configure MSI/MSIX and
 does not write these vectors to device's config space or MSIX BAR.

 On the other hand, msi_notify()/msix_notify() write to these vectors to
 signal the guest about an interrupt so we have to write correct vectors
 to the devices in order not to change every user of MSI/MSIX.

 The first aim is to support MSIX for virtio-pci on POWER. There is
 another patch for POWER coming which introduces a special memory region
 where MSI/MSIX vectors point to.

 Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
 ---
  hw/msi.c  |   14 ++
  hw/msi.h  |1 +
  hw/msix.c |   10 ++
  hw/msix.h |3 +++
  4 files changed, 28 insertions(+), 0 deletions(-)

 diff --git a/hw/msi.c b/hw/msi.c
 index 5d6ceb6..124878a 100644
 --- a/hw/msi.c
 +++ b/hw/msi.c
 @@ -358,3 +358,17 @@ unsigned int msi_nr_vectors_allocated(const PCIDevice *dev)
      uint16_t flags = pci_get_word(dev->config + msi_flags_off(dev));
      return msi_nr_vectors(flags);
  }
 +
 +void msi_set_address_data(PCIDevice *dev, uint64_t address, uint16_t data)
 +{
 +    uint16_t flags = pci_get_word(dev->config + msi_flags_off(dev));
 +    bool msi64bit = flags & PCI_MSI_FLAGS_64BIT;
 +
 +    if (msi64bit) {
 +        pci_set_quad(dev->config + msi_address_lo_off(dev), address);
 +    } else {
 +        pci_set_long(dev->config + msi_address_lo_off(dev), address);
 +    }
 +    pci_set_word(dev->config + msi_data_off(dev, msi64bit), data);
 +}
 
 Why not make it msi_set_message() and pass MSIMessage?  It'd be great if
 you tossed in a msi_get_message() as well; I think we need it to be able
 to do a kvm_irqchip_add_msi_route() with MSI.  Thanks,


I am missing the point. What is that MSIMessage?
It is just an address and data; making a struct from this is a bit too much :)
I am totally unfamiliar with kvm_irqchip_add_msi_route, so I cannot see the
bigger picture, sorry.


 Alex
 
 +
 diff --git a/hw/msi.h b/hw/msi.h
 index 3040bb0..0acf434 100644
 --- a/hw/msi.h
 +++ b/hw/msi.h
 @@ -34,6 +34,7 @@ void msi_reset(PCIDevice *dev);
  void msi_notify(PCIDevice *dev, unsigned int vector);
  void msi_write_config(PCIDevice *dev, uint32_t addr, uint32_t val, int len);
  unsigned int msi_nr_vectors_allocated(const PCIDevice *dev);
 +void msi_set_address_data(PCIDevice *dev, uint64_t address, uint16_t data);

  static inline bool msi_present(const PCIDevice *dev)
  {
 diff --git a/hw/msix.c b/hw/msix.c
 index 3835eaa..c57c299 100644
 --- a/hw/msix.c
 +++ b/hw/msix.c
 @@ -414,3 +414,13 @@ void msix_unuse_all_vectors(PCIDevice *dev)
  return;
  msix_free_irq_entries(dev);
  }
 +
 +void msix_set_address_data(PCIDevice *dev, int vector,
 +                           uint64_t address, uint32_t data)
 +{
 +    uint8_t *table_entry = dev->msix_table_page + vector * PCI_MSIX_ENTRY_SIZE;
 +    pci_set_quad(table_entry + PCI_MSIX_ENTRY_LOWER_ADDR, address);
 +    pci_set_long(table_entry + PCI_MSIX_ENTRY_DATA, data);
 +    table_entry[PCI_MSIX_ENTRY_VECTOR_CTRL] &= ~PCI_MSIX_ENTRY_CTRL_MASKBIT;
 +}
 +
 diff --git a/hw/msix.h b/hw/msix.h
 index 5aba22b..e6bb696 100644
 --- a/hw/msix.h
 +++ b/hw/msix.h
 @@ -29,4 +29,7 @@ void msix_notify(PCIDevice *dev, unsigned vector);

  void msix_reset(PCIDevice *dev);

 +void msix_set_address_data(PCIDevice *dev, int vector,
 +   uint64_t address, uint32_t data);
 +
  #endif
 
 
 


-- 
Alexey


Re: [Qemu-devel] [PATCH 1/3] msi/msix: added functions to API to set up message address and data

2012-06-13 Thread Alexey Kardashevskiy
On 14/06/12 15:38, Alex Williamson wrote:
 On Thu, 2012-06-14 at 15:17 +1000, Alexey Kardashevskiy wrote:
 On 14/06/12 14:56, Alex Williamson wrote:
 On Thu, 2012-06-14 at 14:31 +1000, Alexey Kardashevskiy wrote:
 Normally QEMU expects the guest to initialize MSI/MSIX vectors.
 However on POWER the guest uses RTAS subsystem to configure MSI/MSIX and
 does not write these vectors to device's config space or MSIX BAR.

 On the other hand, msi_notify()/msix_notify() write to these vectors to
 signal the guest about an interrupt so we have to write correct vectors
 to the devices in order not to change every user of MSI/MSIX.

 The first aim is to support MSIX for virtio-pci on POWER. There is
 another patch for POWER coming which introduces a special memory region
 where MSI/MSIX vectors point to.

 Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
 ---
  hw/msi.c  |   14 ++
  hw/msi.h  |1 +
  hw/msix.c |   10 ++
  hw/msix.h |3 +++
  4 files changed, 28 insertions(+), 0 deletions(-)

 diff --git a/hw/msi.c b/hw/msi.c
 index 5d6ceb6..124878a 100644
 --- a/hw/msi.c
 +++ b/hw/msi.c
 @@ -358,3 +358,17 @@ unsigned int msi_nr_vectors_allocated(const PCIDevice 
 *dev)
  uint16_t flags = pci_get_word(dev-config + msi_flags_off(dev));
  return msi_nr_vectors(flags);
  }
 +
 +void msi_set_address_data(PCIDevice *dev, uint64_t address, uint16_t data)
 +{
 +uint16_t flags = pci_get_word(dev-config + msi_flags_off(dev));
 +bool msi64bit = flags  PCI_MSI_FLAGS_64BIT;
 +
 +if (msi64bit) {
 +pci_set_quad(dev-config + msi_address_lo_off(dev), address);
 +} else {
 +pci_set_long(dev-config + msi_address_lo_off(dev), address);
 +}
 +pci_set_word(dev-config + msi_data_off(dev, msi64bit), data);
 +}

 Why not make it msi_set_message() and pass MSIMessage?  It'd be great if
 you tossed in a msi_get_message() as well; I think we need it to be able
 to do a kvm_irqchip_add_msi_route() with MSI.  Thanks,


 I am missing the point. What is that MSIMessage?
 It is just an address and data; making a struct from this is a bit too much :)
 I am totally unfamiliar with kvm_irqchip_add_msi_route, so I cannot see the
 bigger picture, sorry.
 
 MSIVectorUseNotifier passes a MSIMessage back to the device when a
 vector is unmasked.  We can then add a route in KVM for that message
 with kvm_irqchip_add_msi_route.  Finally, kvm_irqchip_add_irqfd allows
 us to connect that MSI route to an eventfd, such as from virtio or vfio.
 Then MSI eventfds can bypass qemu and be injected directly into KVM and
 on into the guest.  So we seem to already have some standardization on
 passing address/data via an MSIMessage.
 
 You need a set interface, I need a get interface.  msix already has
 a static msix_get_message().  So I'd suggest that an exported
 get/set_message for each seems like the right way to go.  Thanks,

Ok. Slowly :) What QEMU tree are you talking about? git, branch?
There is neither MSIVectorUseNotifier nor MSIMessage in your tree or mine.
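
(For reference, the MSIMessage being discussed is just a two-field struct;
a sketch consistent with the field types used later in this thread:)

struct MSIMessage {
    uint64_t address;   /* message address written by the device */
    uint32_t data;      /* message data written by the device */
};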


-- 
Alexey


Re: [Qemu-devel] [PATCH] trace: added ability to comment out events in the list

2012-06-14 Thread Alexey Kardashevskiy
On 14/06/12 23:18, Stefan Hajnoczi wrote:
 On Thu, Jun 14, 2012 at 02:41:40PM +1000, Alexey Kardashevskiy wrote:
 It is convenient for debug to be able to switch on/off some events easily.
 The only possibility now is to remove event name from the file completely
 and type it again when we want it back.

 The patch adds '#' symbol handling as a comment specifier.

 Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
 ---
  trace/control.c |3 +++
  1 files changed, 3 insertions(+), 0 deletions(-)
 
 Thanks, applied to the tracing patches tree:
 https://github.com/stefanha/qemu/commits/tracing

Cannot find it there though :)



-- 
Alexey


[PATCH] msi/msix: added functions to API to set up message address, and data

2012-06-21 Thread Alexey Kardashevskiy

Ok, another try. Is it any better now? :)


Normally QEMU expects the guest to initialize MSI/MSIX vectors.
However on POWER the guest uses RTAS subsystem to configure MSI/MSIX and
does not write these vectors to device's config space or MSIX BAR.

On the other hand, msi_notify()/msix_notify() write to these vectors to
signal the guest about an interrupt so we have to write correct vectors
to the devices in order not to change every user of MSI/MSIX.

The first aim is to support MSIX for virtio-pci on POWER. There is
another patch for POWER coming which introduces a special memory region
where MSI/MSIX vectors point to.

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
---
 hw/msi.c  |   14 ++
 hw/msi.h  |1 +
 hw/msix.c |8 
 hw/msix.h |3 +++
 4 files changed, 26 insertions(+)

diff --git a/hw/msi.c b/hw/msi.c
index 5233204..c7b3e6a 100644
--- a/hw/msi.c
+++ b/hw/msi.c
@@ -363,3 +363,17 @@ unsigned int msi_nr_vectors_allocated(const PCIDevice *dev)
     uint16_t flags = pci_get_word(dev->config + msi_flags_off(dev));
     return msi_nr_vectors(flags);
 }
+
+void msi_set_address_data(PCIDevice *dev, uint64_t address, uint16_t data)
+{
+    uint16_t flags = pci_get_word(dev->config + msi_flags_off(dev));
+    bool msi64bit = flags & PCI_MSI_FLAGS_64BIT;
+
+    if (msi64bit) {
+        pci_set_quad(dev->config + msi_address_lo_off(dev), address);
+    } else {
+        pci_set_long(dev->config + msi_address_lo_off(dev), address);
+    }
+    pci_set_word(dev->config + msi_data_off(dev, msi64bit), data);
+}
+
diff --git a/hw/msi.h b/hw/msi.h
index 75747ab..353386e 100644
--- a/hw/msi.h
+++ b/hw/msi.h
@@ -39,6 +39,7 @@ void msi_reset(PCIDevice *dev);
 void msi_notify(PCIDevice *dev, unsigned int vector);
 void msi_write_config(PCIDevice *dev, uint32_t addr, uint32_t val, int len);
 unsigned int msi_nr_vectors_allocated(const PCIDevice *dev);
+void msi_set_address_data(PCIDevice *dev, uint64_t address, uint16_t data);
 
 static inline bool msi_present(const PCIDevice *dev)
 {
diff --git a/hw/msix.c b/hw/msix.c
index ded3c55..08e773d 100644
--- a/hw/msix.c
+++ b/hw/msix.c
@@ -526,3 +526,11 @@ void msix_unset_vector_notifiers(PCIDevice *dev)
     dev->msix_vector_use_notifier = NULL;
     dev->msix_vector_release_notifier = NULL;
 }
+void msix_set_address_data(PCIDevice *dev, int vector,
+                           uint64_t address, uint32_t data)
+{
+    uint8_t *table_entry = dev->msix_table_page + vector * PCI_MSIX_ENTRY_SIZE;
+    pci_set_quad(table_entry + PCI_MSIX_ENTRY_LOWER_ADDR, address);
+    pci_set_long(table_entry + PCI_MSIX_ENTRY_DATA, data);
+    table_entry[PCI_MSIX_ENTRY_VECTOR_CTRL] &= ~PCI_MSIX_ENTRY_CTRL_MASKBIT;
+}
diff --git a/hw/msix.h b/hw/msix.h
index 50aee82..901f101 100644
--- a/hw/msix.h
+++ b/hw/msix.h
@@ -35,4 +35,7 @@ int msix_set_vector_notifiers(PCIDevice *dev,
   MSIVectorUseNotifier use_notifier,
   MSIVectorReleaseNotifier release_notifier);
 void msix_unset_vector_notifiers(PCIDevice *dev);
+void msix_set_address_data(PCIDevice *dev, int vector,
+   uint64_t address, uint32_t data);
+
 #endif
-- 
1.7.10


On 14/06/12 15:45, Jan Kiszka wrote:
 On 2012-06-14 07:17, Alexey Kardashevskiy wrote:
 On 14/06/12 14:56, Alex Williamson wrote:
 On Thu, 2012-06-14 at 14:31 +1000, Alexey Kardashevskiy wrote:
 Normally QEMU expects the guest to initialize MSI/MSIX vectors.
 However on POWER the guest uses RTAS subsystem to configure MSI/MSIX and
 does not write these vectors to device's config space or MSIX BAR.

 On the other hand, msi_notify()/msix_notify() write to these vectors to
 signal the guest about an interrupt so we have to write correct vectors
 to the devices in order not to change every user of MSI/MSIX.

 The first aim is to support MSIX for virtio-pci on POWER. There is
 another patch for POWER coming which introduces a special memory region
 where MSI/MSIX vectors point to.

 Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
 ---
  hw/msi.c  |   14 ++
  hw/msi.h  |1 +
  hw/msix.c |   10 ++
  hw/msix.h |3 +++
  4 files changed, 28 insertions(+), 0 deletions(-)

 diff --git a/hw/msi.c b/hw/msi.c
 index 5d6ceb6..124878a 100644
 --- a/hw/msi.c
 +++ b/hw/msi.c
 @@ -358,3 +358,17 @@ unsigned int msi_nr_vectors_allocated(const PCIDevice 
 *dev)
  uint16_t flags = pci_get_word(dev-config + msi_flags_off(dev));
  return msi_nr_vectors(flags);
  }
 +
 +void msi_set_address_data(PCIDevice *dev, uint64_t address, uint16_t data)
 +{
 +uint16_t flags = pci_get_word(dev-config + msi_flags_off(dev));
 +bool msi64bit = flags  PCI_MSI_FLAGS_64BIT;
 +
 +if (msi64bit) {
 +pci_set_quad(dev-config + msi_address_lo_off(dev), address);
 +} else {
 +pci_set_long(dev-config + msi_address_lo_off(dev), address);
 +}
 +pci_set_word(dev-config + msi_data_off(dev, msi64bit), data);
 +}

 Why not make it msi_set_message

[PATCH] msi/msix: added public API to set/get MSI message address, and data

2012-06-21 Thread Alexey Kardashevskiy

agrhhh. sha1 of the patch changed after rebasing :)



Added (msi|msix)_(set|get)_message() functions for whoever might
want to use them.

Currently msi_notify()/msix_notify() write to these vectors to
signal the guest about an interrupt so the correct values have to
written there by the guest or QEMU.

For example, POWER guest never initializes MSI/MSIX vectors, instead
it uses RTAS hypercalls. So in order to support MSIX for virtio-pci on
POWER we have to initialize MSI/MSIX message from QEMU.

As only the set* functions are required for now, the get functions were
added or made public for symmetry.

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
---
 hw/msi.c  |   29 +
 hw/msi.h  |2 ++
 hw/msix.c |   11 ++-
 hw/msix.h |3 +++
 4 files changed, 44 insertions(+), 1 deletion(-)

diff --git a/hw/msi.c b/hw/msi.c
index 5233204..9ad84a4 100644
--- a/hw/msi.c
+++ b/hw/msi.c
@@ -105,6 +105,35 @@ static inline uint8_t msi_pending_off(const PCIDevice* dev, bool msi64bit)
     return dev->msi_cap + (msi64bit ? PCI_MSI_PENDING_64 : PCI_MSI_PENDING_32);
 }
 
+MSIMessage msi_get_message(PCIDevice *dev)
+{
+    uint16_t flags = pci_get_word(dev->config + msi_flags_off(dev));
+    bool msi64bit = flags & PCI_MSI_FLAGS_64BIT;
+    MSIMessage msg;
+
+    if (msi64bit) {
+        msg.address = pci_get_quad(dev->config + msi_address_lo_off(dev));
+    } else {
+        msg.address = pci_get_long(dev->config + msi_address_lo_off(dev));
+    }
+    msg.data = pci_get_word(dev->config + msi_data_off(dev, msi64bit));
+
+    return msg;
+}
+
+void msi_set_message(PCIDevice *dev, MSIMessage msg)
+{
+    uint16_t flags = pci_get_word(dev->config + msi_flags_off(dev));
+    bool msi64bit = flags & PCI_MSI_FLAGS_64BIT;
+
+    if (msi64bit) {
+        pci_set_quad(dev->config + msi_address_lo_off(dev), msg.address);
+    } else {
+        pci_set_long(dev->config + msi_address_lo_off(dev), msg.address);
+    }
+    pci_set_word(dev->config + msi_data_off(dev, msi64bit), msg.data);
+}
+
 bool msi_enabled(const PCIDevice *dev)
 {
     return msi_present(dev) &&
diff --git a/hw/msi.h b/hw/msi.h
index 75747ab..4b0f4f8 100644
--- a/hw/msi.h
+++ b/hw/msi.h
@@ -31,6 +31,8 @@ struct MSIMessage {
 
 extern bool msi_supported;
 
+MSIMessage msi_get_message(PCIDevice *dev);
+void msi_set_message(PCIDevice *dev, MSIMessage msg);
 bool msi_enabled(const PCIDevice *dev);
 int msi_init(struct PCIDevice *dev, uint8_t offset,
  unsigned int nr_vectors, bool msi64bit, bool msi_per_vector_mask);
diff --git a/hw/msix.c b/hw/msix.c
index ded3c55..9e8d8bb 100644
--- a/hw/msix.c
+++ b/hw/msix.c
@@ -35,7 +35,7 @@
 #define MSIX_PAGE_PENDING (MSIX_PAGE_SIZE / 2)
 #define MSIX_MAX_ENTRIES 32
 
-static MSIMessage msix_get_message(PCIDevice *dev, unsigned vector)
+MSIMessage msix_get_message(PCIDevice *dev, unsigned vector)
 {
     uint8_t *table_entry = dev->msix_table_page + vector * PCI_MSIX_ENTRY_SIZE;
     MSIMessage msg;
@@ -45,6 +45,15 @@ static MSIMessage msix_get_message(PCIDevice *dev, unsigned vector)
     return msg;
 }
 
+void msix_set_message(PCIDevice *dev, int vector, struct MSIMessage msg)
+{
+    uint8_t *table_entry = dev->msix_table_page + vector * PCI_MSIX_ENTRY_SIZE;
+
+    pci_set_quad(table_entry + PCI_MSIX_ENTRY_LOWER_ADDR, msg.address);
+    pci_set_long(table_entry + PCI_MSIX_ENTRY_DATA, msg.data);
+    table_entry[PCI_MSIX_ENTRY_VECTOR_CTRL] &= ~PCI_MSIX_ENTRY_CTRL_MASKBIT;
+}
+
+
 /* Add MSI-X capability to the config space for the device. */
 /* Given a bar and its size, add MSI-X table on top of it
  * and fill MSI-X capability in the config space.
diff --git a/hw/msix.h b/hw/msix.h
index 50aee82..3374cf8 100644
--- a/hw/msix.h
+++ b/hw/msix.h
@@ -4,6 +4,9 @@
 #include "qemu-common.h"
 #include "pci.h"
 
+MSIMessage msix_get_message(PCIDevice *dev, unsigned vector);
+void msix_set_message(PCIDevice *dev, int vector, MSIMessage msg);
+
 int msix_init(PCIDevice *pdev, unsigned short nentries,
   MemoryRegion *bar,
   unsigned bar_nr, unsigned bar_size);
-- 
1.7.10
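
A short usage sketch of the pair (not part of the patch; the function name
and new_address are placeholders):

/* read-modify-write of a device's MSI message; only the address changes */
static void retarget_msi_example(PCIDevice *dev, uint64_t new_address)
{
    MSIMessage msg = msi_get_message(dev);

    msg.address = new_address;
    msi_set_message(dev, msg);
}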



On 21/06/12 16:53, Jan Kiszka wrote:
 On 2012-06-21 08:46, Alexey Kardashevskiy wrote:

 Ok, another try. Is it any better now? :)
 
 No - posted the old version accidentally?
 
 Jan
 


 Normally QEMU expects the guest to initialize MSI/MSIX vectors.
 However on POWER the guest uses RTAS subsystem to configure MSI/MSIX and
 does not write these vectors to device's config space or MSIX BAR.

 On the other hand, msi_notify()/msix_notify() write to these vectors to
 signal the guest about an interrupt so we have to write correct vectors
 to the devices in order not to change every user of MSI/MSIX.

 The first aim is to support MSIX for virtio-pci on POWER. There is
 another patch for POWER coming which introduces a special memory region
 where MSI/MSIX vectors point to.

 Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
 ---
  hw/msi.c  |   14 ++
  hw/msi.h  |1 +
  hw/msix.c |8

Re: [PATCH] msi/msix: added public API to set/get MSI message address, and data

2012-06-21 Thread Alexey Kardashevskiy
On 21/06/12 20:38, Jan Kiszka wrote:
 On 2012-06-21 12:28, Alexey Kardashevskiy wrote:
 On 21/06/12 17:39, Jan Kiszka wrote:
 On 2012-06-21 09:18, Alexey Kardashevskiy wrote:

 agrhhh. sha1 of the patch changed after rebasing :)



 Added (msi|msix)_(set|get)_message() function for whoever might
 want to use them.

 Currently msi_notify()/msix_notify() write to these vectors to
 signal the guest about an interrupt so the correct values have to
 written there by the guest or QEMU.

 For example, POWER guest never initializes MSI/MSIX vectors, instead
 it uses RTAS hypercalls. So in order to support MSIX for virtio-pci on
 POWER we have to initialize MSI/MSIX message from QEMU.

 As only set* function are required by now, the get functions were added
 or made public for a symmetry.

 Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
 ---
  hw/msi.c  |   29 +
  hw/msi.h  |2 ++
  hw/msix.c |   11 ++-
  hw/msix.h |3 +++
  4 files changed, 44 insertions(+), 1 deletion(-)

 diff --git a/hw/msi.c b/hw/msi.c
 index 5233204..9ad84a4 100644
 --- a/hw/msi.c
 +++ b/hw/msi.c
 @@ -105,6 +105,35 @@ static inline uint8_t msi_pending_off(const 
 PCIDevice* dev, bool msi64bit)
  return dev-msi_cap + (msi64bit ? PCI_MSI_PENDING_64 : 
 PCI_MSI_PENDING_32);
  }
  
 +MSIMessage msi_get_message(PCIDevice *dev)

 MSIMessage msi_get_message(PCIDevice *dev, unsigned vector)


 Who/how/why is going to calculate the vector here?


 +{
 +uint16_t flags = pci_get_word(dev-config + msi_flags_off(dev));
 +bool msi64bit = flags  PCI_MSI_FLAGS_64BIT;
 +MSIMessage msg;
 +
 +if (msi64bit) {
 +msg.address = pci_get_quad(dev-config + msi_address_lo_off(dev));
 +} else {
 +msg.address = pci_get_long(dev-config + msi_address_lo_off(dev));
 +}
 +msg.data = pci_get_word(dev-config + msi_data_off(dev, msi64bit));

 And I have this here in addition:

 unsigned int nr_vectors = msi_nr_vectors(flags);
 ...

 if (nr_vectors > 1) {
     msg.data &= ~(nr_vectors - 1);
     msg.data |= vector;
 }

 See PCI spec and existing code.


 What for? I really do not get why someone might want to read something
 other than the real value.
 What PCI code should I look at?
 
 I'm not sure what your use case for reading the message is. For KVM
 device assignment it is preparing an alternative message delivery path
 for MSI vectors. And for this we will need vector notifier support for
 MSI as well. You can check the MSI-X code for corresponding use cases of
 msix_get_message.

 And when we already have msi_get_message, another logical use case is
 msi_notify. See msix.c again.

Aaaa.

I have no case for reading the message. All I need is writing. And I want it
public as I want to use it from hw/spapr_pci.c. You suggested to add reading;
I added get to be _symmetric_ to set (get returns what set wrote). You want a
different thing, which I can do, but it is not msi_get_message(); it is
something like msi_prepare_message(MSIMessage msg) or
msi_set_vector(uint16_t data), or simply the internal kitchen of msi_notify().

Still, I can do what you suggested, it just does not seem right.


-- 
Alexey


[PATCH] msi/msix: added API to set MSI message address and data

2012-06-21 Thread Alexey Kardashevskiy
Added (msi|msix)_set_message() functions.

Currently msi_notify()/msix_notify() write to these vectors to
signal the guest about an interrupt so the correct values have to
written there by the guest or QEMU.

For example, POWER guest never initializes MSI/MSIX vectors, instead
it uses RTAS hypercalls. So in order to support MSIX for virtio-pci on
POWER we have to initialize MSI/MSIX message from QEMU.

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
---
 hw/msi.c  |   13 +
 hw/msi.h  |1 +
 hw/msix.c |9 +
 hw/msix.h |2 ++
 4 files changed, 25 insertions(+)

diff --git a/hw/msi.c b/hw/msi.c
index 5233204..cc6102f 100644
--- a/hw/msi.c
+++ b/hw/msi.c
@@ -105,6 +105,19 @@ static inline uint8_t msi_pending_off(const PCIDevice* dev, bool msi64bit)
     return dev->msi_cap + (msi64bit ? PCI_MSI_PENDING_64 : PCI_MSI_PENDING_32);
 }
 
+void msi_set_message(PCIDevice *dev, MSIMessage msg)
+{
+    uint16_t flags = pci_get_word(dev->config + msi_flags_off(dev));
+    bool msi64bit = flags & PCI_MSI_FLAGS_64BIT;
+
+    if (msi64bit) {
+        pci_set_quad(dev->config + msi_address_lo_off(dev), msg.address);
+    } else {
+        pci_set_long(dev->config + msi_address_lo_off(dev), msg.address);
+    }
+    pci_set_word(dev->config + msi_data_off(dev, msi64bit), msg.data);
+}
+
 bool msi_enabled(const PCIDevice *dev)
 {
     return msi_present(dev) &&
diff --git a/hw/msi.h b/hw/msi.h
index 75747ab..6ec1f99 100644
--- a/hw/msi.h
+++ b/hw/msi.h
@@ -31,6 +31,7 @@ struct MSIMessage {
 
 extern bool msi_supported;
 
+void msi_set_message(PCIDevice *dev, MSIMessage msg);
 bool msi_enabled(const PCIDevice *dev);
 int msi_init(struct PCIDevice *dev, uint8_t offset,
  unsigned int nr_vectors, bool msi64bit, bool msi_per_vector_mask);
diff --git a/hw/msix.c b/hw/msix.c
index ded3c55..5f7d6d3 100644
--- a/hw/msix.c
+++ b/hw/msix.c
@@ -45,6 +45,15 @@ static MSIMessage msix_get_message(PCIDevice *dev, unsigned vector)
     return msg;
 }
 
+void msix_set_message(PCIDevice *dev, int vector, struct MSIMessage msg)
+{
+    uint8_t *table_entry = dev->msix_table_page + vector * PCI_MSIX_ENTRY_SIZE;
+
+    pci_set_quad(table_entry + PCI_MSIX_ENTRY_LOWER_ADDR, msg.address);
+    pci_set_long(table_entry + PCI_MSIX_ENTRY_DATA, msg.data);
+    table_entry[PCI_MSIX_ENTRY_VECTOR_CTRL] &= ~PCI_MSIX_ENTRY_CTRL_MASKBIT;
+}
+
 /* Add MSI-X capability to the config space for the device. */
 /* Given a bar and its size, add MSI-X table on top of it
  * and fill MSI-X capability in the config space.
diff --git a/hw/msix.h b/hw/msix.h
index 50aee82..26a437e 100644
--- a/hw/msix.h
+++ b/hw/msix.h
@@ -4,6 +4,8 @@
 #include "qemu-common.h"
 #include "pci.h"
 
+void msix_set_message(PCIDevice *dev, int vector, MSIMessage msg);
+
 int msix_init(PCIDevice *pdev, unsigned short nentries,
   MemoryRegion *bar,
   unsigned bar_nr, unsigned bar_size);
-- 
1.7.10

ps. the double '-' followed by the git version is an end-of-patch scissor, as
I read somewhere, cannot recall where exactly :)






On 21/06/12 20:56, Jan Kiszka wrote:
 On 2012-06-21 12:50, Alexey Kardashevskiy wrote:
 On 21/06/12 20:38, Jan Kiszka wrote:
 On 2012-06-21 12:28, Alexey Kardashevskiy wrote:
 On 21/06/12 17:39, Jan Kiszka wrote:
 On 2012-06-21 09:18, Alexey Kardashevskiy wrote:

 agrhhh. sha1 of the patch changed after rebasing :)



 Added (msi|msix)_(set|get)_message() function for whoever might
 want to use them.

 Currently msi_notify()/msix_notify() write to these vectors to
 signal the guest about an interrupt so the correct values have to
 written there by the guest or QEMU.

 For example, POWER guest never initializes MSI/MSIX vectors, instead
 it uses RTAS hypercalls. So in order to support MSIX for virtio-pci on
 POWER we have to initialize MSI/MSIX message from QEMU.

 As only set* function are required by now, the get functions were added
 or made public for a symmetry.

 Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
 ---
  hw/msi.c  |   29 +
  hw/msi.h  |2 ++
  hw/msix.c |   11 ++-
  hw/msix.h |3 +++
  4 files changed, 44 insertions(+), 1 deletion(-)

 diff --git a/hw/msi.c b/hw/msi.c
 index 5233204..9ad84a4 100644
 --- a/hw/msi.c
 +++ b/hw/msi.c
 @@ -105,6 +105,35 @@ static inline uint8_t msi_pending_off(const 
 PCIDevice* dev, bool msi64bit)
  return dev-msi_cap + (msi64bit ? PCI_MSI_PENDING_64 : 
 PCI_MSI_PENDING_32);
  }
  
 +MSIMessage msi_get_message(PCIDevice *dev)

 MSIMessage msi_get_message(PCIDevice *dev, unsigned vector)


 Who/how/why is going to calculate the vector here?


 +{
 +uint16_t flags = pci_get_word(dev-config + msi_flags_off(dev));
 +bool msi64bit = flags  PCI_MSI_FLAGS_64BIT;
 +MSIMessage msg;
 +
 +if (msi64bit) {
 +msg.address = pci_get_quad(dev-config + 
 msi_address_lo_off(dev));
 +} else {
 +msg.address = pci_get_long(dev-config + 
 msi_address_lo_off(dev));
 +}
 +msg.data = pci_get_word(dev-config

Re: [PATCH] msi/msix: added API to set MSI message address and data

2012-06-21 Thread Alexey Kardashevskiy
On 21/06/12 21:49, Jan Kiszka wrote:
 On 2012-06-21 13:39, Alexey Kardashevskiy wrote:
 Added (msi|msix)_set_message() functions.

 Currently msi_notify()/msix_notify() write to these vectors to
 signal the guest about an interrupt so the correct values have to
 written there by the guest or QEMU.

 For example, POWER guest never initializes MSI/MSIX vectors, instead
 it uses RTAS hypercalls. So in order to support MSIX for virtio-pci on
 POWER we have to initialize MSI/MSIX message from QEMU.

 Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
 ---
  hw/msi.c  |   13 +
  hw/msi.h  |1 +
  hw/msix.c |9 +
  hw/msix.h |2 ++
  4 files changed, 25 insertions(+)

 diff --git a/hw/msi.c b/hw/msi.c
 index 5233204..cc6102f 100644
 --- a/hw/msi.c
 +++ b/hw/msi.c
 @@ -105,6 +105,19 @@ static inline uint8_t msi_pending_off(const PCIDevice* 
 dev, bool msi64bit)
  return dev-msi_cap + (msi64bit ? PCI_MSI_PENDING_64 : 
 PCI_MSI_PENDING_32);
  }
  
 +void msi_set_message(PCIDevice *dev, MSIMessage msg)
 +{
 +uint16_t flags = pci_get_word(dev-config + msi_flags_off(dev));
 +bool msi64bit = flags  PCI_MSI_FLAGS_64BIT;
 +
 +if (msi64bit) {
 +pci_set_quad(dev-config + msi_address_lo_off(dev), msg.address);
 +} else {
 +pci_set_long(dev-config + msi_address_lo_off(dev), msg.address);
 +}
 +pci_set_word(dev-config + msi_data_off(dev, msi64bit), msg.data);
 +}
 +
  bool msi_enabled(const PCIDevice *dev)
  {
  return msi_present(dev) 
 diff --git a/hw/msi.h b/hw/msi.h
 index 75747ab..6ec1f99 100644
 --- a/hw/msi.h
 +++ b/hw/msi.h
 @@ -31,6 +31,7 @@ struct MSIMessage {
  
  extern bool msi_supported;
  
 +void msi_set_message(PCIDevice *dev, MSIMessage msg);
  bool msi_enabled(const PCIDevice *dev);
  int msi_init(struct PCIDevice *dev, uint8_t offset,
   unsigned int nr_vectors, bool msi64bit, bool 
 msi_per_vector_mask);
 diff --git a/hw/msix.c b/hw/msix.c
 index ded3c55..5f7d6d3 100644
 --- a/hw/msix.c
 +++ b/hw/msix.c
 @@ -45,6 +45,15 @@ static MSIMessage msix_get_message(PCIDevice *dev, 
 unsigned vector)
  return msg;
  }
  
 +void msix_set_message(PCIDevice *dev, int vector, struct MSIMessage msg)
 +{
 +uint8_t *table_entry = dev-msix_table_page + vector * 
 PCI_MSIX_ENTRY_SIZE;
 +
 +pci_set_quad(table_entry + PCI_MSIX_ENTRY_LOWER_ADDR, msg.address);
 +pci_set_long(table_entry + PCI_MSIX_ENTRY_DATA, msg.data);
 +table_entry[PCI_MSIX_ENTRY_VECTOR_CTRL] = ~PCI_MSIX_ENTRY_CTRL_MASKBIT;
 +}
 +
  /* Add MSI-X capability to the config space for the device. */
  /* Given a bar and its size, add MSI-X table on top of it
   * and fill MSI-X capability in the config space.
 diff --git a/hw/msix.h b/hw/msix.h
 index 50aee82..26a437e 100644
 --- a/hw/msix.h
 +++ b/hw/msix.h
 @@ -4,6 +4,8 @@
  #include qemu-common.h
  #include pci.h
  
 +void msix_set_message(PCIDevice *dev, int vector, MSIMessage msg);
 +
  int msix_init(PCIDevice *pdev, unsigned short nentries,
MemoryRegion *bar,
unsigned bar_nr, unsigned bar_size);

 
 Interface looks good as far as I can tell (can't assess the POWER need
 for clearing the mask bit on msix_set_message).


I do not know exactly how x86 works (who/how allocates addresses for
MSI/MSIX). On POWER at the moment I did the following thing in QEMU:

- registered memory_region_init_io at some big address which the guest won't
use, it is just for QEMU
- put the address from the previous step into the MSIX BAR via
msix_set_message() when MSI is being configured
- then the sequence looks like:
- vfio_msi_interrupt() calls msix_notify()
- msix_notify() checks if it is masked via msix_is_masked() - and here
PCI_MSIX_ENTRY_CTRL_MASKBIT must be unset
- stl_le_phys() - here I get a notification in my MemoryRegionOps::write()
and do qemu_irq_pulse()

2 reasons to do that:
1) I did not have to change either msix or vfio - cool for submitting patches;
2) neither the POWER guest nor qemu changes the msi or msix PCI config (it is
done by a different mechanism called RTAS), so I have to do this myself to
support 1), and I do not have to care about someone breaking my settings.


 -- 
 1.7.10

 ps. double '-' and git version is an end-of-patch scissor as I read 
 somewhere, cannot recall where exactly 
 
 Check man git-am.

Ahhh. Confused end-of-message with end-of-patch. I'll repost it.



-- 
Alexey


[PATCH] msi/msix: added API to set MSI message address and data

2012-06-21 Thread Alexey Kardashevskiy
Added (msi|msix)_set_message() functions for whoever might
want to use them.

Currently msi_notify()/msix_notify() write to these vectors to
signal the guest about an interrupt so the correct values have to
written there by the guest or QEMU.

For example, POWER guest never initializes MSI/MSIX vectors, instead
it uses RTAS hypercalls. So in order to support MSIX for virtio-pci on
POWER we have to initialize MSI/MSIX message from QEMU.

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
---
 hw/msi.c  |   13 +
 hw/msi.h  |1 +
 hw/msix.c |9 +
 hw/msix.h |2 ++
 4 files changed, 25 insertions(+)

diff --git a/hw/msi.c b/hw/msi.c
index 5233204..cc6102f 100644
--- a/hw/msi.c
+++ b/hw/msi.c
@@ -105,6 +105,19 @@ static inline uint8_t msi_pending_off(const PCIDevice* dev, bool msi64bit)
     return dev->msi_cap + (msi64bit ? PCI_MSI_PENDING_64 : PCI_MSI_PENDING_32);
 }
 
+void msi_set_message(PCIDevice *dev, MSIMessage msg)
+{
+    uint16_t flags = pci_get_word(dev->config + msi_flags_off(dev));
+    bool msi64bit = flags & PCI_MSI_FLAGS_64BIT;
+
+    if (msi64bit) {
+        pci_set_quad(dev->config + msi_address_lo_off(dev), msg.address);
+    } else {
+        pci_set_long(dev->config + msi_address_lo_off(dev), msg.address);
+    }
+    pci_set_word(dev->config + msi_data_off(dev, msi64bit), msg.data);
+}
+
 bool msi_enabled(const PCIDevice *dev)
 {
     return msi_present(dev) &&
diff --git a/hw/msi.h b/hw/msi.h
index 75747ab..6ec1f99 100644
--- a/hw/msi.h
+++ b/hw/msi.h
@@ -31,6 +31,7 @@ struct MSIMessage {
 
 extern bool msi_supported;
 
+void msi_set_message(PCIDevice *dev, MSIMessage msg);
 bool msi_enabled(const PCIDevice *dev);
 int msi_init(struct PCIDevice *dev, uint8_t offset,
  unsigned int nr_vectors, bool msi64bit, bool msi_per_vector_mask);
diff --git a/hw/msix.c b/hw/msix.c
index ded3c55..5f7d6d3 100644
--- a/hw/msix.c
+++ b/hw/msix.c
@@ -45,6 +45,15 @@ static MSIMessage msix_get_message(PCIDevice *dev, unsigned vector)
     return msg;
 }
 
+void msix_set_message(PCIDevice *dev, int vector, struct MSIMessage msg)
+{
+    uint8_t *table_entry = dev->msix_table_page + vector * PCI_MSIX_ENTRY_SIZE;
+
+    pci_set_quad(table_entry + PCI_MSIX_ENTRY_LOWER_ADDR, msg.address);
+    pci_set_long(table_entry + PCI_MSIX_ENTRY_DATA, msg.data);
+    table_entry[PCI_MSIX_ENTRY_VECTOR_CTRL] &= ~PCI_MSIX_ENTRY_CTRL_MASKBIT;
+}
+
 /* Add MSI-X capability to the config space for the device. */
 /* Given a bar and its size, add MSI-X table on top of it
  * and fill MSI-X capability in the config space.
diff --git a/hw/msix.h b/hw/msix.h
index 50aee82..26a437e 100644
--- a/hw/msix.h
+++ b/hw/msix.h
@@ -4,6 +4,8 @@
 #include "qemu-common.h"
 #include "pci.h"
 
+void msix_set_message(PCIDevice *dev, int vector, MSIMessage msg);
+
 int msix_init(PCIDevice *pdev, unsigned short nentries,
   MemoryRegion *bar,
   unsigned bar_nr, unsigned bar_size);
-- 
1.7.10


Re: [PATCH] msi/msix: added API to set MSI message address and data

2012-07-01 Thread Alexey Kardashevskiy
Ping?


On 22/06/12 11:15, Alexey Kardashevskiy wrote:
 Added (msi|msix)_set_message() function for whoever might
 want to use them.
 
 Currently msi_notify()/msix_notify() write to these vectors to
 signal the guest about an interrupt so the correct values have to
 written there by the guest or QEMU.
 
 For example, POWER guest never initializes MSI/MSIX vectors, instead
 it uses RTAS hypercalls. So in order to support MSIX for virtio-pci on
 POWER we have to initialize MSI/MSIX message from QEMU.
 
 Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
 ---
  hw/msi.c  |   13 +
  hw/msi.h  |1 +
  hw/msix.c |9 +
  hw/msix.h |2 ++
  4 files changed, 25 insertions(+)
 
 diff --git a/hw/msi.c b/hw/msi.c
 index 5233204..cc6102f 100644
 --- a/hw/msi.c
 +++ b/hw/msi.c
 @@ -105,6 +105,19 @@ static inline uint8_t msi_pending_off(const PCIDevice* 
 dev, bool msi64bit)
  return dev-msi_cap + (msi64bit ? PCI_MSI_PENDING_64 : 
 PCI_MSI_PENDING_32);
  }
  
 +void msi_set_message(PCIDevice *dev, MSIMessage msg)
 +{
 +uint16_t flags = pci_get_word(dev-config + msi_flags_off(dev));
 +bool msi64bit = flags  PCI_MSI_FLAGS_64BIT;
 +
 +if (msi64bit) {
 +pci_set_quad(dev-config + msi_address_lo_off(dev), msg.address);
 +} else {
 +pci_set_long(dev-config + msi_address_lo_off(dev), msg.address);
 +}
 +pci_set_word(dev-config + msi_data_off(dev, msi64bit), msg.data);
 +}
 +
  bool msi_enabled(const PCIDevice *dev)
  {
  return msi_present(dev) 
 diff --git a/hw/msi.h b/hw/msi.h
 index 75747ab..6ec1f99 100644
 --- a/hw/msi.h
 +++ b/hw/msi.h
 @@ -31,6 +31,7 @@ struct MSIMessage {
  
  extern bool msi_supported;
  
 +void msi_set_message(PCIDevice *dev, MSIMessage msg);
  bool msi_enabled(const PCIDevice *dev);
  int msi_init(struct PCIDevice *dev, uint8_t offset,
   unsigned int nr_vectors, bool msi64bit, bool 
 msi_per_vector_mask);
 diff --git a/hw/msix.c b/hw/msix.c
 index ded3c55..5f7d6d3 100644
 --- a/hw/msix.c
 +++ b/hw/msix.c
 @@ -45,6 +45,15 @@ static MSIMessage msix_get_message(PCIDevice *dev, 
 unsigned vector)
  return msg;
  }
  
 +void msix_set_message(PCIDevice *dev, int vector, struct MSIMessage msg)
 +{
 +uint8_t *table_entry = dev-msix_table_page + vector * 
 PCI_MSIX_ENTRY_SIZE;
 +
 +pci_set_quad(table_entry + PCI_MSIX_ENTRY_LOWER_ADDR, msg.address);
 +pci_set_long(table_entry + PCI_MSIX_ENTRY_DATA, msg.data);
 +table_entry[PCI_MSIX_ENTRY_VECTOR_CTRL] = ~PCI_MSIX_ENTRY_CTRL_MASKBIT;
 +}
 +
  /* Add MSI-X capability to the config space for the device. */
  /* Given a bar and its size, add MSI-X table on top of it
   * and fill MSI-X capability in the config space.
 diff --git a/hw/msix.h b/hw/msix.h
 index 50aee82..26a437e 100644
 --- a/hw/msix.h
 +++ b/hw/msix.h
 @@ -4,6 +4,8 @@
  #include qemu-common.h
  #include pci.h
  
 +void msix_set_message(PCIDevice *dev, int vector, MSIMessage msg);
 +
  int msix_init(PCIDevice *pdev, unsigned short nentries,
MemoryRegion *bar,
unsigned bar_nr, unsigned bar_size);


-- 
Alexey


Re: [PATCH] msi/msix: added API to set MSI message address and data

2012-07-18 Thread Alexey Kardashevskiy
On 18/07/12 22:43, Michael S. Tsirkin wrote:
 On Thu, Jun 21, 2012 at 09:39:10PM +1000, Alexey Kardashevskiy wrote:
 Added (msi|msix)_set_message() functions.

 Currently msi_notify()/msix_notify() write to these vectors to
 signal the guest about an interrupt so the correct values have to
 written there by the guest or QEMU.

 For example, POWER guest never initializes MSI/MSIX vectors, instead
 it uses RTAS hypercalls. So in order to support MSIX for virtio-pci on
 POWER we have to initialize MSI/MSIX message from QEMU.

 Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
 
 So guests do enable MSI through config space, but do
 not fill in vectors? 

Yes. msix_capability_init() calls arch_setup_msi_irqs(), which does everything
it needs to do (i.e. calls the hypervisor) before msix_capability_init() writes
PCI_MSIX_FLAGS_ENABLE to the PCI_MSIX_FLAGS register.

These vectors are PCI bus addresses, and the way they are set is specific to
the PCI host controller, so I do not see why the current scheme is a bug.
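
A hedged sketch of that guest-side ordering (paraphrasing the Linux flow, not
the actual code; msix_enable_sketch is a made-up name and the enable write is
simplified):

#include <linux/pci.h>
#include <linux/msi.h>

static int msix_enable_sketch(struct pci_dev *dev, int nvec)
{
    int pos = pci_find_capability(dev, PCI_CAP_ID_MSIX);

    /* on pseries this ends up in the ibm,change-msi RTAS call, so the
     * hypervisor has filled in the vectors before the enable bit is set */
    int rc = arch_setup_msi_irqs(dev, nvec, PCI_CAP_ID_MSIX);
    if (rc)
        return rc;

    /* only now does the guest flip the enable bit (the real code does a
     * read-modify-write of the message control word) */
    pci_write_config_word(dev, pos + PCI_MSIX_FLAGS, PCI_MSIX_FLAGS_ENABLE);
    return 0;
}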


 Very strange. Are you sure it's not
 just a guest bug? How does it work for other PCI devices?

I did not get the question. It works the same for every PCI device under a
POWER guest.


 Can't we just fix guest drivers to program the vectors properly?
 
 Also pls address the comment below.

Comment below.

 Thanks!
 
 ---
  hw/msi.c  |   13 +
  hw/msi.h  |1 +
  hw/msix.c |9 +
  hw/msix.h |2 ++
  4 files changed, 25 insertions(+)

 diff --git a/hw/msi.c b/hw/msi.c
 index 5233204..cc6102f 100644
 --- a/hw/msi.c
 +++ b/hw/msi.c
 @@ -105,6 +105,19 @@ static inline uint8_t msi_pending_off(const PCIDevice* 
 dev, bool msi64bit)
  return dev-msi_cap + (msi64bit ? PCI_MSI_PENDING_64 : 
 PCI_MSI_PENDING_32);
  }
  
 +void msi_set_message(PCIDevice *dev, MSIMessage msg)
 +{
 +uint16_t flags = pci_get_word(dev-config + msi_flags_off(dev));
 +bool msi64bit = flags  PCI_MSI_FLAGS_64BIT;
 +
 +if (msi64bit) {
 +pci_set_quad(dev-config + msi_address_lo_off(dev), msg.address);
 +} else {
 +pci_set_long(dev-config + msi_address_lo_off(dev), msg.address);
 +}
 +pci_set_word(dev-config + msi_data_off(dev, msi64bit), msg.data);
 +}
 +
 
 Please add documentation. Something like
 
 /*
  * Special API for POWER to configure the vectors through
  * a side channel. Should never be used by devices.
  */


It is useful for any para-virtualized environment, I believe, is it not?
For s390 as well, of course, if it supports PCI, which I am not sure it
does though :)



  bool msi_enabled(const PCIDevice *dev)
  {
  return msi_present(dev) 
 diff --git a/hw/msi.h b/hw/msi.h
 index 75747ab..6ec1f99 100644
 --- a/hw/msi.h
 +++ b/hw/msi.h
 @@ -31,6 +31,7 @@ struct MSIMessage {
  
  extern bool msi_supported;
  
 +void msi_set_message(PCIDevice *dev, MSIMessage msg);
  bool msi_enabled(const PCIDevice *dev);
  int msi_init(struct PCIDevice *dev, uint8_t offset,
   unsigned int nr_vectors, bool msi64bit, bool 
 msi_per_vector_mask);
 diff --git a/hw/msix.c b/hw/msix.c
 index ded3c55..5f7d6d3 100644
 --- a/hw/msix.c
 +++ b/hw/msix.c
 @@ -45,6 +45,15 @@ static MSIMessage msix_get_message(PCIDevice *dev, 
 unsigned vector)
  return msg;
  }
  
 +void msix_set_message(PCIDevice *dev, int vector, struct MSIMessage msg)
 +{
 +uint8_t *table_entry = dev->msix_table_page + vector * PCI_MSIX_ENTRY_SIZE;
 +
 +pci_set_quad(table_entry + PCI_MSIX_ENTRY_LOWER_ADDR, msg.address);
 +pci_set_long(table_entry + PCI_MSIX_ENTRY_DATA, msg.data);
 +table_entry[PCI_MSIX_ENTRY_VECTOR_CTRL] &= ~PCI_MSIX_ENTRY_CTRL_MASKBIT;
 +}
 +
  /* Add MSI-X capability to the config space for the device. */
  /* Given a bar and its size, add MSI-X table on top of it
   * and fill MSI-X capability in the config space.
 diff --git a/hw/msix.h b/hw/msix.h
 index 50aee82..26a437e 100644
 --- a/hw/msix.h
 +++ b/hw/msix.h
 @@ -4,6 +4,8 @@
  #include qemu-common.h
  #include pci.h
  
 +void msix_set_message(PCIDevice *dev, int vector, MSIMessage msg);
 +
  int msix_init(PCIDevice *pdev, unsigned short nentries,
MemoryRegion *bar,
unsigned bar_nr, unsigned bar_size);
 -- 
 1.7.10

 P.S. The double '--' and the git version at the end are an end-of-patch 
 scissors marker, as I read somewhere; I cannot recall where exactly :)






 On 21/06/12 20:56, Jan Kiszka wrote:
 On 2012-06-21 12:50, Alexey Kardashevskiy wrote:
 On 21/06/12 20:38, Jan Kiszka wrote:
 On 2012-06-21 12:28, Alexey Kardashevskiy wrote:
 On 21/06/12 17:39, Jan Kiszka wrote:
 On 2012-06-21 09:18, Alexey Kardashevskiy wrote:

 agrhhh. sha1 of the patch changed after rebasing :)



 Added (msi|msix)_(set|get)_message() functions for whoever might
 want to use them.

 Currently msi_notify()/msix_notify() write to these vectors to
 signal the guest about an interrupt, so the correct values have to be
 written there by the guest or QEMU.

 For example, POWER guest never initializes MSI/MSIX vectors, instead

Re: [PATCH] msi/msix: added API to set MSI message address and data

2012-07-18 Thread Alexey Kardashevskiy
On 19/07/12 01:23, Michael S. Tsirkin wrote:
 On Wed, Jul 18, 2012 at 11:17:12PM +1000, Alexey Kardashevskiy wrote:
 On 18/07/12 22:43, Michael S. Tsirkin wrote:
 On Thu, Jun 21, 2012 at 09:39:10PM +1000, Alexey Kardashevskiy wrote:
 Added (msi|msix)_set_message() functions.

 Currently msi_notify()/msix_notify() write to these vectors to
 signal the guest about an interrupt, so the correct values have to be
 written there by the guest or QEMU.

 For example, a POWER guest never initializes MSI/MSIX vectors; instead
 it uses RTAS hypercalls. So in order to support MSIX for virtio-pci on
 POWER we have to initialize the MSI/MSIX message from QEMU.

 Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru

 So guests do enable MSI through config space, but do
 not fill in vectors? 

 Yes. msix_capability_init() calls arch_setup_msi_irqs() which does 
 everything it needs to do (i.e. calls hypervisor) before 
 msix_capability_init() writes PCI_MSIX_FLAGS_ENABLE to the PCI_MSIX_FLAGS 
 register.

 These vectors are the PCI bus addresses, the way they are set is specific 
 for a PCI host controller, I do not see why the current scheme is a bug.
 
 It won't work with any real PCI device, will it? Real PCI devices expect
 vectors to be written into their memory.


Yes. And the hypervisor does this. On POWER (at least book3s, i.e. server 
powerpc), the whole config space kitchen is hidden behind RTAS (a kind of 
BIOS). For the guest, this RTAS is implemented in the hypervisor; for the 
host, in the system firmware. So powerpc Linux does not have to have PHB 
drivers. Kinda cool.

A usual powerpc server runs without a host Linux at all; it runs a hypervisor 
called pHyp. And every guest knows that it is a guest: there is no full 
machine emulation, it is para-virtualization. In power-kvm, we replace that 
pHyp with the host Linux, and now QEMU plays the hypervisor role. Some day we 
will move the hypervisor to the host kernel completely (?) but now it is in 
QEMU.


 Very strange. Are you sure it's not
 just a guest bug? How does it work for other PCI devices?

 Did not get the question. It works the same for every PCI device under POWER 
 guest.
 
 I mean for real PCI devices.
 
 Can't we just fix guest drivers to program the vectors properly?

 Also pls address the comment below.

 Comment below.

 Thanks!

 ---
  hw/msi.c  |   13 +
  hw/msi.h  |1 +
  hw/msix.c |9 +
  hw/msix.h |2 ++
  4 files changed, 25 insertions(+)

 diff --git a/hw/msi.c b/hw/msi.c
 index 5233204..cc6102f 100644
 --- a/hw/msi.c
 +++ b/hw/msi.c
 @@ -105,6 +105,19 @@ static inline uint8_t msi_pending_off(const 
 PCIDevice* dev, bool msi64bit)
  return dev-msi_cap + (msi64bit ? PCI_MSI_PENDING_64 : 
 PCI_MSI_PENDING_32);
  }
  
 +void msi_set_message(PCIDevice *dev, MSIMessage msg)
 +{
 +uint16_t flags = pci_get_word(dev->config + msi_flags_off(dev));
 +bool msi64bit = flags & PCI_MSI_FLAGS_64BIT;
 +
 +if (msi64bit) {
 +pci_set_quad(dev->config + msi_address_lo_off(dev), msg.address);
 +} else {
 +pci_set_long(dev->config + msi_address_lo_off(dev), msg.address);
 +}
 +pci_set_word(dev->config + msi_data_off(dev, msi64bit), msg.data);
 +}
 +

 Please add documentation. Something like

 /*
  * Special API for POWER to configure the vectors through
  * a side channel. Should never be used by devices.
  */


 It is useful for any para-virtualized environment, I believe, is it not?
 For s390 as well. Of course, only if it supports PCI, which I am not 
 sure it does though :)
 
 I expect the normal guest to program the address into MSI register using
 config accesses, same way that it enables MSI/MSIX.
 Why POWER does it differently I did not yet figure out but I hope
 this weirdness is not so widespread.


In para-virt I would expect the guest not to touch config space at all. At 
least it should use one interface rather than two, but this is how it is.


  bool msi_enabled(const PCIDevice *dev)
  {
  return msi_present(dev) &&
 diff --git a/hw/msi.h b/hw/msi.h
 index 75747ab..6ec1f99 100644
 --- a/hw/msi.h
 +++ b/hw/msi.h
 @@ -31,6 +31,7 @@ struct MSIMessage {
  
  extern bool msi_supported;
  
 +void msi_set_message(PCIDevice *dev, MSIMessage msg);
  bool msi_enabled(const PCIDevice *dev);
  int msi_init(struct PCIDevice *dev, uint8_t offset,
   unsigned int nr_vectors, bool msi64bit, bool 
 msi_per_vector_mask);
 diff --git a/hw/msix.c b/hw/msix.c
 index ded3c55..5f7d6d3 100644
 --- a/hw/msix.c
 +++ b/hw/msix.c
 @@ -45,6 +45,15 @@ static MSIMessage msix_get_message(PCIDevice *dev, 
 unsigned vector)
  return msg;
  }
  
 +void msix_set_message(PCIDevice *dev, int vector, struct MSIMessage msg)
 +{
 +uint8_t *table_entry = dev->msix_table_page + vector * PCI_MSIX_ENTRY_SIZE;
 +
 +pci_set_quad(table_entry + PCI_MSIX_ENTRY_LOWER_ADDR, msg.address);
 +pci_set_long(table_entry + PCI_MSIX_ENTRY_DATA, msg.data);
 +table_entry[PCI_MSIX_ENTRY_VECTOR_CTRL] &= ~PCI_MSIX_ENTRY_CTRL_MASKBIT;

[PATCH] powerpc-kvm: fixing page alignment for TCE

2012-09-04 Thread Alexey Kardashevskiy
From: Paul Mackerras pau...@samba.org

TODO: ask Paul to make a proper message.

This is a fix for a host kernel compiled with a page size other
than 4K (the TCE page size). In the case of a 64K page size,
the host used to lose address bits in hpte_rpn();
this patch fixes that.

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
---
 arch/powerpc/kvm/book3s_64_mmu_hv.c |9 -
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c 
b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 80a5775..a41f11b 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -503,7 +503,7 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, struct 
kvm_vcpu *vcpu,
	struct kvm *kvm = vcpu->kvm;
unsigned long *hptep, hpte[3], r;
unsigned long mmu_seq, psize, pte_size;
-   unsigned long gfn, hva, pfn;
+   unsigned long gpa, gfn, hva, pfn;
struct kvm_memory_slot *memslot;
unsigned long *rmap;
struct revmap_entry *rev;
@@ -541,15 +541,14 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, 
struct kvm_vcpu *vcpu,
 
/* Translate the logical address and get the page */
psize = hpte_page_size(hpte[0], r);
-   gfn = hpte_rpn(r, psize);
+   gpa = (r & HPTE_R_RPN & ~(psize - 1)) | (ea & (psize - 1));
+   gfn = gpa >> PAGE_SHIFT;
memslot = gfn_to_memslot(kvm, gfn);
 
/* No memslot means it's an emulated MMIO region */
-   if (!memslot || (memslot->flags & KVM_MEMSLOT_INVALID)) {
-   unsigned long gpa = (gfn << PAGE_SHIFT) | (ea & (psize - 1));
+   if (!memslot || (memslot->flags & KVM_MEMSLOT_INVALID))
return kvmppc_hv_emulate_mmio(run, vcpu, gpa, ea,
  dsisr & DSISR_ISSTORE);
-   }
 
	if (!kvm->arch.using_mmu_notifiers)
return -EFAULT; /* should never get here */
-- 
1.7.10.4



[PATCH] iommu: adding missing kvm_iommu_map_pages/kvm_iommu_unmap_pages

2013-02-10 Thread Alexey Kardashevskiy
The IOMMU API implements groups creating/deletion, device binding
and IOMMU map/unmap operations.

The POWERPC implementation uses most of the API except map/unmap
operations which are implemented on POWERPC using hypercalls.

However, in order to link a kernel with CONFIG_IOMMU_API enabled,
empty kvm_iommu_map_pages/kvm_iommu_unmap_pages have to be
defined, so this patch defines them.

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
Cc: David Gibson da...@gibson.dropbear.id.au
---
 arch/powerpc/kernel/iommu.c |   17 +
 1 file changed, 17 insertions(+)

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 31c4fdc..7c309fe 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -36,6 +36,7 @@
 #include linux/hash.h
 #include linux/fault-inject.h
 #include linux/pci.h
+#include linux/kvm_host.h
 #include asm/io.h
 #include asm/prom.h
 #include asm/iommu.h
@@ -860,3 +861,19 @@ void iommu_free_coherent(struct iommu_table *tbl, size_t 
size,
free_pages((unsigned long)vaddr, get_order(size));
}
 }
+
+#ifdef CONFIG_IOMMU_API
+/*
+ * SPAPR TCE API
+ */
+
+/* POWERPC does not use IOMMU API for mapping/unmapping */
+int kvm_iommu_map_pages(struct kvm *kvm, struct kvm_memory_slot *slot)
+{
+   return 0;
+}
+void kvm_iommu_unmap_pages(struct kvm *kvm, struct kvm_memory_slot *slot)
+{
+}
+
+#endif /* CONFIG_IOMMU_API */
-- 
1.7.10.4



Re: [PATCH 2/4] powerpc kvm: added multiple TCEs requests support

2013-02-18 Thread Alexey Kardashevskiy

On 15/02/13 14:24, Paul Mackerras wrote:

On Mon, Feb 11, 2013 at 11:12:41PM +1100, a...@ozlabs.ru wrote:


+static long emulated_h_put_tce(struct kvmppc_spapr_tce_table *stt,
+   unsigned long ioba, unsigned long tce)
+{
+   unsigned long idx = ioba >> SPAPR_TCE_SHIFT;
+   struct page *page;
+   u64 *tbl;
+
+   /* udbg_printf(H_PUT_TCE: liobn 0x%lx = stt=%p  window_size=0x%x\n, 
*/
+   /*  liobn, stt, stt-window_size); */
+   if (ioba >= stt->window_size) {
+   pr_err(%s failed on ioba=%lx\n, __func__, ioba);


Doesn't this give the guest a way to spam the host logs?  And in fact
printk in real mode is potentially problematic.  I would just leave
out this statement.


+   return H_PARAMETER;
+   }
+
+   page = stt->pages[idx / TCES_PER_PAGE];
+   tbl = (u64 *)page_address(page);


I would like to see an explanation of why we are confident that
page_address() will work correctly in real mode, across all the
combinations of config options that we can have for a ppc64 book3s
kernel.


It was there before this patch, I just moved it so I would think it has 
been explained before :)


There is no combination on PPC to get WANT_PAGE_VIRTUAL enabled.
CONFIG_HIGHMEM is supported for PPC32 only so HASHED_PAGE_VIRTUAL is not 
enabled on PPC64 either.


So this definition is supposed to work on PPC64:
#define page_address(page) lowmem_page_address(page)

where lowmem_page_address() is arithmetic operation on a page struct address:
static __always_inline void *lowmem_page_address(const struct page *page)
{
return __va(PFN_PHYS(page_to_pfn(page)));
}

PPC32 will use page_address() from mm/highmem.c; I need a lesson about the 
memory layout in 32-bit, but for now I cannot see how it could possibly fail here.





+
+   /* FIXME: Need to validate the TCE itself */
+   /* udbg_printf(tce @ %p\n, tbl[idx % TCES_PER_PAGE]); */
+   tbl[idx % TCES_PER_PAGE] = tce;
+
+   return H_SUCCESS;
+}
+
+/*
+ * Real mode handlers
   */
  long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
  unsigned long ioba, unsigned long tce)
  {
-   struct kvm *kvm = vcpu->kvm;
struct kvmppc_spapr_tce_table *stt;

-   /* udbg_printf(H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n, */
-   /*  liobn, ioba, tce); */
+   stt = find_tce_table(vcpu, liobn);
+   /* Didn't find the liobn, put it to userspace */
+   if (!stt)
+   return H_TOO_HARD;
+
+   /* Emulated IO */
+   return emulated_h_put_tce(stt, ioba, tce);
+}
+
+long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
+   unsigned long liobn, unsigned long ioba,
+   unsigned long tce_list, unsigned long npages)
+{
+   struct kvmppc_spapr_tce_table *stt;
+   long i, ret = 0;
+   unsigned long *tces;
+
+   stt = find_tce_table(vcpu, liobn);
+   /* Didn't find the liobn, put it to userspace */
+   if (!stt)
+   return H_TOO_HARD;

-   list_for_each_entry(stt, &kvm->arch.spapr_tce_tables, list) {
-   if (stt->liobn == liobn) {
-   unsigned long idx = ioba >> SPAPR_TCE_SHIFT;
-   struct page *page;
-   u64 *tbl;
+   tces = (void *) get_real_address(vcpu, tce_list, false, NULL, NULL);
+   if (!tces)
+   return H_TOO_HARD;

-   /* udbg_printf(H_PUT_TCE: liobn 0x%lx = stt=%p  
window_size=0x%x\n, */
-   /*  liobn, stt, stt-window_size); */
-   if (ioba >= stt->window_size)
-   return H_PARAMETER;
+   /* Emulated IO */
+   for (i = 0; (i < npages) && !ret; ++i, ioba += IOMMU_PAGE_SIZE)
+   ret = emulated_h_put_tce(stt, ioba, tces[i]);


So, tces is a pointer to somewhere inside a real page.  Did we check
somewhere that tces[npages-1] is in the same page as tces[0]?  If so,
I missed it.  If we didn't, then we probably should check and do
something about it.



-   page = stt->pages[idx / TCES_PER_PAGE];
-   tbl = (u64 *)page_address(page);
+   return ret;
+}

-   /* FIXME: Need to validate the TCE itself */
-   /* udbg_printf(tce @ %p\n, tbl[idx % 
TCES_PER_PAGE]); */
-   tbl[idx % TCES_PER_PAGE] = tce;
-   return H_SUCCESS;
-   }
-   }
+long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
+   unsigned long liobn, unsigned long ioba,
+   unsigned long tce_value, unsigned long npages)
+{
+   struct kvmppc_spapr_tce_table *stt;
+   long i, ret = 0;
+
+   stt = find_tce_table(vcpu, liobn);
+   /* Didn't find the liobn, put it to userspace */
+   if (!stt)
+   return H_TOO_HARD;

-   /* Didn't find the liobn, punt it to userspace */
-   return H_TOO_HARD;
+   /* Emulated 

[PATCH 2/6] KVM: PPC: Add support for multiple-TCE hcalls

2013-05-06 Thread Alexey Kardashevskiy
This adds real mode handlers for the H_PUT_TCE_INDIRECT and
H_STUFF_TCE hypercalls for QEMU emulated devices such as virtio
devices or emulated PCI.  These calls allow adding multiple entries
(up to 512) into the TCE table in one call which saves time on
transition to/from real mode.

This adds a guest physical to host real address converter
and calls the existing H_PUT_TCE handler. The converting function
is going to be fully utilized by upcoming VFIO supporting patches.

This also implements the KVM_CAP_PPC_MULTITCE capability,
so in order to support the functionality of this patch, QEMU
needs to query for this capability and set the hcall-multi-tce
hypertas property only if the capability is present, otherwise
there will be serious performance degradation.
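
For reference, the QEMU side could check the capability roughly like this
(an illustrative sketch only; kvm_fd is the /dev/kvm descriptor and the
hypertas wiring is up to the machine code):

    if (ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_PPC_MULTITCE) > 0) {
        /* advertise "hcall-multi-tce" in ibm,hypertas-functions */
    }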

Cc: David Gibson da...@gibson.dropbear.id.au
Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
Signed-off-by: Paul Mackerras pau...@samba.org
---
 Documentation/virtual/kvm/api.txt   |   15 ++
 arch/powerpc/include/asm/kvm_ppc.h  |   15 +-
 arch/powerpc/kvm/book3s_64_vio.c|  114 +++
 arch/powerpc/kvm/book3s_64_vio_hv.c |  231 +++
 arch/powerpc/kvm/book3s_hv.c|   23 +++
 arch/powerpc/kvm/book3s_hv_rmhandlers.S |6 +
 arch/powerpc/kvm/book3s_pr_papr.c   |   37 -
 arch/powerpc/kvm/powerpc.c  |3 +
 include/uapi/linux/kvm.h|1 +
 9 files changed, 413 insertions(+), 32 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt 
b/Documentation/virtual/kvm/api.txt
index a4df553..f621cd6 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -2463,3 +2463,18 @@ For mmu types KVM_MMU_FSL_BOOKE_NOHV and 
KVM_MMU_FSL_BOOKE_HV:
where num_sets is the tlb_sizes[] value divided by the tlb_ways[] value.
  - The tsize field of mas1 shall be set to 4K on TLB0, even though the
hardware ignores this value for TLB0.
+
+
+6.4 KVM_CAP_PPC_MULTITCE
+
+Architectures: ppc
+Parameters: none
+Returns: 0 on success; -1 on error
+
+This capability enables the guest to put/remove multiple TCE entries
+per hypercall which significantly accelerates DMA operations for PPC KVM
+guests.
+
+When this capability is enabled, H_PUT_TCE_INDIRECT and H_STUFF_TCE are
+expected to occur rather than H_PUT_TCE which supports only one TCE entry
+per call.
diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
b/arch/powerpc/include/asm/kvm_ppc.h
index 99da298..d501246 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -139,8 +139,19 @@ extern void kvmppc_xics_free(struct kvm *kvm);
 
 extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
struct kvm_create_spapr_tce *args);
-extern long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
-unsigned long ioba, unsigned long tce);
+extern struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(
+   struct kvm_vcpu *vcpu, unsigned long liobn);
+extern long kvmppc_emulated_h_put_tce(struct kvmppc_spapr_tce_table *stt,
+   unsigned long ioba, unsigned long tce);
+extern long kvmppc_virtmode_h_put_tce(struct kvm_vcpu *vcpu,
+   unsigned long liobn, unsigned long ioba,
+   unsigned long tce);
+extern long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
+   unsigned long liobn, unsigned long ioba,
+   unsigned long tce_list, unsigned long npages);
+extern long kvmppc_virtmode_h_stuff_tce(struct kvm_vcpu *vcpu,
+   unsigned long liobn, unsigned long ioba,
+   unsigned long tce_value, unsigned long npages);
 extern long kvm_vm_ioctl_allocate_rma(struct kvm *kvm,
struct kvm_allocate_rma *rma);
 extern struct kvmppc_linear_info *kvm_alloc_rma(void);
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index 72ffc89..643ac1e 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -14,6 +14,7 @@
  *
  * Copyright 2010 Paul Mackerras, IBM Corp. pau...@au1.ibm.com
  * Copyright 2011 David Gibson, IBM Corporation d...@au1.ibm.com
+ * Copyright 2013 Alexey Kardashevskiy, IBM Corporation a...@au1.ibm.com
  */
 
 #include linux/types.h
@@ -36,9 +37,14 @@
 #include asm/ppc-opcode.h
 #include asm/kvm_host.h
 #include asm/udbg.h
+#include asm/iommu.h
 
 #define TCES_PER_PAGE  (PAGE_SIZE / sizeof(u64))
+#define ERROR_ADDR  (~(unsigned long)0x0)
 
+/*
+ * TCE tables handlers.
+ */
 static long kvmppc_stt_npages(unsigned long window_size)
 {
	return ALIGN((window_size >> SPAPR_TCE_SHIFT)
@@ -148,3 +154,111 @@ fail:
}
return ret;
 }
+
+/*
+ * Virtual mode handling of IOMMU map/unmap.
+ */
+/* Converts guest physical address into host virtual */
+static unsigned long get_virt_address(struct kvm_vcpu *vcpu,
+   unsigned long gpa)
+{
+   unsigned long hva, gfn = gpa >> PAGE_SHIFT;

[PATCH 3/6] powerpc: Prepare to support kernel handling of IOMMU map/unmap

2013-05-06 Thread Alexey Kardashevskiy
The current VFIO-on-POWER implementation supports only user mode
driven mapping, i.e. QEMU is sending requests to map/unmap pages.
However this approach is really slow, so we want to move that to KVM.
Since H_PUT_TCE can be extremely performance sensitive (especially with
network adapters where each packet needs to be mapped/unmapped) we chose
to implement that as a fast hypercall directly in real
mode (processor still in the guest context but MMU off).

To be able to do that, we need to provide some facilities to
access the struct page count within that real mode environment as things
like the sparsemem vmemmap mappings aren't accessible.

This adds an API to increment/decrement page counter as
get_user_pages API used for user mode mapping does not work
in the real mode.

CONFIG_SPARSEMEM_VMEMMAP and CONFIG_FLATMEM are supported.
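
An illustrative sketch of the intended use from a real-mode TCE handler
(simplified, not code from this patch):

    /* take a reference on the page backing a TCE, no get_user_pages() */
    struct page *pg = realmode_pfn_to_page(pfn);

    if (!pg || realmode_get_page(pg))
        return H_TOO_HARD;      /* exit to virtual mode and retry there */

    /* ... and later, when the TCE is cleared ... */
    if (realmode_put_page(pg))
        return H_TOO_HARD;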

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
Reviewed-by: Paul Mackerras pau...@samba.org
Cc: David Gibson da...@gibson.dropbear.id.au
Signed-off-by: Paul Mackerras pau...@samba.org
---
 arch/powerpc/include/asm/pgtable-ppc64.h |4 ++
 arch/powerpc/mm/init_64.c|   77 +-
 2 files changed, 80 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h 
b/arch/powerpc/include/asm/pgtable-ppc64.h
index 0182c20..4c56ede 100644
--- a/arch/powerpc/include/asm/pgtable-ppc64.h
+++ b/arch/powerpc/include/asm/pgtable-ppc64.h
@@ -377,6 +377,10 @@ static inline pte_t *find_linux_pte_or_hugepte(pgd_t 
*pgdir, unsigned long ea,
 }
 #endif /* !CONFIG_HUGETLB_PAGE */
 
+struct page *realmode_pfn_to_page(unsigned long pfn);
+int realmode_get_page(struct page *page);
+int realmode_put_page(struct page *page);
+
 #endif /* __ASSEMBLY__ */
 
 #endif /* _ASM_POWERPC_PGTABLE_PPC64_H_ */
diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index 95a4529..838b8ae 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -297,5 +297,80 @@ int __meminit vmemmap_populate(struct page *start_page,
 
return 0;
 }
-#endif /* CONFIG_SPARSEMEM_VMEMMAP */
 
+/*
+ * We do not have access to the sparsemem vmemmap, so we fallback to
+ * walking the list of sparsemem blocks which we already maintain for
+ * the sake of crashdump. In the long run, we might want to maintain
+ * a tree if performance of that linear walk becomes a problem.
+ *
+ * Any of realmode_ functions can fail due to:
+ * 1) As real sparsemem blocks do not lie in RAM continuously (they
+ * are in virtual address space which is not available in the real mode),
+ * the requested page struct can be split between blocks so get_page/put_page
+ * may fail.
+ * 2) When huge pages are used, the get_page/put_page API will fail
+ * in real mode as the linked addresses in the page struct are virtual
+ * too.
+ * When 1) or 2) takes place, the API returns an error code to cause
+ * an exit to kernel virtual mode where the operation will be completed.
+ */
+struct page *realmode_pfn_to_page(unsigned long pfn)
+{
+   struct vmemmap_backing *vmem_back;
+   struct page *page;
+   unsigned long page_size = 1 << mmu_psize_defs[mmu_vmemmap_psize].shift;
+   unsigned long pg_va = (unsigned long) pfn_to_page(pfn);
+
+   for (vmem_back = vmemmap_list; vmem_back; vmem_back = vmem_back->list) {
+   if (pg_va < vmem_back->virt_addr)
+   continue;
+
+   /* Check that page struct is not split between real pages */
+   if ((pg_va + sizeof(struct page)) >
+   (vmem_back->virt_addr + page_size))
+   return NULL;
+
+   page = (struct page *) (vmem_back->phys + pg_va -
+   vmem_back->virt_addr);
+   return page;
+   }
+
+   return NULL;
+}
+EXPORT_SYMBOL_GPL(realmode_pfn_to_page);
+
+#elif defined(CONFIG_FLATMEM)
+
+struct page *realmode_pfn_to_page(unsigned long pfn)
+{
+   struct page *page = pfn_to_page(pfn);
+   return page;
+}
+EXPORT_SYMBOL_GPL(realmode_pfn_to_page);
+
+#endif /* CONFIG_SPARSEMEM_VMEMMAP/CONFIG_FLATMEM */
+
+#if defined(CONFIG_SPARSEMEM_VMEMMAP) || defined(CONFIG_FLATMEM)
+int realmode_get_page(struct page *page)
+{
+   if (PageTail(page))
+   return -EAGAIN;
+
+   get_page(page);
+
+   return 0;
+}
+EXPORT_SYMBOL_GPL(realmode_get_page);
+
+int realmode_put_page(struct page *page)
+{
+   if (PageCompound(page))
+   return -EAGAIN;
+
+   put_page(page);
+
+   return 0;
+}
+EXPORT_SYMBOL_GPL(realmode_put_page);
+#endif
-- 
1.7.10.4



[PATCH 6/6] KVM: PPC: Add hugepage support for IOMMU in-kernel handling

2013-05-06 Thread Alexey Kardashevskiy
This adds special support for huge pages (16MB).  The reference
counting cannot be easily done for such pages in real mode (when
MMU is off) so we added a list of huge pages.  It is populated in
virtual mode and get_page is called just once per huge page.
Real mode handlers check if the requested page is huge and in the list,
then no reference counting is done, otherwise an exit to virtual mode
happens.  The list is released at KVM exit.  At the moment the fastest
card available for tests uses up to 9 huge pages so walking through this
list is not very expensive.  However this can change and we may want
to optimize this.
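
A sketch of the intended real-mode lookup (illustrative and simplified
from the actual handlers in this series; "ra" is a placeholder):

    struct iommu_kvmppc_hugepage *hp = kvmppc_iommu_hugepage_find(tt, pte);

    if (!hp)
        return H_TOO_HARD;  /* not listed yet, virtual mode will add it */

    /* no reference counting here, the list already holds a reference */
    ra = hp->pa + (gpa & (hp->size - 1));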

Cc: David Gibson da...@gibson.dropbear.id.au
Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
Signed-off-by: Paul Mackerras pau...@samba.org
---
 arch/powerpc/include/asm/kvm_host.h |2 +
 arch/powerpc/include/asm/kvm_ppc.h  |   24 +++
 arch/powerpc/kvm/book3s_64_vio.c|   79 ++-
 arch/powerpc/kvm/book3s_64_vio_hv.c |   47 -
 4 files changed, 149 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index 2b70cbc..b6a047e 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -180,6 +180,8 @@ struct kvmppc_spapr_tce_table {
u32 window_size;
bool virtmode_only;
struct iommu_group *grp;/* used for IOMMU groups */
+   struct list_head hugepages; /* used for IOMMU groups */
+   spinlock_t hugepages_lock;  /* used for IOMMU groups */
struct page *pages[0];
 };
 
diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
b/arch/powerpc/include/asm/kvm_ppc.h
index bdfa140..3c95464 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -154,6 +154,30 @@ extern long kvmppc_virtmode_h_put_tce_indirect(struct 
kvm_vcpu *vcpu,
 extern long kvmppc_virtmode_h_stuff_tce(struct kvm_vcpu *vcpu,
unsigned long liobn, unsigned long ioba,
unsigned long tce_value, unsigned long npages);
+
+/*
+ * The KVM guest can be backed with 16MB pages (qemu switch
+ * -mem-path /var/lib/hugetlbfs/global/pagesize-16MB/).
+ * In this case, we cannot do page counting from the real mode
+ * as the compound pages are used - they are linked in a list
+ * with pointers as virtual addresses which are inaccessible
+ * in real mode.
+ *
+ * The code below keeps a 16MB pages list and uses page struct
+ * in real mode if it is already locked in RAM and inserted into
+ * the list or switches to the virtual mode where it can be
+ * handled in a usual manner.
+ */
+struct iommu_kvmppc_hugepage {
+   struct list_head list;
+   pte_t pte;  /* Huge page PTE */
+   unsigned long pa;   /* Base phys address used as a real TCE */
+   struct page *page;  /* page struct of the very first subpage */
+   unsigned long size; /* Huge page size (always 16MB at the moment) */
+};
+extern struct iommu_kvmppc_hugepage *kvmppc_iommu_hugepage_find(
+   struct kvmppc_spapr_tce_table *tt, pte_t pte);
+
 extern long kvm_vm_ioctl_allocate_rma(struct kvm *kvm,
struct kvm_allocate_rma *rma);
 extern struct kvmppc_linear_info *kvm_alloc_rma(void);
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index 98cf949..274458d 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -54,6 +54,59 @@ static bool kvmppc_tce_virt_only = false;
 module_param_named(virt_only, kvmppc_tce_virt_only, bool, S_IRUGO | S_IWUSR);
 MODULE_PARM_DESC(virt_only, Disable realmode handling of IOMMU map/unmap);
 
+#ifdef CONFIG_IOMMU_API
+/*
+ * Adds a new huge page descriptor to the list.
+ */
+static struct iommu_kvmppc_hugepage *kvmppc_iommu_hugepage_add(
+   struct kvmppc_spapr_tce_table *tt,
+   pte_t pte, unsigned long va, unsigned long pg_size)
+{
+   int ret;
+   struct iommu_kvmppc_hugepage *hp;
+   struct page *p;
+
+   va = va & ~(pg_size - 1);
+   ret = get_user_pages_fast(va, 1, true/*write*/, &p);
+   if ((ret != 1) || !p)
+   return NULL;
+
+   hp = kzalloc(sizeof(*hp), GFP_KERNEL);
+   if (!hp)
+   return NULL;
+
+   hp->page = p;
+   hp->pte = pte;
+   hp->pa = __pa((unsigned long) page_address(hp->page));
+   hp->size = pg_size;
+
+   spin_lock(&tt->hugepages_lock);
+   list_add(&hp->list, &tt->hugepages);
+   spin_unlock(&tt->hugepages_lock);
+
+   return hp;
+}
+
+static void kvmppc_iommu_hugepages_init(struct kvmppc_spapr_tce_table *tt)
+{
+   INIT_LIST_HEAD(&tt->hugepages);
+   spin_lock_init(&tt->hugepages_lock);
+}
+
+static void kvmppc_iommu_hugepages_cleanup(struct kvmppc_spapr_tce_table *tt)
+{
+   struct iommu_kvmppc_hugepage *hp, *tmp;
+
+   spin_lock(&tt->hugepages_lock);
+   list_for_each_entry_safe(hp, tmp, &tt->hugepages, list) {

[PATCH 0/6] KVM: PPC: IOMMU in-kernel handling

2013-05-06 Thread Alexey Kardashevskiy
This series is supposed to accelerate IOMMU operations in real and virtual
mode in the host kernel for the KVM guest.

The first user is VFIO; however, this series does not contain any VFIO-related
code, as the connection between VFIO and the new handlers is to be made in QEMU
via an ioctl to the KVM fd.

Although the series compiles, it does not make sense without VFIO patches which
are posted separately.

The "iommu: Add a function to find an iommu group by id" patch has already
gone to linux-next (via the iommu tree) but it is not upstream yet, so
I am including it here for reference.


Alexey Kardashevskiy (6):
  KVM: PPC: Make lookup_linux_pte public
  KVM: PPC: Add support for multiple-TCE hcalls
  powerpc: Prepare to support kernel handling of IOMMU map/unmap
  iommu: Add a function to find an iommu group by id
  KVM: PPC: Add support for IOMMU in-kernel handling
  KVM: PPC: Add hugepage support for IOMMU in-kernel handling

 Documentation/virtual/kvm/api.txt|   43 +++
 arch/powerpc/include/asm/kvm_host.h  |4 +
 arch/powerpc/include/asm/kvm_ppc.h   |   44 ++-
 arch/powerpc/include/asm/pgtable-ppc64.h |4 +
 arch/powerpc/include/uapi/asm/kvm.h  |7 +
 arch/powerpc/kvm/book3s_64_vio.c |  433 +++-
 arch/powerpc/kvm/book3s_64_vio_hv.c  |  464 --
 arch/powerpc/kvm/book3s_hv.c |   23 ++
 arch/powerpc/kvm/book3s_hv_rm_mmu.c  |5 +-
 arch/powerpc/kvm/book3s_hv_rmhandlers.S  |6 +
 arch/powerpc/kvm/book3s_pr_papr.c|   37 ++-
 arch/powerpc/kvm/powerpc.c   |   15 +
 arch/powerpc/mm/init_64.c|   77 -
 drivers/iommu/iommu.c|   29 ++
 include/linux/iommu.h|1 +
 include/uapi/linux/kvm.h |3 +
 16 files changed, 1159 insertions(+), 36 deletions(-)

-- 
1.7.10.4



[PATCH 5/6] KVM: PPC: Add support for IOMMU in-kernel handling

2013-05-06 Thread Alexey Kardashevskiy
This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
and H_STUFF_TCE requests without passing them to QEMU, which should
save time on switching to QEMU and back.

Both real and virtual modes are supported - whenever the kernel
fails to handle a TCE request, it passes it to the virtual mode.
If the virtual mode handlers fail too, then the request is passed
to the user mode, for example, to QEMU.

This adds a new KVM_CAP_SPAPR_TCE_IOMMU ioctl to associate
a virtual PCI bus ID (LIOBN) with an IOMMU group, which enables
in-kernel handling of IOMMU map/unmap.

This adds a special case for huge pages (16MB).  The reference
counting cannot be easily done for such pages in real mode (when
MMU is off) so we added a list of huge pages.  It is populated in
virtual mode and get_page is called just once per huge page.
Real mode handlers check if the requested page is huge and in the list,
then no reference counting is done, otherwise an exit to virtual mode
happens.  The list is released at KVM exit.  At the moment the fastest
card available for tests uses up to 9 huge pages so walking through this
list is not very expensive.  However this can change and we may want
to optimize this.

This also adds the virt_only parameter to the KVM module
for debug and performance check purposes.

Tests show that this patch increases transmission speed from 220MB/s
to 750..1020MB/s on a 10Gb network (Chelsio CXGB3 10Gb ethernet card).
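
The expected userspace (QEMU) usage is roughly the following illustrative
sketch (error handling omitted; vm_fd is the VM file descriptor):

    struct kvm_create_spapr_tce_iommu args = {
        .liobn = liobn,
        .iommu_id = group_id,   /* IOMMU group number, e.g. from sysfs */
        .flags = 0,
    };

    if (ioctl(vm_fd, KVM_CREATE_SPAPR_TCE_IOMMU, &args)) {
        /* no in-kernel acceleration, TCE calls keep exiting to QEMU */
    }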

Cc: David Gibson da...@gibson.dropbear.id.au
Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
Signed-off-by: Paul Mackerras pau...@samba.org
---
 Documentation/virtual/kvm/api.txt   |   28 
 arch/powerpc/include/asm/kvm_host.h |2 +
 arch/powerpc/include/asm/kvm_ppc.h  |2 +
 arch/powerpc/include/uapi/asm/kvm.h |7 +
 arch/powerpc/kvm/book3s_64_vio.c|  242 ++-
 arch/powerpc/kvm/book3s_64_vio_hv.c |  192 +++
 arch/powerpc/kvm/powerpc.c  |   12 ++
 include/uapi/linux/kvm.h|2 +
 8 files changed, 485 insertions(+), 2 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt 
b/Documentation/virtual/kvm/api.txt
index f621cd6..2039767 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -2127,6 +2127,34 @@ written, then `n_invalid' invalid entries, invalidating 
any previously
 valid entries found.
 
 
+4.79 KVM_CREATE_SPAPR_TCE_IOMMU
+
+Capability: KVM_CAP_SPAPR_TCE_IOMMU
+Architectures: powerpc
+Type: vm ioctl
+Parameters: struct kvm_create_spapr_tce_iommu (in)
+Returns: 0 on success, -1 on error
+
+This creates a link between IOMMU group and a hardware TCE (translation
+control entry) table. This link lets the host kernel know what IOMMU
+group (i.e. TCE table) to use for the LIOBN number passed with
+H_PUT_TCE, H_PUT_TCE_INDIRECT, H_STUFF_TCE hypercalls.
+
+/* for KVM_CAP_SPAPR_TCE_IOMMU */
+struct kvm_create_spapr_tce_iommu {
+   __u64 liobn;
+   __u32 iommu_id;
+   __u32 flags;
+};
+
+No flag is supported at the moment.
+
+When the guest issues TCE call on a liobn for which a TCE table has been
+registered, the kernel will handle it in real mode, updating the hardware
+TCE table. TCE table calls for other liobns will cause a vm exit and must
+be handled by userspace.
+
+
 5. The kvm_run structure
 
 
diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index 36ceb0d..2b70cbc 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -178,6 +178,8 @@ struct kvmppc_spapr_tce_table {
struct kvm *kvm;
u64 liobn;
u32 window_size;
+   bool virtmode_only;
+   struct iommu_group *grp;/* used for IOMMU groups */
struct page *pages[0];
 };
 
diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
b/arch/powerpc/include/asm/kvm_ppc.h
index d501246..bdfa140 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -139,6 +139,8 @@ extern void kvmppc_xics_free(struct kvm *kvm);
 
 extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
struct kvm_create_spapr_tce *args);
+extern long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
+   struct kvm_create_spapr_tce_iommu *args);
 extern struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(
struct kvm_vcpu *vcpu, unsigned long liobn);
 extern long kvmppc_emulated_h_put_tce(struct kvmppc_spapr_tce_table *stt,
diff --git a/arch/powerpc/include/uapi/asm/kvm.h 
b/arch/powerpc/include/uapi/asm/kvm.h
index 681b314..b67d44b 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -291,6 +291,13 @@ struct kvm_create_spapr_tce {
__u32 window_size;
 };
 
+/* for KVM_CAP_SPAPR_TCE_IOMMU */
+struct kvm_create_spapr_tce_iommu {
+   __u64 liobn;
+   __u32 iommu_id;
+   __u32 flags;
+};
+
 /* for KVM_ALLOCATE_RMA

[PATCH 4/6] iommu: Add a function to find an iommu group by id

2013-05-06 Thread Alexey Kardashevskiy
As IOMMU groups are exposed to the user space by their numbers,
the user space can use them in various kernel APIs so the kernel
might need an API to find a group by its ID.

As an example, QEMU VFIO on PPC64 platform needs it to associate
a logical bus number (LIOBN) with a specific IOMMU group in order
to support in-kernel handling of DMA map/unmap requests.

This adds the iommu_group_get_by_id(id) function which performs
this search.
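
A typical caller would look like this (illustrative; the lookup elevates
the group reference counter, so the caller must drop it when done):

    struct iommu_group *grp = iommu_group_get_by_id(group_id);

    if (!grp)
        return -ENODEV;
    /* ... use grp, e.g. iommu_group_get_iommudata(grp) ... */
    iommu_group_put(grp);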

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
Signed-off-by: Paul Mackerras pau...@samba.org
---
 drivers/iommu/iommu.c |   29 +
 include/linux/iommu.h |1 +
 2 files changed, 30 insertions(+)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index ddbdaca..5514dfa 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -204,6 +204,35 @@ again:
 }
 EXPORT_SYMBOL_GPL(iommu_group_alloc);
 
+struct iommu_group *iommu_group_get_by_id(int id)
+{
+   struct kobject *group_kobj;
+   struct iommu_group *group;
+   const char *name;
+
+   if (!iommu_group_kset)
+   return NULL;
+
+   name = kasprintf(GFP_KERNEL, "%d", id);
+   if (!name)
+   return NULL;
+
+   group_kobj = kset_find_obj(iommu_group_kset, name);
+   kfree(name);
+
+   if (!group_kobj)
+   return NULL;
+
+   group = container_of(group_kobj, struct iommu_group, kobj);
+   BUG_ON(group->id != id);
+
+   kobject_get(group->devices_kobj);
+   kobject_put(&group->kobj);
+
+   return group;
+}
+EXPORT_SYMBOL_GPL(iommu_group_get_by_id);
+
 /**
  * iommu_group_get_iommudata - retrieve iommu_data registered for a group
  * @group: the group
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index f3b99e1..00e5d7d 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -113,6 +113,7 @@ struct iommu_ops {
 extern int bus_set_iommu(struct bus_type *bus, struct iommu_ops *ops);
 extern bool iommu_present(struct bus_type *bus);
 extern struct iommu_domain *iommu_domain_alloc(struct bus_type *bus);
+extern struct iommu_group *iommu_group_get_by_id(int id);
 extern void iommu_domain_free(struct iommu_domain *domain);
 extern int iommu_attach_device(struct iommu_domain *domain,
   struct device *dev);
-- 
1.7.10.4



[PATCH 1/6] KVM: PPC: Make lookup_linux_pte public

2013-05-06 Thread Alexey Kardashevskiy
The lookup_linux_pte() function returns a linux PTE which is needed in
the process of converting KVM guest physical address into host real
address in real mode.

This conversion will be used by upcoming support of H_PUT_TCE_INDIRECT,
as the TCE list address comes from the guest and is a guest physical
address.  This makes lookup_linux_pte() public so that code can call
it.
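
For example (an illustrative sketch, not code from this patch), a real-mode
handler could translate a guest-supplied address like this:

    unsigned long psize = PAGE_SIZE;    /* expected page size */
    pte_t pte = lookup_linux_pte(pgdir, hva, 1 /* writing */, &psize);

    if (!pte_present(pte))
        return H_TOO_HARD;  /* let virtual mode or userspace handle it */

    pa = (pte_pfn(pte) << PAGE_SHIFT) | (hva & (psize - 1));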

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
Cc: David Gibson da...@gibson.dropbear.id.au
Signed-off-by: Paul Mackerras pau...@samba.org
---
 arch/powerpc/include/asm/kvm_ppc.h  |3 +++
 arch/powerpc/kvm/book3s_hv_rm_mmu.c |5 +++--
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
b/arch/powerpc/include/asm/kvm_ppc.h
index 41426c9..99da298 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -379,4 +379,7 @@ static inline ulong kvmppc_get_ea_indexed(struct kvm_vcpu 
*vcpu, int ra, int rb)
return ea;
 }
 
+pte_t lookup_linux_pte(pgd_t *pgdir, unsigned long hva,
+   int writing, unsigned long *pte_sizep);
+
 #endif /* __POWERPC_KVM_PPC_H__ */
diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c 
b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index 6dcbb49..18fc382 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -134,8 +134,8 @@ static void remove_revmap_chain(struct kvm *kvm, long 
pte_index,
unlock_rmap(rmap);
 }
 
-static pte_t lookup_linux_pte(pgd_t *pgdir, unsigned long hva,
- int writing, unsigned long *pte_sizep)
+pte_t lookup_linux_pte(pgd_t *pgdir, unsigned long hva,
+  int writing, unsigned long *pte_sizep)
 {
pte_t *ptep;
unsigned long ps = *pte_sizep;
@@ -154,6 +154,7 @@ static pte_t lookup_linux_pte(pgd_t *pgdir, unsigned long 
hva,
return __pte(0);
return kvmppc_read_update_linux_pte(ptep, writing);
 }
+EXPORT_SYMBOL_GPL(lookup_linux_pte);
 
 static inline void unlock_hpte(unsigned long *hpte, unsigned long hpte_v)
 {
-- 
1.7.10.4



Re: [PATCH 2/5] KVM: PPC: iommu: Add missing kvm_iommu_map_pages/kvm_iommu_unmap_pages

2013-05-06 Thread Alexey Kardashevskiy
On 05/07/2013 07:07 AM, Alex Williamson wrote:
 On Mon, 2013-05-06 at 17:21 +1000, a...@ozlabs.ru wrote:
 From: Alexey Kardashevskiy a...@ozlabs.ru

 The IOMMU API implements groups creating/deletion, device binding
 and IOMMU map/unmap operations.

 The PowerPC implementation uses most of the API except map/unmap
 operations, which are implemented on POWER using hypercalls.

 However, in order to link a kernel with the CONFIG_IOMMU_API enabled,
 the empty kvm_iommu_map_pages/kvm_iommu_unmap_pages have to be
 defined, so this defines them.

 Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
 Cc: David Gibson da...@gibson.dropbear.id.au
 Signed-off-by: Paul Mackerras pau...@samba.org
 ---
  arch/powerpc/include/asm/kvm_host.h |   14 ++
  1 file changed, 14 insertions(+)

 diff --git a/arch/powerpc/include/asm/kvm_host.h 
 b/arch/powerpc/include/asm/kvm_host.h
 index b6a047e..c025d91 100644
 --- a/arch/powerpc/include/asm/kvm_host.h
 +++ b/arch/powerpc/include/asm/kvm_host.h
 @@ -603,4 +603,18 @@ struct kvm_vcpu_arch {
  
  #define __KVM_HAVE_ARCH_WQP
  
 +#ifdef CONFIG_IOMMU_API
 +/* POWERPC does not use IOMMU API for mapping/unmapping */
 +static inline int kvm_iommu_map_pages(struct kvm *kvm,
 +struct kvm_memory_slot *slot)
 +{
 +return 0;
 +}
 +
 +static inline void kvm_iommu_unmap_pages(struct kvm *kvm,
 +struct kvm_memory_slot *slot)
 +{
 +}
 +#endif /* CONFIG_IOMMU_API */
 +
  #endif /* __POWERPC_KVM_HOST_H__ */
 
 This is no longer needed, Gleb applied my patch for 3.10 that make all
 of KVM device assignment dependent on a build config option and the top
 level kvm_host.h now includes this when that is not set.  Thanks,

Cannot find it, could you point me please where it is on github or
git.kernel.org? Thanks.


-- 
Alexey


Re: [PATCH 2/5] KVM: PPC: iommu: Add missing kvm_iommu_map_pages/kvm_iommu_unmap_pages

2013-05-06 Thread Alexey Kardashevskiy
On 05/07/2013 11:42 AM, Alex Williamson wrote:
 On Tue, 2013-05-07 at 10:49 +1000, Alexey Kardashevskiy wrote:
 On 05/07/2013 07:07 AM, Alex Williamson wrote:
 On Mon, 2013-05-06 at 17:21 +1000, a...@ozlabs.ru wrote:
 From: Alexey Kardashevskiy a...@ozlabs.ru

 The IOMMU API implements groups creating/deletion, device binding
 and IOMMU map/unmap operations.

 The PowerPC implementation uses most of the API except map/unmap
 operations, which are implemented on POWER using hypercalls.

 However, in order to link a kernel with the CONFIG_IOMMU_API enabled,
 the empty kvm_iommu_map_pages/kvm_iommu_unmap_pages have to be
 defined, so this defines them.

 Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
 Cc: David Gibson da...@gibson.dropbear.id.au
 Signed-off-by: Paul Mackerras pau...@samba.org
 ---
  arch/powerpc/include/asm/kvm_host.h |   14 ++
  1 file changed, 14 insertions(+)

 diff --git a/arch/powerpc/include/asm/kvm_host.h 
 b/arch/powerpc/include/asm/kvm_host.h
 index b6a047e..c025d91 100644
 --- a/arch/powerpc/include/asm/kvm_host.h
 +++ b/arch/powerpc/include/asm/kvm_host.h
 @@ -603,4 +603,18 @@ struct kvm_vcpu_arch {
  
  #define __KVM_HAVE_ARCH_WQP
  
 +#ifdef CONFIG_IOMMU_API
 +/* POWERPC does not use IOMMU API for mapping/unmapping */
 +static inline int kvm_iommu_map_pages(struct kvm *kvm,
 +  struct kvm_memory_slot *slot)
 +{
 +  return 0;
 +}
 +
 +static inline void kvm_iommu_unmap_pages(struct kvm *kvm,
 +  struct kvm_memory_slot *slot)
 +{
 +}
 +#endif /* CONFIG_IOMMU_API */
 +
  #endif /* __POWERPC_KVM_HOST_H__ */

 This is no longer needed, Gleb applied my patch for 3.10 that make all
 of KVM device assignment dependent on a build config option and the top
 level kvm_host.h now includes this when that is not set.  Thanks,

 Cannot find it, could you point me please where it is on github or
 git.kernel.org? Thanks.
 
 http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=2a5bab1004729f3302c776e53ee7c895b98bb1ce


Yes, I confirm, this is patch is not need any more. Thanks!



-- 
Alexey


Re: [PATCH 5/6] KVM: PPC: Add support for IOMMU in-kernel handling

2013-05-07 Thread Alexey Kardashevskiy
On 05/07/2013 04:02 PM, David Gibson wrote:
 On Tue, May 07, 2013 at 03:51:31PM +1000, Alexey Kardashevskiy wrote:
 On 05/07/2013 03:29 PM, David Gibson wrote:
 On Mon, May 06, 2013 at 05:25:56PM +1000, Alexey Kardashevskiy wrote:
 This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
 and H_STUFF_TCE requests without passing them to QEMU, which should
 save time on switching to QEMU and back.

 Both real and virtual modes are supported - whenever the kernel
 fails to handle TCE request, it passes it to the virtual mode.
 If the virtual mode handlers fail too, then the request is passed
 to the user mode, for example, to QEMU.

 This adds a new KVM_CAP_SPAPR_TCE_IOMMU ioctl to associate
 a virtual PCI bus ID (LIOBN) with an IOMMU group, which enables
 in-kernel handling of IOMMU map/unmap.

 This adds a special case for huge pages (16MB).  The reference
 counting cannot be easily done for such pages in real mode (when
 MMU is off) so we added a list of huge pages.  It is populated in
 virtual mode and get_page is called just once per huge page.
 Real mode handlers check if the requested page is huge and in the list,
 then no reference counting is done, otherwise an exit to virtual mode
 happens.  The list is released at KVM exit.  At the moment the fastest
 card available for tests uses up to 9 huge pages so walking through this
 list is not very expensive.  However this can change and we may want
 to optimize this.

 This also adds the virt_only parameter to the KVM module
 for debug and performance check purposes.

 Tests show that this patch increases transmission speed from 220MB/s
 to 750..1020MB/s on a 10Gb network (Chelsio CXGB3 10Gb ethernet card).

 Cc: David Gibson da...@gibson.dropbear.id.au
 Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
 Signed-off-by: Paul Mackerras pau...@samba.org
 ---
  Documentation/virtual/kvm/api.txt   |   28 
  arch/powerpc/include/asm/kvm_host.h |2 +
  arch/powerpc/include/asm/kvm_ppc.h  |2 +
  arch/powerpc/include/uapi/asm/kvm.h |7 +
  arch/powerpc/kvm/book3s_64_vio.c|  242 
 ++-
  arch/powerpc/kvm/book3s_64_vio_hv.c |  192 +++
  arch/powerpc/kvm/powerpc.c  |   12 ++
  include/uapi/linux/kvm.h|2 +
  8 files changed, 485 insertions(+), 2 deletions(-)

 diff --git a/Documentation/virtual/kvm/api.txt 
 b/Documentation/virtual/kvm/api.txt
 index f621cd6..2039767 100644
 --- a/Documentation/virtual/kvm/api.txt
 +++ b/Documentation/virtual/kvm/api.txt
 @@ -2127,6 +2127,34 @@ written, then `n_invalid' invalid entries, 
 invalidating any previously
  valid entries found.
  
  
 +4.79 KVM_CREATE_SPAPR_TCE_IOMMU
 +
 +Capability: KVM_CAP_SPAPR_TCE_IOMMU
 +Architectures: powerpc
 +Type: vm ioctl
 +Parameters: struct kvm_create_spapr_tce_iommu (in)
 +Returns: 0 on success, -1 on error
 +
 +This creates a link between IOMMU group and a hardware TCE (translation
 +control entry) table. This link lets the host kernel know what IOMMU
 +group (i.e. TCE table) to use for the LIOBN number passed with
 +H_PUT_TCE, H_PUT_TCE_INDIRECT, H_STUFF_TCE hypercalls.
 +
 +/* for KVM_CAP_SPAPR_TCE_IOMMU */
 +struct kvm_create_spapr_tce_iommu {
 +  __u64 liobn;
 +  __u32 iommu_id;

 Wouldn't it be more in keeping 


 pardon?
 
 Sorry, I was going to suggest a change, but then realised it wasn't
 actually any better than what you have now.
 
 +  __u32 flags;
 +};
 +
 +No flag is supported at the moment.
 +
 +When the guest issues TCE call on a liobn for which a TCE table has been
 +registered, the kernel will handle it in real mode, updating the hardware
 +TCE table. TCE table calls for other liobns will cause a vm exit and must
 +be handled by userspace.
 +
 +
  5. The kvm_run structure
  
  
 diff --git a/arch/powerpc/include/asm/kvm_host.h 
 b/arch/powerpc/include/asm/kvm_host.h
 index 36ceb0d..2b70cbc 100644
 --- a/arch/powerpc/include/asm/kvm_host.h
 +++ b/arch/powerpc/include/asm/kvm_host.h
 @@ -178,6 +178,8 @@ struct kvmppc_spapr_tce_table {
struct kvm *kvm;
u64 liobn;
u32 window_size;
 +  bool virtmode_only;

 I see this is now initialized from the global parameter, but I think
 it would be better to just check the global (debug) parameter
 directly, rather than duplicating it here.


 The global parameter is in kvm.ko and the struct above is in the real mode
 part which cannot go to the module.
 
 Ah, ok.  I'm half inclined to just drop the virtmode_only thing
 entirely.
 
 +  struct iommu_group *grp;/* used for IOMMU groups */
struct page *pages[0];
  };
  
 diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
 b/arch/powerpc/include/asm/kvm_ppc.h
 index d501246..bdfa140 100644
 --- a/arch/powerpc/include/asm/kvm_ppc.h
 +++ b/arch/powerpc/include/asm/kvm_ppc.h
 @@ -139,6 +139,8 @@ extern void kvmppc_xics_free(struct kvm *kvm);
  
  extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
struct

Re: [PATCH 2/6] KVM: PPC: Add support for multiple-TCE hcalls

2013-05-10 Thread Alexey Kardashevskiy
On 05/10/2013 04:51 PM, David Gibson wrote:
 On Mon, May 06, 2013 at 05:25:53PM +1000, Alexey Kardashevskiy wrote:
 This adds real mode handlers for the H_PUT_TCE_INDIRECT and
 H_STUFF_TCE hypercalls for QEMU emulated devices such as virtio
 devices or emulated PCI.  These calls allow adding multiple entries
 (up to 512) into the TCE table in one call which saves time on
 transition to/from real mode.

 This adds a guest physical to host real address converter
 and calls the existing H_PUT_TCE handler. The converting function
 is going to be fully utilized by upcoming VFIO supporting patches.

 This also implements the KVM_CAP_PPC_MULTITCE capability,
 so in order to support the functionality of this patch, QEMU
 needs to query for this capability and set the hcall-multi-tce
 hypertas property only if the capability is present, otherwise
 there will be serious performance degradation.
 
 
 Hrm.  Clearly I didn't read this carefully enough before.  There are
 some problems here.

?


 [snip]
 diff --git a/arch/powerpc/kvm/book3s_64_vio.c 
 b/arch/powerpc/kvm/book3s_64_vio.c
 index 72ffc89..643ac1e 100644
 --- a/arch/powerpc/kvm/book3s_64_vio.c
 +++ b/arch/powerpc/kvm/book3s_64_vio.c
 @@ -14,6 +14,7 @@
   *
   * Copyright 2010 Paul Mackerras, IBM Corp. pau...@au1.ibm.com
   * Copyright 2011 David Gibson, IBM Corporation d...@au1.ibm.com
 + * Copyright 2013 Alexey Kardashevskiy, IBM Corporation a...@au1.ibm.com
   */
  
  #include linux/types.h
 @@ -36,9 +37,14 @@
  #include asm/ppc-opcode.h
  #include asm/kvm_host.h
  #include asm/udbg.h
 +#include asm/iommu.h
  
  #define TCES_PER_PAGE   (PAGE_SIZE / sizeof(u64))
 +#define ERROR_ADDR  (~(unsigned long)0x0)
  
 +/*
 + * TCE tables handlers.
 + */
  static long kvmppc_stt_npages(unsigned long window_size)
  {
   return ALIGN((window_size >> SPAPR_TCE_SHIFT)
 @@ -148,3 +154,111 @@ fail:
  }
  return ret;
  }
 +
 +/*
 + * Virtual mode handling of IOMMU map/unmap.
 + */
 +/* Converts guest physical address into host virtual */
 +static unsigned long get_virt_address(struct kvm_vcpu *vcpu,
 +unsigned long gpa)
 
 This should probably return a void * rather than an unsigned long.
 Well, actually a void __user *.
 
 +{
  +unsigned long hva, gfn = gpa >> PAGE_SHIFT;
 +struct kvm_memory_slot *memslot;
 +
  +memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
 +if (!memslot)
 +return ERROR_ADDR;
 +
 +/*
 + * Convert gfn to hva preserving flags and an offset
 + * within a system page
 + */
  +hva = __gfn_to_hva_memslot(memslot, gfn) + (gpa & ~PAGE_MASK);
 +return hva;
 +}
 +
 +long kvmppc_virtmode_h_put_tce(struct kvm_vcpu *vcpu,
 +unsigned long liobn, unsigned long ioba,
 +unsigned long tce)
 +{
 +struct kvmppc_spapr_tce_table *tt;
 +
 +tt = kvmppc_find_tce_table(vcpu, liobn);
 +/* Didn't find the liobn, put it to userspace */
 +if (!tt)
 +return H_TOO_HARD;
 +
 +/* Emulated IO */
 +return kvmppc_emulated_h_put_tce(tt, ioba, tce);
 +}
 +
 +long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 +unsigned long liobn, unsigned long ioba,
 +unsigned long tce_list, unsigned long npages)
 +{
 +struct kvmppc_spapr_tce_table *tt;
 +long i;
 +unsigned long tces;
 +
 +/* The whole table addressed by tce_list resides in 4K page */
  +if (npages > 512)
 +return H_PARAMETER;
 
 So, that doesn't actually verify what the comment says it does - only
 that the list is  4K in total.  You need to check the alignment of
 tce_list as well.



The spec says to return H_PARAMETER if npages > 512. I.e. the list takes
just 1 page and I do not need to bother about whether the pages lie
continuously in RAM (which matters in real mode).

/*
 * As the spec says that the maximum possible number of TCEs is 512,
 * the whole TCE list takes no more than 4K. Therefore we do not have
 * to worry about the pages not lying continuously in RAM.
 */
Any better?...


 +
 +tt = kvmppc_find_tce_table(vcpu, liobn);
 +/* Didn't find the liobn, put it to userspace */
 +if (!tt)
 +return H_TOO_HARD;
 +
 +tces = get_virt_address(vcpu, tce_list);
 +if (tces == ERROR_ADDR)
 +return H_TOO_HARD;
 +
 +/* Emulated IO */
 
 This comment doesn't seem to have any bearing on the test which
 follows it.
 
  +if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
 +return H_PARAMETER;
 +
 +for (i = 0; i  npages; ++i) {
 +unsigned long tce;
 +unsigned long ptce = tces + i * sizeof(unsigned long);
 +
 +if (get_user(tce, (unsigned long __user *)ptce))
 +break;
 +
 +if (kvmppc_emulated_h_put_tce(tt,
  +ioba + (i << IOMMU_PAGE_SHIFT), tce))
 +break;
 +}
 +if (i == npages)
 +return H_SUCCESS;
 +
 +/* Failed, do cleanup */
 +do {
 +--i

[PATCH 1/4] KVM: PPC: Add support for multiple-TCE hcalls

2013-05-20 Thread Alexey Kardashevskiy
This adds real mode handlers for the H_PUT_TCE_INDIRECT and
H_STUFF_TCE hypercalls for QEMU emulated devices such as virtio
devices or emulated PCI.  These calls allow adding multiple entries
(up to 512) into the TCE table in one call which saves time on
transition to/from real mode.

This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs
(copied from user and verified) before writing the whole list into
the TCE table. This cache will be utilized more in the upcoming
VFIO/IOMMU support to continue TCE list processing in the virtual
mode in the case if the real mode handler failed for some reason.

This adds a guest physical to host real address converter
and calls the existing H_PUT_TCE handler. The converting function
is going to be fully utilized by upcoming VFIO supporting patches.

This also implements the KVM_CAP_PPC_MULTITCE capability,
so in order to support the functionality of this patch, QEMU
needs to query for this capability and set the hcall-multi-tce
hypertas property only if the capability is present, otherwise
there will be serious performance degradation.
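
The resulting H_PUT_TCE_INDIRECT flow is roughly the following (an
illustrative sketch of the virtual mode path; declarations omitted):

    /* pass 1: copy and validate all TCEs into the per-vcpu cache */
    for (i = 0; i < npages; ++i) {
        if (get_user(tce, tces + i) ||
                kvmppc_emulated_validate_tce(tce))
            return H_PARAMETER;
        vcpu->arch.tce_tmp[i] = tce;
    }

    /* pass 2: the table is written only once the whole list is known good */
    for (i = 0; i < npages; ++i)
        kvmppc_emulated_put_tce(tt, ioba + (i << IOMMU_PAGE_SHIFT),
                vcpu->arch.tce_tmp[i]);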

Cc: David Gibson da...@gibson.dropbear.id.au
Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
Signed-off-by: Paul Mackerras pau...@samba.org

---
Changelog:
* added kvm_vcpu_arch::tce_tmp
* removed the cleanup if put_indirect failed; instead we do not even start
writing to the TCE table if we cannot get the TCEs from the user or they
are invalid
* kvmppc_emulated_h_put_tce is split to kvmppc_emulated_put_tce
and kvmppc_emulated_validate_tce (for the previous item)
* fixed a bug with the fallthrough for H_IPI
* removed all get_user() from real mode handlers
* kvmppc_lookup_pte() added (instead of making lookup_linux_pte public)
---
 Documentation/virtual/kvm/api.txt   |   14 ++
 arch/powerpc/include/asm/kvm_host.h |2 +
 arch/powerpc/include/asm/kvm_ppc.h  |   16 +-
 arch/powerpc/kvm/book3s_64_vio.c|  118 ++
 arch/powerpc/kvm/book3s_64_vio_hv.c |  266 +++
 arch/powerpc/kvm/book3s_hv.c|   39 +
 arch/powerpc/kvm/book3s_hv_rmhandlers.S |6 +
 arch/powerpc/kvm/book3s_pr_papr.c   |   37 -
 arch/powerpc/kvm/powerpc.c  |3 +
 include/uapi/linux/kvm.h|1 +
 10 files changed, 470 insertions(+), 32 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt 
b/Documentation/virtual/kvm/api.txt
index 5f91eda..3c7c7ea 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -2780,3 +2780,17 @@ Parameters: args[0] is the XICS device fd
 args[1] is the XICS CPU number (server ID) for this vcpu
 
 This capability connects the vcpu to an in-kernel XICS device.
+
+6.8 KVM_CAP_PPC_MULTITCE
+
+Architectures: ppc
+Parameters: none
+Returns: 0 on success; -1 on error
+
+This capability enables the guest to put/remove multiple TCE entries
+per hypercall which significantly accelerates DMA operations for PPC KVM
+guests.
+
+When this capability is enabled, H_PUT_TCE_INDIRECT and H_STUFF_TCE are
+expected to occur rather than H_PUT_TCE which supports only one TCE entry
+per call.
diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index af326cd..85d8f26 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -609,6 +609,8 @@ struct kvm_vcpu_arch {
spinlock_t tbacct_lock;
u64 busy_stolen;
u64 busy_preempt;
+
+   unsigned long *tce_tmp;/* TCE cache for TCE_PUT_INDIRECT hcall */
 #endif
 };
 
diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
b/arch/powerpc/include/asm/kvm_ppc.h
index a5287fe..e852921b 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -133,8 +133,20 @@ extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
 
 extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
struct kvm_create_spapr_tce *args);
-extern long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
-unsigned long ioba, unsigned long tce);
+extern struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(
+   struct kvm_vcpu *vcpu, unsigned long liobn);
+extern long kvmppc_emulated_validate_tce(unsigned long tce);
+extern void kvmppc_emulated_put_tce(struct kvmppc_spapr_tce_table *tt,
+   unsigned long ioba, unsigned long tce);
+extern long kvmppc_virtmode_h_put_tce(struct kvm_vcpu *vcpu,
+   unsigned long liobn, unsigned long ioba,
+   unsigned long tce);
+extern long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
+   unsigned long liobn, unsigned long ioba,
+   unsigned long tce_list, unsigned long npages);
+extern long kvmppc_virtmode_h_stuff_tce(struct kvm_vcpu *vcpu,
+   unsigned long liobn, unsigned long ioba,
+   unsigned long tce_value, unsigned long npages);
 extern long kvm_vm_ioctl_allocate_rma

[PATCH 2/4] powerpc: Prepare to support kernel handling of IOMMU map/unmap

2013-05-20 Thread Alexey Kardashevskiy
The current VFIO-on-POWER implementation supports only user mode
driven mapping, i.e. QEMU is sending requests to map/unmap pages.
However this approach is really slow, so we want to move that to KVM.
Since H_PUT_TCE can be extremely performance sensitive (especially with
network adapters where each packet needs to be mapped/unmapped) we chose
to implement that as a fast hypercall directly in real
mode (processor still in the guest context but MMU off).

To be able to do that, we need to provide some facilities to
access the struct page count within that real mode environment as things
like the sparsemem vmemmap mappings aren't accessible.

This adds an API to increment/decrement the page use counter as the
get_user_pages API used for user mode mapping does not work
in real mode.

CONFIG_SPARSEMEM_VMEMMAP and CONFIG_FLATMEM are supported.
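
A rough sketch of the intended calling pattern (illustrative only, not code
from this patch; the function name is made up, H_TOO_HARD is the usual
"retry in a less restricted mode" return value):

static long realmode_try_pin(unsigned long pfn)
{
        struct page *page = realmode_pfn_to_page(pfn);

        /* NULL means the page struct is not reachable with the MMU off */
        if (!page)
                return H_TOO_HARD;

        /* Non-zero (e.g. a compound page) means: retry in virtual mode */
        if (realmode_get_page(page))
                return H_TOO_HARD;

        return H_SUCCESS;
}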

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
Reviewed-by: Paul Mackerras pau...@samba.org
Cc: David Gibson da...@gibson.dropbear.id.au
Signed-off-by: Paul Mackerras pau...@samba.org

---

Changes:
2013-05-20:
* PageTail() is replaced by PageCompound() in order to have the same checks
for whether the page is huge in realmode_get_page() and realmode_put_page()
---
 arch/powerpc/include/asm/pgtable-ppc64.h |4 ++
 arch/powerpc/mm/init_64.c|   77 +-
 2 files changed, 80 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h 
b/arch/powerpc/include/asm/pgtable-ppc64.h
index e3d55f6f..7b46e5f 100644
--- a/arch/powerpc/include/asm/pgtable-ppc64.h
+++ b/arch/powerpc/include/asm/pgtable-ppc64.h
@@ -376,6 +376,10 @@ static inline pte_t *find_linux_pte_or_hugepte(pgd_t 
*pgdir, unsigned long ea,
 }
 #endif /* !CONFIG_HUGETLB_PAGE */
 
+struct page *realmode_pfn_to_page(unsigned long pfn);
+int realmode_get_page(struct page *page);
+int realmode_put_page(struct page *page);
+
 #endif /* __ASSEMBLY__ */
 
 #endif /* _ASM_POWERPC_PGTABLE_PPC64_H_ */
diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index c2787bf..ba6cf9b 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -296,5 +296,80 @@ void vmemmap_free(unsigned long start, unsigned long end)
 {
 }
 
-#endif /* CONFIG_SPARSEMEM_VMEMMAP */
+/*
+ * We do not have access to the sparsemem vmemmap, so we fall back to
+ * walking the list of sparsemem blocks which we already maintain for
+ * the sake of crashdump. In the long run, we might want to maintain
+ * a tree if performance of that linear walk becomes a problem.
+ *
+ * Any of realmode_ functions can fail due to:
+ * 1) As real sparsemem blocks do not lie in RAM contiguously (they
+ * are in virtual address space which is not available in the real mode),
+ * the requested page struct can be split between blocks so get_page/put_page
+ * may fail.
+ * 2) When huge pages are used, the get_page/put_page API will fail
+ * in real mode as the linked addresses in the page struct are virtual
+ * too.
+ * When 1) or 2) takes place, the API returns an error code to cause
+ * an exit to kernel virtual mode where the operation will be completed.
+ */
+struct page *realmode_pfn_to_page(unsigned long pfn)
+{
+   struct vmemmap_backing *vmem_back;
+   struct page *page;
+   unsigned long page_size = 1 << mmu_psize_defs[mmu_vmemmap_psize].shift;
+   unsigned long pg_va = (unsigned long) pfn_to_page(pfn);
+
+   for (vmem_back = vmemmap_list; vmem_back; vmem_back = vmem_back->list) {
+   if (pg_va < vmem_back->virt_addr)
+   continue;
 
+   /* Check that page struct is not split between real pages */
+   if ((pg_va + sizeof(struct page)) >
+   (vmem_back->virt_addr + page_size))
+   return NULL;
+
+   page = (struct page *) (vmem_back->phys + pg_va -
+   vmem_back->virt_addr);
+   return page;
+   }
+
+   return NULL;
+}
+EXPORT_SYMBOL_GPL(realmode_pfn_to_page);
+
+#elif defined(CONFIG_FLATMEM)
+
+struct page *realmode_pfn_to_page(unsigned long pfn)
+{
+   struct page *page = pfn_to_page(pfn);
+   return page;
+}
+EXPORT_SYMBOL_GPL(realmode_pfn_to_page);
+
+#endif /* CONFIG_SPARSEMEM_VMEMMAP/CONFIG_FLATMEM */
+
+#if defined(CONFIG_SPARSEMEM_VMEMMAP) || defined(CONFIG_FLATMEM)
+int realmode_get_page(struct page *page)
+{
+   if (PageCompound(page))
+   return -EAGAIN;
+
+   get_page(page);
+
+   return 0;
+}
+EXPORT_SYMBOL_GPL(realmode_get_page);
+
+int realmode_put_page(struct page *page)
+{
+   if (PageCompound(page))
+   return -EAGAIN;
+
+   if (!atomic_add_unless(&page->_count, -1, 1))
+   return -EAGAIN;
+
+   return 0;
+}
+EXPORT_SYMBOL_GPL(realmode_put_page);
+#endif
-- 
1.7.10.4

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http

[PATCH 0/4 v2] KVM: PPC: IOMMU in-kernel handling

2013-05-20 Thread Alexey Kardashevskiy
This accelerates IOMMU operations in real and virtual
mode in the host kernel for the KVM guest.

The first patch with multitce support is useful for emulated devices as is.

The other patches are designed for VFIO although this series
does not contain any VFIO related code as the connection between
VFIO and the new handlers is to be made in QEMU
via ioctl to the KVM fd.

The series was made and tested against v3.10-rc1.


Alexey Kardashevskiy (4):
  KVM: PPC: Add support for multiple-TCE hcalls
  powerpc: Prepare to support kernel handling of IOMMU map/unmap
  KVM: PPC: Add support for IOMMU in-kernel handling
  KVM: PPC: Add hugepage support for IOMMU in-kernel handling

 Documentation/virtual/kvm/api.txt|   42 +++
 arch/powerpc/include/asm/kvm_host.h  |7 +
 arch/powerpc/include/asm/kvm_ppc.h   |   40 ++-
 arch/powerpc/include/asm/pgtable-ppc64.h |4 +
 arch/powerpc/include/uapi/asm/kvm.h  |7 +
 arch/powerpc/kvm/book3s_64_vio.c |  398 -
 arch/powerpc/kvm/book3s_64_vio_hv.c  |  471 --
 arch/powerpc/kvm/book3s_hv.c |   39 +++
 arch/powerpc/kvm/book3s_hv_rmhandlers.S  |6 +
 arch/powerpc/kvm/book3s_pr_papr.c|   37 ++-
 arch/powerpc/kvm/powerpc.c   |   15 +
 arch/powerpc/mm/init_64.c|   77 -
 include/uapi/linux/kvm.h |5 +
 13 files changed, 1120 insertions(+), 28 deletions(-)

-- 
1.7.10.4

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling

2013-05-20 Thread Alexey Kardashevskiy
This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
and H_STUFF_TCE requests without passing them to QEMU, which should
save time on switching to QEMU and back.

Both real and virtual modes are supported - whenever the kernel
fails to handle a TCE request in real mode, it passes it to the
virtual mode handlers. If the virtual mode handlers fail too, the
request is passed to user mode, for example, to QEMU.

This adds a new KVM_CAP_SPAPR_TCE_IOMMU ioctl to associate
a virtual PCI bus ID (LIOBN) with an IOMMU group, which enables
in-kernel handling of IOMMU map/unmap.

Tests show that this patch increases transmission speed from 220MB/s
to 750..1020MB/s on 10Gb network (Chelsio CXGB3 10Gb ethernet card).
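
From user space the new ioctl would be used roughly like this (a hedged
sketch; vmfd, liobn and group_id are placeholders for values QEMU already
knows, and <sys/ioctl.h> plus the uapi kvm.h header are assumed):

        struct kvm_create_spapr_tce_iommu args = {
                .liobn = liobn,         /* LIOBN of the virtual PCI bus */
                .iommu_id = group_id,   /* IOMMU group backing that bus */
                .flags = 0,             /* no flags defined yet */
        };

        if (ioctl(vmfd, KVM_CREATE_SPAPR_TCE_IOMMU, &args) < 0)
                perror("KVM_CREATE_SPAPR_TCE_IOMMU");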

Cc: David Gibson da...@gibson.dropbear.id.au
Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
Signed-off-by: Paul Mackerras pau...@samba.org

---

Changes:
2013-05-20:
* removed get_user() from real mode handlers
* kvm_vcpu_arch::tce_tmp usage extended. Now real mode handler puts there
translated TCEs, tries realmode_get_page() on those and if it fails, it
passes control over the virtual mode handler which tries to finish
the request handling
* kvmppc_lookup_pte() now does realmode_get_page() protected by BUSY bit
on a page
* The only reason to pass the request to user mode now is when the user mode
did not register TCE table in the kernel, in all other cases the virtual mode
handler is expected to do the job
---
 Documentation/virtual/kvm/api.txt   |   28 +
 arch/powerpc/include/asm/kvm_host.h |3 +
 arch/powerpc/include/asm/kvm_ppc.h  |2 +
 arch/powerpc/include/uapi/asm/kvm.h |7 ++
 arch/powerpc/kvm/book3s_64_vio.c|  198 ++-
 arch/powerpc/kvm/book3s_64_vio_hv.c |  193 +-
 arch/powerpc/kvm/powerpc.c  |   12 +++
 include/uapi/linux/kvm.h|4 +
 8 files changed, 441 insertions(+), 6 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt 
b/Documentation/virtual/kvm/api.txt
index 3c7c7ea..3c8e9fe 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -2362,6 +2362,34 @@ calls by the guest for that service will be passed to 
userspace to be
 handled.
 
 
+4.79 KVM_CREATE_SPAPR_TCE_IOMMU
+
+Capability: KVM_CAP_SPAPR_TCE_IOMMU
+Architectures: powerpc
+Type: vm ioctl
+Parameters: struct kvm_create_spapr_tce_iommu (in)
+Returns: 0 on success, -1 on error
+
+This creates a link between IOMMU group and a hardware TCE (translation
+control entry) table. This link lets the host kernel know what IOMMU
+group (i.e. TCE table) to use for the LIOBN number passed with
+H_PUT_TCE, H_PUT_TCE_INDIRECT, H_STUFF_TCE hypercalls.
+
+/* for KVM_CAP_SPAPR_TCE_IOMMU */
+struct kvm_create_spapr_tce_iommu {
+   __u64 liobn;
+   __u32 iommu_id;
+   __u32 flags;
+};
+
+No flag is supported at the moment.
+
+When the guest issues TCE call on a liobn for which a TCE table has been
+registered, the kernel will handle it in real mode, updating the hardware
+TCE table. TCE table calls for other liobns will cause a vm exit and must
+be handled by userspace.
+
+
 5. The kvm_run structure
 
 
diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index 85d8f26..ac0e2fe 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -180,6 +180,7 @@ struct kvmppc_spapr_tce_table {
struct kvm *kvm;
u64 liobn;
u32 window_size;
+   struct iommu_group *grp;/* used for IOMMU groups */
struct page *pages[0];
 };
 
@@ -611,6 +612,8 @@ struct kvm_vcpu_arch {
u64 busy_preempt;
 
unsigned long *tce_tmp;/* TCE cache for TCE_PUT_INDIRECT hcall */
+   unsigned long tce_tmp_num; /* Number of handled TCEs in the cache */
+   unsigned long tce_reason;  /* The reason of switching to the virtmode */
 #endif
 };
 
diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
b/arch/powerpc/include/asm/kvm_ppc.h
index e852921b..934e01d 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -133,6 +133,8 @@ extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
 
 extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
struct kvm_create_spapr_tce *args);
+extern long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
+   struct kvm_create_spapr_tce_iommu *args);
 extern struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(
struct kvm_vcpu *vcpu, unsigned long liobn);
 extern long kvmppc_emulated_validate_tce(unsigned long tce);
diff --git a/arch/powerpc/include/uapi/asm/kvm.h 
b/arch/powerpc/include/uapi/asm/kvm.h
index 0fb1a6e..cf82af4 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -319,6 +319,13 @@ struct kvm_create_spapr_tce {
__u32 window_size;
 };
 
+/* for KVM_CAP_SPAPR_TCE_IOMMU

[PATCH 4/4] KVM: PPC: Add hugepage support for IOMMU in-kernel handling

2013-05-20 Thread Alexey Kardashevskiy
This adds special support for huge pages (16MB).  The reference
counting cannot be easily done for such pages in real mode (when
MMU is off) so we added a list of huge pages.  It is populated in
virtual mode and get_page is called just once per huge page.
Real mode handlers check if the requested page is huge and in the list,
then no reference counting is done, otherwise an exit to virtual mode
happens.  The list is released at KVM exit.  At the moment the fastest
card available for tests uses up to 9 huge pages so walking through this
list is not very expensive.  However this can change and we may want
to optimize this.
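
The real mode lookup described above amounts to something like this sketch
(the helper name is made up; the fields match the kvmppc_iommu_hugepage
struct added by this patch):

static struct kvmppc_iommu_hugepage *find_hugepage(
                struct kvmppc_spapr_tce_table *tt, unsigned long gpa)
{
        struct kvmppc_iommu_hugepage *hp;

        list_for_each_entry(hp, &tt->hugepages, list) {
                /* Hit: the page is already pinned, usable in real mode */
                if ((gpa >= hp->gpa) && (gpa < hp->gpa + hp->size))
                        return hp;
        }

        /* Miss: exit to virtual mode to pin the page and extend the list */
        return NULL;
}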

Cc: David Gibson da...@gibson.dropbear.id.au
Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
Signed-off-by: Paul Mackerras pau...@samba.org

---

Changes:
* the real mode handler now searches for a huge page by gpa (used to be pte)
* the virtual mode handler prints warning if it is called twice for the same
huge page as the real mode handler is expected to fail just once - when a huge
page is not in the list yet.
* the huge page is refcounted twice - when added to the hugepage list and
when used in the virtual mode hcall handler (can be optimized but it will
make the patch less nice).
---
 arch/powerpc/include/asm/kvm_host.h |2 +
 arch/powerpc/include/asm/kvm_ppc.h  |   22 +
 arch/powerpc/kvm/book3s_64_vio.c|   88 +--
 arch/powerpc/kvm/book3s_64_vio_hv.c |   40 ++--
 4 files changed, 146 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index ac0e2fe..4fc0865 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -181,6 +181,8 @@ struct kvmppc_spapr_tce_table {
u64 liobn;
u32 window_size;
struct iommu_group *grp;/* used for IOMMU groups */
+   struct list_head hugepages; /* used for IOMMU groups */
+   spinlock_t hugepages_lock;  /* used for IOMMU groups */
struct page *pages[0];
 };
 
diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
b/arch/powerpc/include/asm/kvm_ppc.h
index 934e01d..9054df0 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -149,6 +149,28 @@ extern long kvmppc_virtmode_h_put_tce_indirect(struct 
kvm_vcpu *vcpu,
 extern long kvmppc_virtmode_h_stuff_tce(struct kvm_vcpu *vcpu,
unsigned long liobn, unsigned long ioba,
unsigned long tce_value, unsigned long npages);
+
+/*
+ * The KVM guest can be backed with 16MB pages (qemu switch
+ * -mem-path /var/lib/hugetlbfs/global/pagesize-16MB/).
+ * In this case, we cannot do page counting from the real mode
+ * as the compound pages are used - they are linked in a list
+ * with pointers as virtual addresses which are inaccessible
+ * in real mode.
+ *
+ * The code below keeps a 16MB pages list and uses page struct
+ * in real mode if it is already locked in RAM and inserted into
+ * the list or switches to the virtual mode where it can be
+ * handled in a usual manner.
+ */
+struct kvmppc_iommu_hugepage {
+   struct list_head list;
+   pte_t pte;  /* Huge page PTE */
+   unsigned long gpa;  /* Guest physical address */
+   struct page *page;  /* page struct of the very first subpage */
+   unsigned long size; /* Huge page size (always 16MB at the moment) */
+};
+
 extern long kvm_vm_ioctl_allocate_rma(struct kvm *kvm,
struct kvm_allocate_rma *rma);
 extern struct kvmppc_linear_info *kvm_alloc_rma(void);
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index ffb4698..c34d63a 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -45,6 +45,71 @@
 #define TCES_PER_PAGE  (PAGE_SIZE / sizeof(u64))
 #define ERROR_ADDR  ((void *)~(unsigned long)0x0)
 
+#ifdef CONFIG_IOMMU_API
+/* Adds a new huge page descriptor to the list  */
+static long kvmppc_iommu_hugepage_try_add(
+   struct kvmppc_spapr_tce_table *tt,
+   pte_t pte, unsigned long hva, unsigned long gpa,
+   unsigned long pg_size)
+{
+   long ret = 0;
+   struct kvmppc_iommu_hugepage *hp;
+   struct page *p;
+
+   spin_lock(&tt->hugepages_lock);
+   list_for_each_entry(hp, &tt->hugepages, list) {
+   if (hp->pte == pte)
+   goto unlock_exit;
+   }
+
+   hva = hva & ~(pg_size - 1);
+   ret = get_user_pages_fast(hva, 1, true/*write*/, &p);
+   if ((ret != 1) || !p) {
+   ret = -EFAULT;
+   goto unlock_exit;
+   }
+   ret = 0;
+
+   hp = kzalloc(sizeof(*hp), GFP_KERNEL);
+   if (!hp) {
+   ret = -ENOMEM;
+   goto unlock_exit;
+   }
+
+   hp->page = p;
+   hp->pte = pte;
+   hp->gpa = gpa & ~(pg_size - 1);
+   hp->size = pg_size;
+
+   list_add(&hp->list, &tt->hugepages);
+
+unlock_exit

Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling

2013-05-26 Thread Alexey Kardashevskiy
On 05/25/2013 12:45 PM, David Gibson wrote:
 On Wed, May 22, 2013 at 04:06:57PM -0500, Scott Wood wrote:
 On 05/20/2013 10:06:46 PM, Alexey Kardashevskiy wrote:
 diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
 index 8465c2a..da6bf61 100644
 --- a/arch/powerpc/kvm/powerpc.c
 +++ b/arch/powerpc/kvm/powerpc.c
 @@ -396,6 +396,7 @@ int kvm_dev_ioctl_check_extension(long ext)
 break;
 #endif
 case KVM_CAP_SPAPR_MULTITCE:
 +   case KVM_CAP_SPAPR_TCE_IOMMU:
 r = 1;
 break;
 default:

 Don't advertise SPAPR capabilities if it's not book3s -- and
 probably there's some additional limitation that would be
 appropriate.
 
 So, in the case of MULTITCE, that's not quite right.  PR KVM can
 emulate a PAPR system on a BookE machine, and there's no reason not to
 allow TCE acceleration as well.  We can't make it dependent on PAPR
 mode being selected, because that's enabled per-vcpu, whereas these
 capabilities are queried on the VM before the vcpus are created.
 
 CAP_SPAPR_TCE_IOMMU should be dependent on the presence of suitable
 host side hardware (i.e. a PAPR style IOMMU), though.


The capability says that the ioctl is supported. If there is no IOMMU group
registered, then it will fail with a reasonable error and nobody gets hurt.
What is the problem?



 @@ -1025,6 +1026,17 @@ long kvm_arch_vm_ioctl(struct file *filp,
 r = kvm_vm_ioctl_create_spapr_tce(kvm, &create_tce);
 goto out;
 }
 +   case KVM_CREATE_SPAPR_TCE_IOMMU: {
 +   struct kvm_create_spapr_tce_iommu create_tce_iommu;
 +   struct kvm *kvm = filp->private_data;
 +
 +   r = -EFAULT;
 +   if (copy_from_user(&create_tce_iommu, argp,
 +   sizeof(create_tce_iommu)))
 +   goto out;
 +   r = kvm_vm_ioctl_create_spapr_tce_iommu(kvm,
 &create_tce_iommu);
 +   goto out;
 +   }
 #endif /* CONFIG_PPC_BOOK3S_64 */

 #ifdef CONFIG_KVM_BOOK3S_64_HV
 diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
 index 5a2afda..450c82a 100644
 --- a/include/uapi/linux/kvm.h
 +++ b/include/uapi/linux/kvm.h
 @@ -667,6 +667,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_PPC_RTAS 91
 #define KVM_CAP_IRQ_XICS 92
 #define KVM_CAP_SPAPR_MULTITCE (0x11 + 89)
 +#define KVM_CAP_SPAPR_TCE_IOMMU (0x11 + 90)

 Hmm...
 
 Ah, yeah, that needs to be fixed.  Those were interim numbers so that
 we didn't have to keep changing our internal trees as new upstream
 ioctls got added to the list.  We need to get a proper number for the
 merge, though.

 @@ -939,6 +940,9 @@ struct kvm_s390_ucas_mapping {
 #define KVM_GET_DEVICE_ATTR   _IOW(KVMIO,  0xe2, struct
 kvm_device_attr)
 #define KVM_HAS_DEVICE_ATTR   _IOW(KVMIO,  0xe3, struct
 kvm_device_attr)

 +/* ioctl for SPAPR TCE IOMMU */
 +#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO,  0xe4, struct
 kvm_create_spapr_tce_iommu)

 Shouldn't this go under the vm ioctl section?


The KVM_CREATE_SPAPR_TCE_IOMMU ioctl (the version for emulated devices) is
in this section so I decided to keep them together. Wrong?


-- 
Alexey
--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling

2013-05-27 Thread Alexey Kardashevskiy
On 05/27/2013 08:23 PM, Paolo Bonzini wrote:
 Il 25/05/2013 04:45, David Gibson ha scritto:
 +  case KVM_CREATE_SPAPR_TCE_IOMMU: {
 +  struct kvm_create_spapr_tce_iommu create_tce_iommu;
 +  struct kvm *kvm = filp->private_data;
 +
 +  r = -EFAULT;
 +  if (copy_from_user(&create_tce_iommu, argp,
 +  sizeof(create_tce_iommu)))
 +  goto out;
 +  r = kvm_vm_ioctl_create_spapr_tce_iommu(kvm,
 &create_tce_iommu);
 +  goto out;
 +  }
 
 Would it make sense to make this the only interface for creating TCEs?
 That is, pass both a window_size and an IOMMU group id (or e.g. -1 for
 no hardware IOMMU usage), and have a single ioctl for both cases?
 There's some duplicated code between kvm_vm_ioctl_create_spapr_tce and
 kvm_vm_ioctl_create_spapr_tce_iommu.

Just a few bits. Is there really much sense in making one function out of
those two? I tried, it looked a bit messy.

 KVM_CREATE_SPAPR_TCE could stay for backwards-compatibility, or you
 could just use a new capability and drop the old ioctl.

The old capability+ioctl have existed for quite a while and a few QEMU
versions supporting them were released, so we do not want to just drop them.
So then what is the benefit of a new interface which supports both types?

  I'm not sure
 whether you're already considering the ABI to be stable for kvmppc.

Is any bit of KVM using it? Cannot see from Documentation/ABI.


-- 
Alexey
--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling

2013-05-28 Thread Alexey Kardashevskiy
On 05/29/2013 03:45 AM, Scott Wood wrote:
 On 05/26/2013 09:44:24 PM, Alexey Kardashevskiy wrote:
 On 05/25/2013 12:45 PM, David Gibson wrote:
  On Wed, May 22, 2013 at 04:06:57PM -0500, Scott Wood wrote:
  On 05/20/2013 10:06:46 PM, Alexey Kardashevskiy wrote:
  diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
  index 8465c2a..da6bf61 100644
  --- a/arch/powerpc/kvm/powerpc.c
  +++ b/arch/powerpc/kvm/powerpc.c
  @@ -396,6 +396,7 @@ int kvm_dev_ioctl_check_extension(long ext)
  break;
  #endif
  case KVM_CAP_SPAPR_MULTITCE:
  +case KVM_CAP_SPAPR_TCE_IOMMU:
  r = 1;
  break;
  default:
 
  Don't advertise SPAPR capabilities if it's not book3s -- and
  probably there's some additional limitation that would be
  appropriate.
 
  So, in the case of MULTITCE, that's not quite right.  PR KVM can
  emulate a PAPR system on a BookE machine, and there's no reason not to
  allow TCE acceleration as well.  We can't make it dependent on PAPR
  mode being selected, because that's enabled per-vcpu, whereas these
  capabilities are queried on the VM before the vcpus are created.
 
  CAP_SPAPR_TCE_IOMMU should be dependent on the presence of suitable
  host side hardware (i.e. a PAPR style IOMMU), though.


 The capability says that the ioctl is supported. If there is no IOMMU group
 registered, then it will fail with a reasonable error and nobody gets hurt.
 What is the problem?
 
 You could say that about a lot of the capabilities that just advertise the
 existence of new ioctls. :-)
 
 Sometimes it's nice to know in advance whether it's supported, before
 actually requesting that something happen.

Yes, that would be nice. There is just no quick way to know if this real system
supports IOMMU groups. I could add another helper to generic IOMMU code
which would return the number of registered IOMMU groups but it is a bit
too much :)
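
(A coarser but cheap alternative sketch, under the assumption that "is there
any IOMMU behind the PCI bus at all" is close enough; iommu_present() is the
existing generic helper from linux/iommu.h and pci_bus_type comes from
linux/pci.h:

        return iommu_present(&pci_bus_type);

This still would not say whether any groups are actually registered, though.)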


  @@ -939,6 +940,9 @@ struct kvm_s390_ucas_mapping {
  #define KVM_GET_DEVICE_ATTR  _IOW(KVMIO,  0xe2, struct
  kvm_device_attr)
  #define KVM_HAS_DEVICE_ATTR  _IOW(KVMIO,  0xe3, struct
  kvm_device_attr)
 
  +/* ioctl for SPAPR TCE IOMMU */
  +#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO,  0xe4, struct
  kvm_create_spapr_tce_iommu)
 
  Shouldn't this go under the vm ioctl section?


 The KVM_CREATE_SPAPR_TCE_IOMMU ioctl (the version for emulated devices) is
 in this section so I decided to keep them together. Wrong?
 
 You decided to keep KVM_CREATE_SPAPR_TCE_IOMMU together with
 KVM_CREATE_SPAPR_TCE_IOMMU?

Yes.


-- 
Alexey
--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling

2013-05-28 Thread Alexey Kardashevskiy
On 05/29/2013 09:35 AM, Scott Wood wrote:
 On 05/28/2013 06:30:40 PM, Alexey Kardashevskiy wrote:
   @@ -939,6 +940,9 @@ struct kvm_s390_ucas_mapping {
   #define KVM_GET_DEVICE_ATTR  _IOW(KVMIO,  0xe2, struct
   kvm_device_attr)
   #define KVM_HAS_DEVICE_ATTR  _IOW(KVMIO,  0xe3, struct
   kvm_device_attr)
  
   +/* ioctl for SPAPR TCE IOMMU */
   +#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO,  0xe4, struct
   kvm_create_spapr_tce_iommu)
  
   Shouldn't this go under the vm ioctl section?
 
 
  The KVM_CREATE_SPAPR_TCE_IOMMU ioctl (the version for emulated
 devices) is
  in this section so I decided to keep them together. Wrong?
 
  You decided to keep KVM_CREATE_SPAPR_TCE_IOMMU together with
  KVM_CREATE_SPAPR_TCE_IOMMU?

 Yes.
 
 Sigh.  That's the same thing repeated.  There's only one IOCTL.  Nothing is
 being kept together.

Sorry, I meant this ioctl - KVM_CREATE_SPAPR_TCE.


-- 
Alexey
--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling

2013-05-28 Thread Alexey Kardashevskiy
On 05/29/2013 02:32 AM, Scott Wood wrote:
 On 05/24/2013 09:45:24 PM, David Gibson wrote:
 On Wed, May 22, 2013 at 04:06:57PM -0500, Scott Wood wrote:
  On 05/20/2013 10:06:46 PM, Alexey Kardashevskiy wrote:
  diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
  index 8465c2a..da6bf61 100644
  --- a/arch/powerpc/kvm/powerpc.c
  +++ b/arch/powerpc/kvm/powerpc.c
  @@ -396,6 +396,7 @@ int kvm_dev_ioctl_check_extension(long ext)
   break;
   #endif
   case KVM_CAP_SPAPR_MULTITCE:
  +case KVM_CAP_SPAPR_TCE_IOMMU:
   r = 1;
   break;
   default:
 
  Don't advertise SPAPR capabilities if it's not book3s -- and
  probably there's some additional limitation that would be
  appropriate.

 So, in the case of MULTITCE, that's not quite right.  PR KVM can
 emulate a PAPR system on a BookE machine, and there's no reason not to
 allow TCE acceleration as well.
 
 That might (or might not; consider things like Altivec versus SPE opcode
 conflict, whether unimplemented SPRs trap, behavior of unprivileged
 SPRs/instructions, etc) be true in theory, but it's not currently a
 supported configuration.  BookE KVM does not support emulating a different
 CPU than the host.  In the unlikely case that ever changes to the point of
 allowing PAPR guests on a BookE host, then we can revisit how to properly
 determine whether the capability is supported, but for now the capability
 will never be valid in the CONFIG_BOOKE case (though I'd rather see it
 depend on an appropriate book3s symbol than depend on !BOOKE).
 
 Or we could just leave it as is, and let it indicate whether the host
 kernel supports the feature in general, with the user needing to understand
 when it's applicable...  I'm a bit confused by the documentation, however
 -- the MULTITCE capability was documented in the capabilities that can be
 enabled section, but I don't see where it can be enabled.


True, it cannot be enabled (though it could have been long ago); it is
either supported or not. I'll fix the documentation. Thanks!


-- 
Alexey
--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling

2013-05-29 Thread Alexey Kardashevskiy
On 05/30/2013 06:05 AM, Scott Wood wrote:
 On 05/28/2013 07:12:32 PM, Alexey Kardashevskiy wrote:
 On 05/29/2013 09:35 AM, Scott Wood wrote:
  On 05/28/2013 06:30:40 PM, Alexey Kardashevskiy wrote:
@@ -939,6 +940,9 @@ struct kvm_s390_ucas_mapping {
#define KVM_GET_DEVICE_ATTR  _IOW(KVMIO,  0xe2, struct
kvm_device_attr)
#define KVM_HAS_DEVICE_ATTR  _IOW(KVMIO,  0xe3, struct
kvm_device_attr)
   
+/* ioctl for SPAPR TCE IOMMU */
+#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO,  0xe4, struct
kvm_create_spapr_tce_iommu)
   
Shouldn't this go under the vm ioctl section?
  
  
   The KVM_CREATE_SPAPR_TCE_IOMMU ioctl (the version for emulated
  devices) is
   in this section so I decided to keep them together. Wrong?
  
   You decided to keep KVM_CREATE_SPAPR_TCE_IOMMU together with
   KVM_CREATE_SPAPR_TCE_IOMMU?
 
  Yes.
 
  Sigh.  That's the same thing repeated.  There's only one IOCTL. 
 Nothing is
  being kept together.

 Sorry, I meant this ioctl - KVM_CREATE_SPAPR_TCE.
 
 But you didn't put it in the same section as KVM_CREATE_SPAPR_TCE.  0xe0
 begins a different section.

It is not really obvious that there are sections as no comment defines
those :) But yes, makes sense to move it up a bit and change the code to 0xad.



-- 
Alexey
--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling

2013-05-29 Thread Alexey Kardashevskiy
On 05/30/2013 09:14 AM, Scott Wood wrote:
 On 05/29/2013 06:10:33 PM, Alexey Kardashevskiy wrote:
 On 05/30/2013 06:05 AM, Scott Wood wrote:
  On 05/28/2013 07:12:32 PM, Alexey Kardashevskiy wrote:
  On 05/29/2013 09:35 AM, Scott Wood wrote:
   On 05/28/2013 06:30:40 PM, Alexey Kardashevskiy wrote:
 @@ -939,6 +940,9 @@ struct kvm_s390_ucas_mapping {
 #define KVM_GET_DEVICE_ATTR  _IOW(KVMIO,  0xe2, struct
 kvm_device_attr)
 #define KVM_HAS_DEVICE_ATTR  _IOW(KVMIO,  0xe3, struct
 kvm_device_attr)

 +/* ioctl for SPAPR TCE IOMMU */
 +#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO,  0xe4, struct
 kvm_create_spapr_tce_iommu)

 Shouldn't this go under the vm ioctl section?
   
   
The KVM_CREATE_SPAPR_TCE_IOMMU ioctl (the version for emulated
   devices) is
in this section so I decided to keep them together. Wrong?
   
You decided to keep KVM_CREATE_SPAPR_TCE_IOMMU together with
KVM_CREATE_SPAPR_TCE_IOMMU?
  
   Yes.
  
   Sigh.  That's the same thing repeated.  There's only one IOCTL.
  Nothing is
   being kept together.
 
  Sorry, I meant this ioctl - KVM_CREATE_SPAPR_TCE.
 
  But you didn't put it in the same section as KVM_CREATE_SPAPR_TCE.  0xe0
  begins a different section.

 It is not really obvious that there are sections as no comment defines
 those :)
 
 There is a comment /* ioctls for fds returned by KVM_CREATE_DEVICE */
 
 Putting KVM_CREATE_DEVICE in there was mainly to avoid dealing with the
 ioctl number conflict mess in the vm-ioctl section, but at least that one
 is related to the device control API. :-)
 
 But yes, makes sense to move it up a bit and change the code to 0xad.
 
 0xad is KVM_KVMCLOCK_CTRL

That's it. I am _completely_ confused now. No system whatsoever :(
What rule should I use in order to choose the number for my new ioctl? :)



-- 
Alexey
--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/4] powerpc: Prepare to support kernel handling of IOMMU map/unmap

2013-06-05 Thread Alexey Kardashevskiy
The current VFIO-on-POWER implementation supports only user mode
driven mapping, i.e. QEMU is sending requests to map/unmap pages.
However this approach is really slow, so we want to move that to KVM.
Since H_PUT_TCE can be extremely performance sensitive (especially with
network adapters where each packet needs to be mapped/unmapped) we chose
to implement that as a fast hypercall directly in real
mode (processor still in the guest context but MMU off).

To be able to do that, we need to provide some facilities to
access the struct page count within that real mode environment as things
like the sparsemem vmemmap mappings aren't accessible.

This adds an API to increment/decrement the page use counter as the
get_user_pages API used for user mode mapping does not work
in real mode.

CONFIG_SPARSEMEM_VMEMMAP and CONFIG_FLATMEM are supported.

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
Reviewed-by: Paul Mackerras pau...@samba.org
Cc: David Gibson da...@gibson.dropbear.id.au
Signed-off-by: Paul Mackerras pau...@samba.org

---

Changes:
2013-05-20:
* PageTail() is replaced by PageCompound() in order to have the same checks
for whether the page is huge in realmode_get_page() and realmode_put_page()
---
 arch/powerpc/include/asm/pgtable-ppc64.h |4 ++
 arch/powerpc/mm/init_64.c|   77 +-
 2 files changed, 80 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h 
b/arch/powerpc/include/asm/pgtable-ppc64.h
index e3d55f6f..7b46e5f 100644
--- a/arch/powerpc/include/asm/pgtable-ppc64.h
+++ b/arch/powerpc/include/asm/pgtable-ppc64.h
@@ -376,6 +376,10 @@ static inline pte_t *find_linux_pte_or_hugepte(pgd_t 
*pgdir, unsigned long ea,
 }
 #endif /* !CONFIG_HUGETLB_PAGE */
 
+struct page *realmode_pfn_to_page(unsigned long pfn);
+int realmode_get_page(struct page *page);
+int realmode_put_page(struct page *page);
+
 #endif /* __ASSEMBLY__ */
 
 #endif /* _ASM_POWERPC_PGTABLE_PPC64_H_ */
diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index a90b9c4..ce3d8d4 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -297,5 +297,80 @@ void vmemmap_free(unsigned long start, unsigned long end)
 {
 }
 
-#endif /* CONFIG_SPARSEMEM_VMEMMAP */
+/*
+ * We do not have access to the sparsemem vmemmap, so we fall back to
+ * walking the list of sparsemem blocks which we already maintain for
+ * the sake of crashdump. In the long run, we might want to maintain
+ * a tree if performance of that linear walk becomes a problem.
+ *
+ * Any of realmode_ functions can fail due to:
+ * 1) As real sparsemem blocks do not lie in RAM contiguously (they
+ * are in virtual address space which is not available in the real mode),
+ * the requested page struct can be split between blocks so get_page/put_page
+ * may fail.
+ * 2) When huge pages are used, the get_page/put_page API will fail
+ * in real mode as the linked addresses in the page struct are virtual
+ * too.
+ * When 1) or 2) takes place, the API returns an error code to cause
+ * an exit to kernel virtual mode where the operation will be completed.
+ */
+struct page *realmode_pfn_to_page(unsigned long pfn)
+{
+   struct vmemmap_backing *vmem_back;
+   struct page *page;
+   unsigned long page_size = 1 << mmu_psize_defs[mmu_vmemmap_psize].shift;
+   unsigned long pg_va = (unsigned long) pfn_to_page(pfn);
+
+   for (vmem_back = vmemmap_list; vmem_back; vmem_back = vmem_back->list) {
+   if (pg_va < vmem_back->virt_addr)
+   continue;
 
+   /* Check that page struct is not split between real pages */
+   if ((pg_va + sizeof(struct page)) >
+   (vmem_back->virt_addr + page_size))
+   return NULL;
+
+   page = (struct page *) (vmem_back->phys + pg_va -
+   vmem_back->virt_addr);
+   return page;
+   }
+
+   return NULL;
+}
+EXPORT_SYMBOL_GPL(realmode_pfn_to_page);
+
+#elif defined(CONFIG_FLATMEM)
+
+struct page *realmode_pfn_to_page(unsigned long pfn)
+{
+   struct page *page = pfn_to_page(pfn);
+   return page;
+}
+EXPORT_SYMBOL_GPL(realmode_pfn_to_page);
+
+#endif /* CONFIG_SPARSEMEM_VMEMMAP/CONFIG_FLATMEM */
+
+#if defined(CONFIG_SPARSEMEM_VMEMMAP) || defined(CONFIG_FLATMEM)
+int realmode_get_page(struct page *page)
+{
+   if (PageCompound(page))
+   return -EAGAIN;
+
+   get_page(page);
+
+   return 0;
+}
+EXPORT_SYMBOL_GPL(realmode_get_page);
+
+int realmode_put_page(struct page *page)
+{
+   if (PageCompound(page))
+   return -EAGAIN;
+
+   if (!atomic_add_unless(&page->_count, -1, 1))
+   return -EAGAIN;
+
+   return 0;
+}
+EXPORT_SYMBOL_GPL(realmode_put_page);
+#endif
-- 
1.7.10.4

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http

[PATCH 1/4] KVM: PPC: Add support for multiple-TCE hcalls

2013-06-05 Thread Alexey Kardashevskiy
This adds real mode handlers for the H_PUT_TCE_INDIRECT and
H_STUFF_TCE hypercalls for QEMU emulated devices such as IBMVIO
devices or emulated PCI.  These calls allow adding multiple entries
(up to 512) into the TCE table in one call which saves time on
transition to/from real mode.

This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs
(copied from user and verified) before writing the whole list into
the TCE table. This cache will be utilized more in the upcoming
VFIO/IOMMU support to continue TCE list processing in the virtual
mode in the case if the real mode handler failed for some reason.

This adds a guest physical to host real address converter
and calls the existing H_PUT_TCE handler. The converting function
is going to be fully utilized by upcoming VFIO supporting patches.

This also implements the KVM_CAP_PPC_MULTITCE capability,
so in order to support the functionality of this patch, QEMU
needs to query for this capability and set the hcall-multi-tce
hypertas property only if the capability is present, otherwise
there will be serious performance degradation.

Cc: David Gibson da...@gibson.dropbear.id.au
Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
Signed-off-by: Paul Mackerras pau...@samba.org

---
Changelog:
2013/06/05:
* fixed typo about IBMVIO in the commit message
* updated doc and moved it to another section
* changed capability number

2013/05/21:
* added kvm_vcpu_arch::tce_tmp
* removed cleanup if put_indirect failed, instead we do not even start
writing to TCE table if we cannot get TCEs from the user and they are
invalid
* kvmppc_emulated_h_put_tce is split to kvmppc_emulated_put_tce
and kvmppc_emulated_validate_tce (for the previous item)
* fixed bug with fallthrough for H_IPI
* removed all get_user() from real mode handlers
* kvmppc_lookup_pte() added (instead of making lookup_linux_pte public)
---
 Documentation/virtual/kvm/api.txt   |   17 ++
 arch/powerpc/include/asm/kvm_host.h |2 +
 arch/powerpc/include/asm/kvm_ppc.h  |   16 +-
 arch/powerpc/kvm/book3s_64_vio.c|  118 ++
 arch/powerpc/kvm/book3s_64_vio_hv.c |  266 +++
 arch/powerpc/kvm/book3s_hv.c|   39 +
 arch/powerpc/kvm/book3s_hv_rmhandlers.S |6 +
 arch/powerpc/kvm/book3s_pr_papr.c   |   37 -
 arch/powerpc/kvm/powerpc.c  |3 +
 include/uapi/linux/kvm.h|1 +
 10 files changed, 473 insertions(+), 32 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt 
b/Documentation/virtual/kvm/api.txt
index 5f91eda..6c082ff 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -2362,6 +2362,23 @@ calls by the guest for that service will be passed to 
userspace to be
 handled.
 
 
+4.83 KVM_CAP_PPC_MULTITCE
+
+Capability: KVM_CAP_PPC_MULTITCE
+Architectures: ppc
+Type: vm
+
+This capability tells the guest that multiple TCE entry add/remove hypercalls
+handling is supported by the kernel. This significantly accelerates DMA
+operations for PPC KVM guests.
+
+Unlike other capabilities in this section, this one does not have an ioctl.
+Instead, when the capability is present, the H_PUT_TCE_INDIRECT and
+H_STUFF_TCE hypercalls are to be handled in the host kernel and not passed to
+the guest. Otherwise it might be better for the guest to continue using
H_PUT_TCE
+hypercall (if KVM_CAP_SPAPR_TCE or KVM_CAP_SPAPR_TCE_IOMMU are present).
+
+
 5. The kvm_run structure
 
 
diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index af326cd..85d8f26 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -609,6 +609,8 @@ struct kvm_vcpu_arch {
spinlock_t tbacct_lock;
u64 busy_stolen;
u64 busy_preempt;
+
+   unsigned long *tce_tmp;/* TCE cache for TCE_PUT_INDIRECT hcall */
 #endif
 };
 
diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
b/arch/powerpc/include/asm/kvm_ppc.h
index a5287fe..e852921b 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -133,8 +133,20 @@ extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
 
 extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
struct kvm_create_spapr_tce *args);
-extern long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
-unsigned long ioba, unsigned long tce);
+extern struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(
+   struct kvm_vcpu *vcpu, unsigned long liobn);
+extern long kvmppc_emulated_validate_tce(unsigned long tce);
+extern void kvmppc_emulated_put_tce(struct kvmppc_spapr_tce_table *tt,
+   unsigned long ioba, unsigned long tce);
+extern long kvmppc_virtmode_h_put_tce(struct kvm_vcpu *vcpu,
+   unsigned long liobn, unsigned long ioba,
+   unsigned long tce);
+extern long kvmppc_virtmode_h_put_tce_indirect(struct

[PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling

2013-06-05 Thread Alexey Kardashevskiy
This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
and H_STUFF_TCE requests without passing them to QEMU, which should
save time on switching to QEMU and back.

Both real and virtual modes are supported - whenever the kernel
fails to handle a TCE request in real mode, it passes it to the
virtual mode handlers. If the virtual mode handlers fail too, the
request is passed to user mode, for example, to QEMU.

This adds a new KVM_CAP_SPAPR_TCE_IOMMU ioctl to associate
a virtual PCI bus ID (LIOBN) with an IOMMU group, which enables
in-kernel handling of IOMMU map/unmap.

Tests show that this patch increases transmission speed from 220MB/s
to 750..1020MB/s on 10Gb network (Chelsio CXGB3 10Gb ethernet card).

Cc: David Gibson da...@gibson.dropbear.id.au
Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
Signed-off-by: Paul Mackerras pau...@samba.org

---

Changes:
2013/06/05:
* changed capability number
* changed ioctl number
* update the doc article number

2013/05/20:
* removed get_user() from real mode handlers
* kvm_vcpu_arch::tce_tmp usage extended. Now real mode handler puts there
translated TCEs, tries realmode_get_page() on those and if it fails, it
passes control over the virtual mode handler which tries to finish
the request handling
* kvmppc_lookup_pte() now does realmode_get_page() protected by BUSY bit
on a page
* The only reason to pass the request to user mode now is when the user mode
did not register TCE table in the kernel, in all other cases the virtual mode
handler is expected to do the job
---
 Documentation/virtual/kvm/api.txt   |   28 +
 arch/powerpc/include/asm/kvm_host.h |3 +
 arch/powerpc/include/asm/kvm_ppc.h  |2 +
 arch/powerpc/include/uapi/asm/kvm.h |7 ++
 arch/powerpc/kvm/book3s_64_vio.c|  198 ++-
 arch/powerpc/kvm/book3s_64_vio_hv.c |  193 +-
 arch/powerpc/kvm/powerpc.c  |   12 +++
 include/uapi/linux/kvm.h|2 +
 8 files changed, 439 insertions(+), 6 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt 
b/Documentation/virtual/kvm/api.txt
index 6c082ff..e962e3b 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -2379,6 +2379,34 @@ the guest. Otherwise it might be better for the guest
to continue using H_PUT_T
 hypercall (if KVM_CAP_SPAPR_TCE or KVM_CAP_SPAPR_TCE_IOMMU are present).
 
 
+4.84 KVM_CREATE_SPAPR_TCE_IOMMU
+
+Capability: KVM_CAP_SPAPR_TCE_IOMMU
+Architectures: powerpc
+Type: vm ioctl
+Parameters: struct kvm_create_spapr_tce_iommu (in)
+Returns: 0 on success, -1 on error
+
+This creates a link between IOMMU group and a hardware TCE (translation
+control entry) table. This link lets the host kernel know what IOMMU
+group (i.e. TCE table) to use for the LIOBN number passed with
+H_PUT_TCE, H_PUT_TCE_INDIRECT, H_STUFF_TCE hypercalls.
+
+/* for KVM_CAP_SPAPR_TCE_IOMMU */
+struct kvm_create_spapr_tce_iommu {
+   __u64 liobn;
+   __u32 iommu_id;
+   __u32 flags;
+};
+
+No flag is supported at the moment.
+
+When the guest issues TCE call on a liobn for which a TCE table has been
+registered, the kernel will handle it in real mode, updating the hardware
+TCE table. TCE table calls for other liobns will cause a vm exit and must
+be handled by userspace.
+
+
 5. The kvm_run structure
 
 
diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index 85d8f26..ac0e2fe 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -180,6 +180,7 @@ struct kvmppc_spapr_tce_table {
struct kvm *kvm;
u64 liobn;
u32 window_size;
+   struct iommu_group *grp;/* used for IOMMU groups */
struct page *pages[0];
 };
 
@@ -611,6 +612,8 @@ struct kvm_vcpu_arch {
u64 busy_preempt;
 
unsigned long *tce_tmp;/* TCE cache for TCE_PUT_INDIRECT hcall */
+   unsigned long tce_tmp_num; /* Number of handled TCEs in the cache */
+   unsigned long tce_reason;  /* The reason of switching to the virtmode */
 #endif
 };
 
diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
b/arch/powerpc/include/asm/kvm_ppc.h
index e852921b..934e01d 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -133,6 +133,8 @@ extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
 
 extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
struct kvm_create_spapr_tce *args);
+extern long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
+   struct kvm_create_spapr_tce_iommu *args);
 extern struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(
struct kvm_vcpu *vcpu, unsigned long liobn);
 extern long kvmppc_emulated_validate_tce(unsigned long tce);
diff --git a/arch/powerpc/include/uapi/asm/kvm.h 
b/arch/powerpc/include/uapi/asm/kvm.h
index 0fb1a6e..cf82af4 100644
--- a/arch/powerpc/include/uapi/asm

[PATCH 4/4] KVM: PPC: Add hugepage support for IOMMU in-kernel handling

2013-06-05 Thread Alexey Kardashevskiy
This adds special support for huge pages (16MB).  The reference
counting cannot be easily done for such pages in real mode (when
MMU is off) so we added a list of huge pages.  It is populated in
virtual mode and get_page is called just once per huge page.
Real mode handlers check if the requested page is huge and in the list,
then no reference counting is done, otherwise an exit to virtual mode
happens.  The list is released at KVM exit.  At the moment the fastest
card available for tests uses up to 9 huge pages so walking through this
list is not very expensive.  However this can change and we may want
to optimize this.

Cc: David Gibson da...@gibson.dropbear.id.au
Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
Signed-off-by: Paul Mackerras pau...@samba.org

---

Changes:
2013/06/05:
* fixed compile error when CONFIG_IOMMU_API=n

2013/05/20:
* the real mode handler now searches for a huge page by gpa (used to be pte)
* the virtual mode handler prints warning if it is called twice for the same
huge page as the real mode handler is expected to fail just once - when a huge
page is not in the list yet.
* the huge page is refcounted twice - when added to the hugepage list and
when used in the virtual mode hcall handler (can be optimized but it will
make the patch less nice).
---
 arch/powerpc/include/asm/kvm_host.h |2 +
 arch/powerpc/include/asm/kvm_ppc.h  |   22 +
 arch/powerpc/kvm/book3s_64_vio.c|   88 +--
 arch/powerpc/kvm/book3s_64_vio_hv.c |   40 ++--
 4 files changed, 146 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index ac0e2fe..4fc0865 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -181,6 +181,8 @@ struct kvmppc_spapr_tce_table {
u64 liobn;
u32 window_size;
struct iommu_group *grp;/* used for IOMMU groups */
+   struct list_head hugepages; /* used for IOMMU groups */
+   spinlock_t hugepages_lock;  /* used for IOMMU groups */
struct page *pages[0];
 };
 
diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
b/arch/powerpc/include/asm/kvm_ppc.h
index 934e01d..9054df0 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -149,6 +149,28 @@ extern long kvmppc_virtmode_h_put_tce_indirect(struct 
kvm_vcpu *vcpu,
 extern long kvmppc_virtmode_h_stuff_tce(struct kvm_vcpu *vcpu,
unsigned long liobn, unsigned long ioba,
unsigned long tce_value, unsigned long npages);
+
+/*
+ * The KVM guest can be backed with 16MB pages (qemu switch
+ * -mem-path /var/lib/hugetlbfs/global/pagesize-16MB/).
+ * In this case, we cannot do page counting from the real mode
+ * as the compound pages are used - they are linked in a list
+ * with pointers as virtual addresses which are inaccessible
+ * in real mode.
+ *
+ * The code below keeps a 16MB pages list and uses page struct
+ * in real mode if it is already locked in RAM and inserted into
+ * the list or switches to the virtual mode where it can be
+ * handled in a usual manner.
+ */
+struct kvmppc_iommu_hugepage {
+   struct list_head list;
+   pte_t pte;  /* Huge page PTE */
+   unsigned long gpa;  /* Guest physical address */
+   struct page *page;  /* page struct of the very first subpage */
+   unsigned long size; /* Huge page size (always 16MB at the moment) */
+};
+
 extern long kvm_vm_ioctl_allocate_rma(struct kvm *kvm,
struct kvm_allocate_rma *rma);
 extern struct kvmppc_linear_info *kvm_alloc_rma(void);
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index ffb4698..9e2ba4d 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -45,6 +45,71 @@
 #define TCES_PER_PAGE  (PAGE_SIZE / sizeof(u64))
 #define ERROR_ADDR  ((void *)~(unsigned long)0x0)
 
+#ifdef CONFIG_IOMMU_API
+/* Adds a new huge page descriptor to the list  */
+static long kvmppc_iommu_hugepage_try_add(
+   struct kvmppc_spapr_tce_table *tt,
+   pte_t pte, unsigned long hva, unsigned long gpa,
+   unsigned long pg_size)
+{
+   long ret = 0;
+   struct kvmppc_iommu_hugepage *hp;
+   struct page *p;
+
+   spin_lock(&tt->hugepages_lock);
+   list_for_each_entry(hp, &tt->hugepages, list) {
+   if (hp->pte == pte)
+   goto unlock_exit;
+   }
+
+   hva = hva & ~(pg_size - 1);
+   ret = get_user_pages_fast(hva, 1, true/*write*/, &p);
+   if ((ret != 1) || !p) {
+   ret = -EFAULT;
+   goto unlock_exit;
+   }
+   ret = 0;
+
+   hp = kzalloc(sizeof(*hp), GFP_KERNEL);
+   if (!hp) {
+   ret = -ENOMEM;
+   goto unlock_exit;
+   }
+
+   hp->page = p;
+   hp->pte = pte;
+   hp->gpa = gpa & ~(pg_size - 1);
+   hp

Re: [PATCH 1/4] KVM: PPC: Add support for multiple-TCE hcalls

2013-06-17 Thread Alexey Kardashevskiy
On 06/17/2013 08:06 AM, Alexander Graf wrote:
 
 On 05.06.2013, at 08:11, Alexey Kardashevskiy wrote:
 
 This adds real mode handlers for the H_PUT_TCE_INDIRECT and
 H_STUFF_TCE hypercalls for QEMU emulated devices such as IBMVIO
 devices or emulated PCI.  These calls allow adding multiple entries
 (up to 512) into the TCE table in one call which saves time on
 transition to/from real mode.

 This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs
 (copied from user and verified) before writing the whole list into
 the TCE table. This cache will be utilized more in the upcoming
 VFIO/IOMMU support to continue TCE list processing in the virtual
 mode in the case if the real mode handler failed for some reason.

 This adds a guest physical to host real address converter
 and calls the existing H_PUT_TCE handler. The converting function
 is going to be fully utilized by upcoming VFIO supporting patches.

 This also implements the KVM_CAP_PPC_MULTITCE capability,
 so in order to support the functionality of this patch, QEMU
 needs to query for this capability and set the hcall-multi-tce
 hypertas property only if the capability is present, otherwise
 there will be serious performance degradation.

 Cc: David Gibson da...@gibson.dropbear.id.au
 Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
 Signed-off-by: Paul Mackerras pau...@samba.org
 
 Only a few minor nits. Ben already commented on implementation details.
 

 ---
 Changelog:
 2013/06/05:
 * fixed typo about IBMVIO in the commit message
 * updated doc and moved it to another section
 * changed capability number

 2013/05/21:
 * added kvm_vcpu_arch::tce_tmp
 * removed cleanup if put_indirect failed, instead we do not even start
 writing to TCE table if we cannot get TCEs from the user and they are
 invalid
 * kvmppc_emulated_h_put_tce is split to kvmppc_emulated_put_tce
 and kvmppc_emulated_validate_tce (for the previous item)
 * fixed bug with fallthrough for H_IPI
 * removed all get_user() from real mode handlers
 * kvmppc_lookup_pte() added (instead of making lookup_linux_pte public)
 ---
 Documentation/virtual/kvm/api.txt   |   17 ++
 arch/powerpc/include/asm/kvm_host.h |2 +
 arch/powerpc/include/asm/kvm_ppc.h  |   16 +-
 arch/powerpc/kvm/book3s_64_vio.c|  118 ++
 arch/powerpc/kvm/book3s_64_vio_hv.c |  266 +++
 arch/powerpc/kvm/book3s_hv.c|   39 +
 arch/powerpc/kvm/book3s_hv_rmhandlers.S |6 +
 arch/powerpc/kvm/book3s_pr_papr.c   |   37 -
 arch/powerpc/kvm/powerpc.c  |3 +
 include/uapi/linux/kvm.h|1 +
 10 files changed, 473 insertions(+), 32 deletions(-)

 diff --git a/Documentation/virtual/kvm/api.txt 
 b/Documentation/virtual/kvm/api.txt
 index 5f91eda..6c082ff 100644
 --- a/Documentation/virtual/kvm/api.txt
 +++ b/Documentation/virtual/kvm/api.txt
 @@ -2362,6 +2362,23 @@ calls by the guest for that service will be passed to 
 userspace to be
 handled.


 +4.83 KVM_CAP_PPC_MULTITCE
 +
 +Capability: KVM_CAP_PPC_MULTITCE
 +Architectures: ppc
 +Type: vm
 +
 +This capability tells the guest that multiple TCE entry add/remove 
 hypercalls
 +handling is supported by the kernel. This significantly accelerates DMA
 +operations for PPC KVM guests.
 +
 +Unlike other capabilities in this section, this one does not have an ioctl.
 +Instead, when the capability is present, the H_PUT_TCE_INDIRECT and
 +H_STUFF_TCE hypercalls are to be handled in the host kernel and not passed 
 to
 +the guest. Otherwise it might be better for the guest to continue using
 H_PUT_TCE
 +hypercall (if KVM_CAP_SPAPR_TCE or KVM_CAP_SPAPR_TCE_IOMMU are present).
 

 While this describes perfectly well what the consequences are of the
 patches, it does not describe properly what the CAP actually expresses.
 The CAP only says this kernel is able to handle H_PUT_TCE_INDIRECT and
 H_STUFF_TCE hypercalls directly. All other consequences are nice to
 document, but the semantics of the CAP are missing.


? It expresses the ability to handle 2 hcalls. What is missing?


 We also usually try to keep KVM behavior unchanged with regards to older
 versions until a CAP is enabled. In this case I don't think it matters
 all that much, so I'm fine with declaring it as enabled by default.
 Please document that this is a change in behavior versus older KVM
 versions though.


Ok!


 +
 +
 5. The kvm_run structure
 

 diff --git a/arch/powerpc/include/asm/kvm_host.h 
 b/arch/powerpc/include/asm/kvm_host.h
 index af326cd..85d8f26 100644
 --- a/arch/powerpc/include/asm/kvm_host.h
 +++ b/arch/powerpc/include/asm/kvm_host.h
 @@ -609,6 +609,8 @@ struct kvm_vcpu_arch {
  spinlock_t tbacct_lock;
  u64 busy_stolen;
  u64 busy_preempt;
 +
 +unsigned long *tce_tmp; /* TCE cache for TCE_PUT_INDIRECT hcall */
 #endif
 };
 
 [...]


 
 [...]
 
 diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
 index 550f592

Re: [PATCH 1/4] KVM: PPC: Add support for multiple-TCE hcalls

2013-06-17 Thread Alexey Kardashevskiy
On 06/17/2013 06:40 PM, Alexander Graf wrote:
 
 On 17.06.2013, at 10:34, Alexey Kardashevskiy wrote:
 
 On 06/17/2013 06:02 PM, Alexander Graf wrote:
 
 On 17.06.2013, at 09:55, Alexey Kardashevskiy wrote:
 
 On 06/17/2013 08:06 AM, Alexander Graf wrote:
 
 On 05.06.2013, at 08:11, Alexey Kardashevskiy wrote:
 
 This adds real mode handlers for the H_PUT_TCE_INDIRECT and 
 H_STUFF_TCE hypercalls for QEMU emulated devices such as
 IBMVIO devices or emulated PCI.  These calls allow adding
 multiple entries (up to 512) into the TCE table in one call
 which saves time on transition to/from real mode.
 
 This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs 
 (copied from user and verified) before writing the whole list
 into the TCE table. This cache will be utilized more in the
 upcoming VFIO/IOMMU support to continue TCE list processing in
 the virtual mode in the case if the real mode handler failed
 for some reason.
 
 This adds a guest physical to host real address converter and
 calls the existing H_PUT_TCE handler. The converting function 
 is going to be fully utilized by upcoming VFIO supporting
 patches.
 
 This also implements the KVM_CAP_PPC_MULTITCE capability, so
 in order to support the functionality of this patch, QEMU 
 needs to query for this capability and set the
 hcall-multi-tce hypertas property only if the capability is
 present, otherwise there will be serious performance
 degradation.
 
 Cc: David Gibson da...@gibson.dropbear.id.au Signed-off-by:
 Alexey Kardashevskiy a...@ozlabs.ru Signed-off-by: Paul
 Mackerras pau...@samba.org
 
 Only a few minor nits. Ben already commented on implementation
 details.
 
 
 --- Changelog: 2013/06/05: * fixed typo about IBMVIO in the
 commit message * updated doc and moved it to another section *
 changed capability number
 
 2013/05/21: * added kvm_vcpu_arch::tce_tmp * removed cleanup
 if put_indirect failed; instead we do not even start writing
 to the TCE table if we cannot get TCEs from the user or they are
 invalid * kvmppc_emulated_h_put_tce is split to
 kvmppc_emulated_put_tce and kvmppc_emulated_validate_tce (for
 the previous item) * fixed bug with fallthrough for H_IPI *
 removed all get_user() from real mode handlers *
 kvmppc_lookup_pte() added (instead of making lookup_linux_pte
 public) --- Documentation/virtual/kvm/api.txt   |   17 ++ 
 arch/powerpc/include/asm/kvm_host.h |2 + 
 arch/powerpc/include/asm/kvm_ppc.h  |   16 +- 
 arch/powerpc/kvm/book3s_64_vio.c|  118 ++ 
 arch/powerpc/kvm/book3s_64_vio_hv.c |  266
 +++ arch/powerpc/kvm/book3s_hv.c
 |   39 + arch/powerpc/kvm/book3s_hv_rmhandlers.S |6 + 
 arch/powerpc/kvm/book3s_pr_papr.c   |   37 - 
 arch/powerpc/kvm/powerpc.c  |3 + 
 include/uapi/linux/kvm.h|1 + 10 files
 changed, 473 insertions(+), 32 deletions(-)
 
 diff --git a/Documentation/virtual/kvm/api.txt
 b/Documentation/virtual/kvm/api.txt index 5f91eda..6c082ff
 100644 --- a/Documentation/virtual/kvm/api.txt +++
 b/Documentation/virtual/kvm/api.txt @@ -2362,6 +2362,23 @@
 calls by the guest for that service will be passed to
 userspace to be handled.
 
 
 +4.83 KVM_CAP_PPC_MULTITCE + +Capability:
 KVM_CAP_PPC_MULTITCE +Architectures: ppc +Type: vm + +This
 capability tells the guest that multiple TCE entry add/remove
 hypercalls +handling is supported by the kernel. This
 significantly accelerates DMA +operations for PPC KVM guests.
 + +Unlike other capabilities in this section, this one does
 not have an ioctl. +Instead, when the capability is present,
 the H_PUT_TCE_INDIRECT and +H_STUFF_TCE hypercalls are to be
 handled in the host kernel and not passed to +the guest.
 Otherwise it might be better for the guest to continue using
 H_PUT_TCE +hypercall (if KVM_CAP_SPAPR_TCE or
 KVM_CAP_SPAPR_TCE_IOMMU are present).
 
 
 While this describes perfectly well what the consequences are of
 the patches, it does not describe properly what the CAP actually
 expresses. The CAP only says this kernel is able to handle
 H_PUT_TCE_INDIRECT and H_STUFF_TCE hypercalls directly. All
 other consequences are nice to document, but the semantics of
 the CAP are missing.
 
 
 ? It expresses ability to handle 2 hcalls. What is missing?
 
 You don't describe the kvm - qemu interface. You describe some
 decisions qemu can take from this cap.
 
 
 This file does not mention qemu at all. And the interface is - qemu
 (or kvmtool could do that) just adds hcall-multi-tce to 
 ibm,hypertas-functions but this is for pseries linux and AIX could
 always do it (no idea about it). Does it really have to be in this
 file?
 

 Ok, let's go back a step. What does this CAP describe? Don't look at the
 description you wrote above. Just write a new one.

The CAP means the kernel is capable of handling hcalls A and B without
passing those into the user space. That accelerates DMA.


 What exactly can user space expect when it finds

Re: [PATCH 2/4] powerpc: Prepare to support kernel handling of IOMMU map/unmap

2013-06-17 Thread Alexey Kardashevskiy
On 06/16/2013 02:26 PM, Benjamin Herrenschmidt wrote:
 +#if defined(CONFIG_SPARSEMEM_VMEMMAP) || defined(CONFIG_FLATMEM)
 +int realmode_get_page(struct page *page)
 +{
 +if (PageCompound(page))
 +return -EAGAIN;
 +
 +get_page(page);
 +
 +return 0;
 +}
 +EXPORT_SYMBOL_GPL(realmode_get_page);
 +
 +int realmode_put_page(struct page *page)
 +{
 +if (PageCompound(page))
 +return -EAGAIN;
 +
 +if (!atomic_add_unless(&page->_count, -1, 1))
 +return -EAGAIN;
 +
 +return 0;
 +}
 +EXPORT_SYMBOL_GPL(realmode_put_page);
 +#endif
 
 Several worries here, mostly that if the generic code ever changes
 (something gets added to get_page() that makes it no-longer safe for use
 in real mode for example, or some other condition gets added to
 put_page()), we go out of sync and potentially end up with very hard and
 very subtle bugs.
 
 It might be worth making sure that:
 
  - This is reviewed by some generic VM people (and make sure they
 understand why we need to do that)
 
  - A comment is added to get_page() and put_page() to make sure that if
 they are changed in any way, dbl check the impact on our
 realmode_get_page() (or ping us to make sure things are still ok).

After changing get_page() to get_page_unless_zero(), the get_page API I use is:
get_page_unless_zero() - basically atomic_inc_not_zero()
atomic_add_unless() - just operates on the counter
PageCompound() - check if it is a huge page.

No usage of get_page or put_page.

If any of those changes, I would expect it to hit us immediately, no?

So it may only make sense to add a comment to PageCompound(). But the
comment says PageCompound is generally not used in hot code paths, and
our path is hot. Heh.

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 6d53675..c70a654 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -329,7 +329,8 @@ static inline void set_page_writeback(struct page *page)
  * System with lots of page flags available. This allows separate
  * flags for PageHead() and PageTail() checks of compound pages so that bit
  * tests can be used in performance sensitive paths. PageCompound is
- * generally not used in hot code paths.
+ * generally not used in hot code paths except arch/powerpc/mm/init_64.c
+ * which uses it to detect huge pages and avoid handling those in real mode.
  */
 __PAGEFLAG(Head, head) CLEARPAGEFLAG(Head, head)
 __PAGEFLAG(Tail, tail)


So?


-- 
Alexey
--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling

2013-06-18 Thread Alexey Kardashevskiy
On 06/16/2013 02:39 PM, Benjamin Herrenschmidt wrote:
  static pte_t kvmppc_lookup_pte(pgd_t *pgdir, unsigned long hva, bool 
 writing,
 -unsigned long *pte_sizep)
 +unsigned long *pte_sizep, bool do_get_page)
  {
  pte_t *ptep;
  unsigned int shift = 0;
 @@ -135,6 +136,14 @@ static pte_t kvmppc_lookup_pte(pgd_t *pgdir, unsigned 
 long hva, bool writing,
  if (!pte_present(*ptep))
  return __pte(0);
  
 +/*
 + * Put huge pages handling to the virtual mode.
 + * The only exception is for TCE list pages which we
 + * do need to call get_page() for.
 + */
 +if ((*pte_sizep > PAGE_SIZE) && do_get_page)
 +return __pte(0);
 +
  /* wait until _PAGE_BUSY is clear then set it atomically */
  __asm__ __volatile__ (
  1: ldarx   %0,0,%3\n
 @@ -148,6 +157,18 @@ static pte_t kvmppc_lookup_pte(pgd_t *pgdir, unsigned 
 long hva, bool writing,
  : cc);
  
  ret = pte;
 +if (do_get_page && pte_present(pte) && (!writing || pte_write(pte))) {
 +struct page *pg = NULL;
 +pg = realmode_pfn_to_page(pte_pfn(pte));
 +if (realmode_get_page(pg)) {
 +ret = __pte(0);
 +} else {
 +pte = pte_mkyoung(pte);
 +if (writing)
 +pte = pte_mkdirty(pte);
 +}
 +}
 +*ptep = pte; /* clears _PAGE_BUSY */
  
  return ret;
  }
 
 So now you are adding the clearing of _PAGE_BUSY that was missing for
 your first patch, except that this is not enough since that means that
 in the emulated case (ie, !do_get_page) you will in essence return
 and then use a PTE that is not locked without any synchronization to
 ensure that the underlying page doesn't go away... then you'll
 dereference that page.
 
 So either make everything use speculative get_page, or make the emulated
 case use the MMU notifier to drop the operation in case of collision.
 
 The former looks easier.
 
 Also, any specific reason why you do:
 
   - Lock the PTE
   - get_page()
   - Unlock the PTE
 
 Instead of
 
   - Read the PTE
   - get_page_unless_zero
   - re-check PTE
 
 Like get_user_pages_fast() does ?
 
 The former will be two atomic ops, the latter only one (faster), but
 maybe you have a good reason why that can't work...



If we want to set the dirty and young bits for the pte then I do not know how
to avoid _PAGE_BUSY.

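For the record, the speculative scheme suggested above would look roughly
like this (a sketch using the helpers from this series; it is not the
posted code, which keeps _PAGE_BUSY precisely so it can update the
young/dirty bits):

	/* 1. read the PTE once */
	pte_t pte = ACCESS_ONCE(*ptep);
	struct page *pg = realmode_pfn_to_page(pte_pfn(pte));

	/* 2. speculative reference, one atomic op */
	if (!pg || realmode_get_page(pg))
		return __pte(0);

	/* 3. re-check the PTE; drop the reference if we raced */
	if (pte_val(pte) != pte_val(ACCESS_ONCE(*ptep))) {
		realmode_put_page(pg);
		return __pte(0);
	}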


-- 
Alexey
--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling

2013-06-19 Thread Alexey Kardashevskiy
On 06/20/2013 01:49 AM, Alex Williamson wrote:
 On Thu, 2013-06-20 at 00:50 +1000, Benjamin Herrenschmidt wrote:
 On Wed, 2013-06-19 at 11:58 +0200, Alexander Graf wrote:

 Alex, any objection ?

 Which Alex? :)

 Heh, mostly Williamson in this specific case but your input is still
 welcome :-)

 I think validate works, it keeps iteration logic out of the kernel
 which is a good thing. There still needs to be an interface for
 getting the iommu id in VFIO, but I suppose that one's for the other
 Alex and Jörg to comment on.

 I think getting the iommu fd is already covered by separate patches from
 Alexey.


 Do we need to make it a get/put interface instead ?

vfio_validate_and_use_iommu(file, iommu_id);

vfio_release_iommu(file, iommu_id);

 To ensure that the resource remains owned by the process until KVM
 is closed as well ?

 Or do we want to register with VFIO with a callback so that VFIO can
 call us if it needs us to give it up ?

 Can't we just register a handler on the fd and get notified when it
 closes? Can you kill VFIO access without closing the fd?

 That sounds actually harder :-)

 The question is basically: When we validate that relationship between a
 specific VFIO struct file with an iommu, what is the lifetime of that
 and how do we handle this lifetime properly.

 There's two ways for that sort of situation: The notification model
 where we get notified when the relationship is broken, and the refcount
 model where we become a user and thus delay the breaking of the
 relationship until we have been disposed of as well.

 In this specific case, it's hard to tell what is the right model from my
 perspective, which is why I would welcome Alex (W.) input.

 In the end, the solution will end up being in the form of APIs exposed
 by VFIO for use by KVM (via that symbol lookup mechanism) so Alex (W),
 as owner of VFIO at this stage, what do you want those to look
 like ? :-)
 
 My first thought is that we should use the same reference counting as we
 have for vfio devices (group->container_users).  An interface for that
 might look like:
 
 int vfio_group_add_external_user(struct file *filep)
 {
   struct vfio_group *group = filep->private_data;

   if (filep->f_op != &vfio_group_fops)
   return -EINVAL;


   if (!atomic_inc_not_zero(&group->container_users))
   return -EINVAL;

   return 0;
 }

 void vfio_group_del_external_user(struct file *filep)
 {
   struct vfio_group *group = filep->private_data;

   BUG_ON(filep->f_op != &vfio_group_fops);

   vfio_group_try_dissolve_container(group);
 }

 int vfio_group_iommu_id_from_file(struct file *filep)
 {
   struct vfio_group *group = filep->private_data;

   BUG_ON(filep->f_op != &vfio_group_fops);

   return iommu_group_id(group->iommu_group);
 }
 
 Would that work?  Thanks,


Just out of curiosity - would not get_file() and fput_atomic() on a group's
file* do the right job instead of vfio_group_add_external_user() and
vfio_group_del_external_user()?

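For comparison, the bare file-reference variant asked about above is just
the generic VFS helpers (whether pinning the file alone is sufficient is
exactly the open question here):

	/* when the group fd is handed to KVM: pin the struct file */
	get_file(vfio_group_filp);

	/* ... KVM uses the group ... */

	/* on KVM teardown: drop the reference */
	fput(vfio_group_filp);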


-- 
Alexey
--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling

2013-06-20 Thread Alexey Kardashevskiy
On 06/20/2013 05:47 PM, Benjamin Herrenschmidt wrote:
 On Thu, 2013-06-20 at 15:28 +1000, David Gibson wrote:
 Just out of curiosity - would not get_file() and fput_atomic() on a
 group's
 file* do the right job instead of vfio_group_add_external_user() and
 vfio_group_del_external_user()?

 I was thinking that too.  Grabbing a file reference would certainly be
 the usual way of handling this sort of thing.
 
 But that wouldn't prevent the group ownership to be returned to
 the kernel or another user would it ?


Holding the file pointer does not let the group->container_users counter go
to zero and this is exactly what vfio_group_add_external_user() and
vfio_group_del_external_user() do. The difference is only in absolute value
- 2 vs. 3.

No change in behaviour whether I use new vfio API or simply hold file* till
KVM closes fd created when IOMMU was connected to LIOBN.

And while this counter is not zero, QEMU cannot take ownership over the group.

I am definitely still missing the bigger picture...


-- 
Alexey
--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling

2013-06-22 Thread Alexey Kardashevskiy
On 06/21/2013 12:55 AM, Alex Williamson wrote:
 On Thu, 2013-06-20 at 18:48 +1000, Alexey Kardashevskiy wrote:
 On 06/20/2013 05:47 PM, Benjamin Herrenschmidt wrote:
 On Thu, 2013-06-20 at 15:28 +1000, David Gibson wrote:
 Just out of curiosity - would not get_file() and fput_atomic() on a
 group's
 file* do the right job instead of vfio_group_add_external_user() and
 vfio_group_del_external_user()?

 I was thinking that too.  Grabbing a file reference would certainly be
 the usual way of handling this sort of thing.

 But that wouldn't prevent the group ownership to be returned to
 the kernel or another user would it ?


 Holding the file pointer does not let the group->container_users counter go
 to zero
 
 How so?  Holding the file pointer means the file won't go away, which
 means the group release function won't be called.  That means the group
 won't go away, but that doesn't mean it's attached to an IOMMU.  A user
 could call UNSET_CONTAINER.
 
  and this is exactly what vfio_group_add_external_user() and
 vfio_group_del_external_user() do. The difference is only in absolute value
 - 2 vs. 3.

 No change in behaviour whether I use new vfio API or simply hold file* till
 KVM closes fd created when IOMMU was connected to LIOBN.
 
 By that notion you could open(/dev/vfio/$GROUP) and you're safe, right?
 But what about SET_CONTAINER  SET_IOMMU?  All that you guarantee
 holding the file pointer is that the vfio_group exists.
 
 And while this counter is not zero, QEMU cannot take ownership over the 
 group.

 I am definitely still missing the bigger picture...
 
 The bigger picture is that the group needs to exist AND it needs to be
 setup and maintained to have IOMMU protection.  Actually, my first stab
 at add_external_user doesn't look sufficient, it needs to look more like
 vfio_group_get_device_fd, checking group->container->iommu and
 group_viable().


This makes sense. If you did this, that would be great. Without it, I
really cannot see how the proposed inc/dec of container_users is better
than simple holding file*. Thanks.


 As written it would allow an external user after
 SET_CONTAINER without SET_IOMMU.  It should also be part of the API that
 the external user must hold the file reference between add_external_user
 and del_external_user and do cleanup on any exit paths.  Thanks,


-- 
Alexey
--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/8 v4] KVM: PPC: IOMMU in-kernel handling

2013-06-26 Thread Alexey Kardashevskiy
The changes are:
1. rebased on v3.10-rc7
2. removed spinlocks from real mode
3. added security checks between KVM and VFIO

More details in the individual patch comments.


Alexey Kardashevskiy (8):
  KVM: PPC: reserve a capability number for multitce support
  KVM: PPC: reserve a capability and ioctl numbers for realmode VFIO
  vfio: add external user support
  hashtable: add hash_for_each_possible_rcu_notrace()
  powerpc: Prepare to support kernel handling of IOMMU map/unmap
  KVM: PPC: Add support for multiple-TCE hcalls
  KVM: PPC: Add support for IOMMU in-kernel handling
  KVM: PPC: Add hugepage support for IOMMU in-kernel handling

 Documentation/virtual/kvm/api.txt|   51 +++
 arch/powerpc/include/asm/kvm_host.h  |   31 ++
 arch/powerpc/include/asm/kvm_ppc.h   |   18 +-
 arch/powerpc/include/asm/pgtable-ppc64.h |4 +
 arch/powerpc/include/uapi/asm/kvm.h  |8 +
 arch/powerpc/kvm/book3s_64_vio.c |  506 +-
 arch/powerpc/kvm/book3s_64_vio_hv.c  |  439 --
 arch/powerpc/kvm/book3s_hv.c |   41 ++-
 arch/powerpc/kvm/book3s_hv_rmhandlers.S  |6 +
 arch/powerpc/kvm/book3s_pr_papr.c|   37 ++-
 arch/powerpc/kvm/powerpc.c   |   15 +
 arch/powerpc/mm/init_64.c|   78 -
 drivers/vfio/vfio.c  |   53 
 include/linux/hashtable.h|   15 +
 include/linux/page-flags.h   |4 +-
 include/uapi/linux/kvm.h |3 +
 16 files changed, 1279 insertions(+), 30 deletions(-)

-- 
1.7.10.4

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 7/8] KVM: PPC: Add support for IOMMU in-kernel handling

2013-06-26 Thread Alexey Kardashevskiy
This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
and H_STUFF_TCE requests without passing them to QEMU, which saves time
on switching to QEMU and back.

Both real and virtual modes are supported. The kernel first tries to
handle a TCE request in real mode; if that fails, it passes the request
to virtual mode to complete the operation. If the virtual mode handler
also fails, the request is passed to user mode.

This adds a new KVM_CAP_SPAPR_TCE_IOMMU ioctl to associate
a virtual PCI bus ID (LIOBN) with an IOMMU group which enables
in-kernel handling of IOMMU map/unmap. The external user API support
in VFIO is required.

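A sketch of the userspace side of this association, using the structure
added by this patch (vm_fd, liobn and iommu_id are placeholders for
values the caller already has):

	struct kvm_create_spapr_tce_iommu args = {
		.liobn    = liobn,    /* LIOBN the guest uses in H_PUT_TCE */
		.iommu_id = iommu_id, /* host IOMMU group ID from VFIO */
		.flags    = 0,        /* no flags defined yet */
	};

	if (ioctl(vm_fd, KVM_CREATE_SPAPR_TCE_IOMMU, &args) < 0)
		/* fall back to userspace TCE handling */;
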
Tests show that this patch increases transmission speed from 220MB/s
to 750..1020MB/s on a 10Gb network (Chelsio CXGB3 10Gb Ethernet card).

Signed-off-by: Paul Mackerras pau...@samba.org
Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru

---

Changes:
2013/06/27:
* tce_list page is now referenced in order to protect it from accidental
invalidation during H_PUT_TCE_INDIRECT execution
* added use of the external user VFIO API

2013/06/05:
* changed capability number
* changed ioctl number
* update the doc article number

2013/05/20:
* removed get_user() from real mode handlers
* kvm_vcpu_arch::tce_tmp usage extended. Now the real mode handler puts
translated TCEs there, tries realmode_get_page() on those and, if it fails,
passes control to the virtual mode handler which tries to finish
the request handling
* kvmppc_lookup_pte() now does realmode_get_page() protected by BUSY bit
on a page
* The only reason to pass the request to user mode now is when the user mode
did not register TCE table in the kernel, in all other cases the virtual mode
handler is expected to do the job
---
 Documentation/virtual/kvm/api.txt   |   26 
 arch/powerpc/include/asm/kvm_host.h |4 +
 arch/powerpc/include/asm/kvm_ppc.h  |2 +
 arch/powerpc/include/uapi/asm/kvm.h |8 +
 arch/powerpc/kvm/book3s_64_vio.c|  294 ++-
 arch/powerpc/kvm/book3s_64_vio_hv.c |  165 
 arch/powerpc/kvm/powerpc.c  |   12 ++
 7 files changed, 509 insertions(+), 2 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt 
b/Documentation/virtual/kvm/api.txt
index 762c703..01b0dc2 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -2387,6 +2387,32 @@ slows operations a lot.
 Unlike other capabilities of this section, this one is always enabled.
 
 
+4.87 KVM_CREATE_SPAPR_TCE_IOMMU
+
+Capability: KVM_CAP_SPAPR_TCE_IOMMU
+Architectures: powerpc
+Type: vm ioctl
+Parameters: struct kvm_create_spapr_tce_iommu (in)
+Returns: 0 on success, -1 on error
+
+struct kvm_create_spapr_tce_iommu {
+   __u64 liobn;
+   __u32 iommu_id;
+   __u32 flags;
+};
+
+This creates a link between an IOMMU group and a hardware TCE (translation
+control entry) table. This link lets the host kernel know what IOMMU
+group (i.e. TCE table) to use for the LIOBN number passed with
+H_PUT_TCE, H_PUT_TCE_INDIRECT, H_STUFF_TCE hypercalls.
+
+In response to a TCE hypercall, the kernel looks for a TCE table descriptor
+in the list and handles the hypercall in real or virtual modes if
+the descriptor is found. Otherwise the hypercall is passed to the user mode.
+
+No flags are supported at the moment.
+
+
 5. The kvm_run structure
 
 
diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index 3bf407b..716ab18 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -180,6 +180,8 @@ struct kvmppc_spapr_tce_table {
struct kvm *kvm;
u64 liobn;
u32 window_size;
+   struct iommu_group *grp;/* used for IOMMU groups */
+   struct file *vfio_filp; /* used for IOMMU groups */
struct page *pages[0];
 };
 
@@ -611,6 +613,8 @@ struct kvm_vcpu_arch {
u64 busy_preempt;
 
unsigned long *tce_tmp;/* TCE cache for TCE_PUT_INDIRECT hcall */
+   unsigned long tce_tmp_num; /* Number of handled TCEs in the cache */
+   unsigned long tce_reason;  /* The reason of switching to the virtmode */
 #endif
 };
 
diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
b/arch/powerpc/include/asm/kvm_ppc.h
index e852921b..934e01d 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -133,6 +133,8 @@ extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
 
 extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
struct kvm_create_spapr_tce *args);
+extern long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
+   struct kvm_create_spapr_tce_iommu *args);
 extern struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(
struct kvm_vcpu *vcpu, unsigned long liobn);
 extern long kvmppc_emulated_validate_tce(unsigned long tce);
diff --git a/arch/powerpc/include/uapi/asm/kvm.h 
b/arch

[PATCH 8/8] KVM: PPC: Add hugepage support for IOMMU in-kernel handling

2013-06-26 Thread Alexey Kardashevskiy
This adds special support for huge pages (16MB).  The reference
counting cannot be easily done for such pages in real mode (when
MMU is off) so we added a list of huge pages.  It is populated in
virtual mode and get_page is called just once per huge page.
Real mode handlers check if the requested page is huge and in the list,
then no reference counting is done, otherwise an exit to virtual mode
happens.  The list is released at KVM exit.  At the moment the fastest
card available for tests uses up to 9 huge pages so walking through this
list is not very expensive.  However this can change and we may want
to optimize this.

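Since the pages are 16MB (2^24 bytes), the hash key is derived from
gpa >> 24, so every address inside one huge page lands in the same
bucket. The real-mode lookup is then roughly the following sketch
(names as in the diff below; the miss case exits to virtual mode):

	unsigned key = KVMPPC_HUGEPAGE_HASH(gpa); /* hash_32(gpa >> 24, 32) */
	struct kvmppc_iommu_hugepage *hp;

	hash_for_each_possible_rcu_notrace(tt->hash_tab, hp, hash_node, key) {
		if ((gpa < hp->gpa) || (gpa >= hp->gpa + hp->size))
			continue;	/* collision, different huge page */
		return hp->hpa + (gpa & (hp->size - 1)); /* hit: translate */
	}
	return H_TOO_HARD;	/* miss: add the page in virtual mode */
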
Signed-off-by: Paul Mackerras pau...@samba.org
Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru

---

Changes:
2013/06/27:
* list of huge pages replaced with a hashtable for better performance
* spinlock removed from real mode; it now only protects insertion of new
huge page descriptors into the hashtable

2013/06/05:
* fixed compile error when CONFIG_IOMMU_API=n

2013/05/20:
* the real mode handler now searches for a huge page by gpa (used to be pte)
* the virtual mode handler prints a warning if it is called twice for the same
huge page as the real mode handler is expected to fail just once - when a huge
page is not in the list yet.
* the huge page is refcounted twice - when added to the hugepage list and
when used in the virtual mode hcall handler (can be optimized but it will
make the patch less nice).
---
 arch/powerpc/include/asm/kvm_host.h |   25 +
 arch/powerpc/kvm/book3s_64_vio.c|   95 +--
 arch/powerpc/kvm/book3s_64_vio_hv.c |   24 +++--
 3 files changed, 138 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index 716ab18..0ad6189 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -30,6 +30,7 @@
 #include linux/kvm_para.h
 #include linux/list.h
 #include linux/atomic.h
+#include linux/hashtable.h
 #include asm/kvm_asm.h
 #include asm/processor.h
 #include asm/page.h
@@ -182,9 +183,33 @@ struct kvmppc_spapr_tce_table {
u32 window_size;
struct iommu_group *grp;/* used for IOMMU groups */
struct file *vfio_filp; /* used for IOMMU groups */
+   DECLARE_HASHTABLE(hash_tab, ilog2(64)); /* used for IOMMU groups */
+   spinlock_t hugepages_write_lock;/* used for IOMMU groups */
struct page *pages[0];
 };
 
+/*
+ * The KVM guest can be backed with 16MB pages.
+ * In this case, we cannot do page counting from the real mode
+ * as the compound pages are used - they are linked in a list
+ * with pointers as virtual addresses which are inaccessible
+ * in real mode.
+ *
+ * The code below keeps a 16MB pages list and uses page struct
+ * in real mode if it is already locked in RAM and inserted into
+ * the list or switches to the virtual mode where it can be
+ * handled in a usual manner.
+ */
+#define KVMPPC_HUGEPAGE_HASH(gpa)  hash_32(gpa >> 24, 32)
+
+struct kvmppc_iommu_hugepage {
+   struct hlist_node hash_node;
+   unsigned long gpa;  /* Guest physical address */
+   unsigned long hpa;  /* Host physical address */
+   struct page *page;  /* page struct of the very first subpage */
+   unsigned long size; /* Huge page size (always 16MB at the moment) */
+};
+
 struct kvmppc_linear_info {
void*base_virt;
unsigned longbase_pfn;
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index a5d0195..6cedfe9 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -47,6 +47,78 @@
 #define TCES_PER_PAGE  (PAGE_SIZE / sizeof(u64))
 #define ERROR_ADDR  ((void *)~(unsigned long)0x0)
 
+#ifdef CONFIG_IOMMU_API
+/* Adds a new huge page descriptor to the hashtable */
+static long kvmppc_iommu_hugepage_try_add(
+   struct kvmppc_spapr_tce_table *tt,
+   pte_t pte, unsigned long hva, unsigned long gpa,
+   unsigned long pg_size)
+{
+   long ret = 0;
+   struct kvmppc_iommu_hugepage *hp;
+   struct page *pg;
+   unsigned key = KVMPPC_HUGEPAGE_HASH(gpa);
+
+   spin_lock(&tt->hugepages_write_lock);
+   hash_for_each_possible_rcu(tt->hash_tab, hp, hash_node, key) {
+   if (KVMPPC_HUGEPAGE_HASH(hp->gpa) != key)
+   continue;
+   if ((gpa < hp->gpa) || (gpa >= hp->gpa + hp->size))
+   continue;
+   goto unlock_exit;
+   }
+
+   hva = hva & ~(pg_size - 1);
+   ret = get_user_pages_fast(hva, 1, true/*write*/, &pg);
+   if ((ret != 1) || !pg) {
+   ret = -EFAULT;
+   goto unlock_exit;
+   }
+   ret = 0;
+
+   hp = kzalloc(sizeof(*hp), GFP_KERNEL);
+   if (!hp) {
+   ret = -ENOMEM;
+   goto unlock_exit;
+   }
+
+   hp->page = pg;
+   hp

[PATCH 5/8] powerpc: Prepare to support kernel handling of IOMMU map/unmap

2013-06-26 Thread Alexey Kardashevskiy
The current VFIO-on-POWER implementation supports only user mode
driven mapping, i.e. QEMU is sending requests to map/unmap pages.
However this approach is really slow, so we want to move that to KVM.
Since H_PUT_TCE can be extremely performance sensitive (especially with
network adapters where each packet needs to be mapped/unmapped) we chose
to implement that as a fast hypercall directly in real
mode (processor still in the guest context but MMU off).

To be able to do that, we need to provide some facilities to
access the struct page count within that real mode environment as things
like the sparsemem vmemmap mappings aren't accessible.

This adds an API to increment/decrement page counter as
get_user_pages API used for user mode mapping does not work
in the real mode.

CONFIG_SPARSEMEM_VMEMMAP and CONFIG_FLATMEM are supported.

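A sketch of the calling pattern a real-mode hcall handler is expected to
use (H_TOO_HARD is the existing PPC KVM convention for "retry this hcall
in virtual mode"):

	/* MMU is off here; any failure means "retry in virtual mode",
	 * not a hard error. */
	struct page *pg = realmode_pfn_to_page(pfn);

	if (!pg || realmode_get_page(pg))
		return H_TOO_HARD;

	/* ... update the TCE table entry ... */

	if (realmode_put_page(pg))
		return H_TOO_HARD;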
Reviewed-by: Paul Mackerras pau...@samba.org
Cc: David Gibson da...@gibson.dropbear.id.au
Signed-off-by: Paul Mackerras pau...@samba.org
Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru

---

Changes:
2013/06/27:
* realmode_get_page() fixed to use get_page_unless_zero(). If it fails,
the call will be passed from real to virtual mode and safely handled.
* added comment to PageCompound() in include/linux/page-flags.h.

2013/05/20:
* PageTail() is replaced by PageCompound() in order to have the same checks
for whether the page is huge in realmode_get_page() and realmode_put_page()
---
 arch/powerpc/include/asm/pgtable-ppc64.h |4 ++
 arch/powerpc/mm/init_64.c|   78 +-
 include/linux/page-flags.h   |4 +-
 3 files changed, 84 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h 
b/arch/powerpc/include/asm/pgtable-ppc64.h
index e3d55f6f..7b46e5f 100644
--- a/arch/powerpc/include/asm/pgtable-ppc64.h
+++ b/arch/powerpc/include/asm/pgtable-ppc64.h
@@ -376,6 +376,10 @@ static inline pte_t *find_linux_pte_or_hugepte(pgd_t 
*pgdir, unsigned long ea,
 }
 #endif /* !CONFIG_HUGETLB_PAGE */
 
+struct page *realmode_pfn_to_page(unsigned long pfn);
+int realmode_get_page(struct page *page);
+int realmode_put_page(struct page *page);
+
 #endif /* __ASSEMBLY__ */
 
 #endif /* _ASM_POWERPC_PGTABLE_PPC64_H_ */
diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index a90b9c4..7031be3 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -297,5 +297,81 @@ void vmemmap_free(unsigned long start, unsigned long end)
 {
 }
 
-#endif /* CONFIG_SPARSEMEM_VMEMMAP */
+/*
+ * We do not have access to the sparsemem vmemmap, so we fallback to
+ * walking the list of sparsemem blocks which we already maintain for
+ * the sake of crashdump. In the long run, we might want to maintain
+ * a tree if performance of that linear walk becomes a problem.
+ *
+ * Any of realmode_ functions can fail due to:
+ * 1) As real sparsemem blocks do not lie in RAM contiguously (they
+ * are in virtual address space which is not available in the real mode),
+ * the requested page struct can be split between blocks so get_page/put_page
+ * may fail.
+ * 2) When huge pages are used, the get_page/put_page API will fail
+ * in real mode as the linked addresses in the page struct are virtual
+ * too.
+ * When 1) or 2) takes place, the API returns an error code to cause
+ * an exit to kernel virtual mode where the operation will be completed.
+ */
+struct page *realmode_pfn_to_page(unsigned long pfn)
+{
+   struct vmemmap_backing *vmem_back;
+   struct page *page;
+   unsigned long page_size = 1 << mmu_psize_defs[mmu_vmemmap_psize].shift;
+   unsigned long pg_va = (unsigned long) pfn_to_page(pfn);
+
+   for (vmem_back = vmemmap_list; vmem_back; vmem_back = vmem_back->list) {
+   if (pg_va < vmem_back->virt_addr)
+   continue;
+
+   /* Check that page struct is not split between real pages */
+   if ((pg_va + sizeof(struct page)) >
+   (vmem_back->virt_addr + page_size))
+   return NULL;
+
+   page = (struct page *) (vmem_back->phys + pg_va -
+   vmem_back->virt_addr);
+   return page;
+   }
+
+   return NULL;
+}
+EXPORT_SYMBOL_GPL(realmode_pfn_to_page);
+
+#elif defined(CONFIG_FLATMEM)
+
+struct page *realmode_pfn_to_page(unsigned long pfn)
+{
+   struct page *page = pfn_to_page(pfn);
+   return page;
+}
+EXPORT_SYMBOL_GPL(realmode_pfn_to_page);
+
+#endif /* CONFIG_SPARSEMEM_VMEMMAP/CONFIG_FLATMEM */
+
+#if defined(CONFIG_SPARSEMEM_VMEMMAP) || defined(CONFIG_FLATMEM)
+int realmode_get_page(struct page *page)
+{
+   if (PageCompound(page))
+   return -EAGAIN;
+
+   if (!get_page_unless_zero(page))
+   return -EAGAIN;
+
+   return 0;
+}
+EXPORT_SYMBOL_GPL(realmode_get_page);
+
+int realmode_put_page(struct page *page)
+{
+   if (PageCompound(page))
+   return -EAGAIN

[PATCH 6/8] KVM: PPC: Add support for multiple-TCE hcalls

2013-06-26 Thread Alexey Kardashevskiy
This adds real mode handlers for the H_PUT_TCE_INDIRECT and
H_STUFF_TCE hypercalls for QEMU emulated devices such as IBMVIO
devices or emulated PCI.  These calls allow adding multiple entries
(up to 512) into the TCE table in one call which saves time on
transition to/from real mode.

This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs
(copied from user and verified) before writing the whole list into
the TCE table. This cache will be utilized more in the upcoming
VFIO/IOMMU support to continue TCE list processing in the virtual
mode in the case if the real mode handler failed for some reason.

This adds a guest physical to host real address converter
and calls the existing H_PUT_TCE handler. The converting function
is going to be fully utilized by upcoming VFIO supporting patches.

This also implements the KVM_CAP_PPC_MULTITCE capability,
so in order to support the functionality of this patch, QEMU
needs to query for this capability and set the hcall-multi-tce
hypertas property only if the capability is present, otherwise
there will be serious performance degradation.

Signed-off-by: Paul Mackerras pau...@samba.org
Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru

---
Changelog:
2013/06/27:
* fixed clear of BUSY bit in kvmppc_lookup_pte()
* H_PUT_TCE_INDIRECT does realmode_get_page() now
* KVM_CAP_SPAPR_MULTITCE now depends on CONFIG_PPC_BOOK3S_64
* updated doc

2013/06/05:
* fixed typo about IBMVIO in the commit message
* updated doc and moved it to another section
* changed capability number

2013/05/21:
* added kvm_vcpu_arch::tce_tmp
* removed cleanup if put_indirect failed; instead we do not even start
writing to the TCE table if we cannot get TCEs from the user or they are
invalid
* kvmppc_emulated_h_put_tce is split to kvmppc_emulated_put_tce
and kvmppc_emulated_validate_tce (for the previous item)
* fixed bug with fallthrough for H_IPI
* removed all get_user() from real mode handlers
* kvmppc_lookup_pte() added (instead of making lookup_linux_pte public)
---
 Documentation/virtual/kvm/api.txt   |   25 +++
 arch/powerpc/include/asm/kvm_host.h |2 +
 arch/powerpc/include/asm/kvm_ppc.h  |   16 +-
 arch/powerpc/kvm/book3s_64_vio.c|  123 ++
 arch/powerpc/kvm/book3s_64_vio_hv.c |  270 +++
 arch/powerpc/kvm/book3s_hv.c|   41 -
 arch/powerpc/kvm/book3s_hv_rmhandlers.S |6 +
 arch/powerpc/kvm/book3s_pr_papr.c   |   37 -
 arch/powerpc/kvm/powerpc.c  |3 +
 9 files changed, 490 insertions(+), 33 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt 
b/Documentation/virtual/kvm/api.txt
index 6365fef..762c703 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -2362,6 +2362,31 @@ calls by the guest for that service will be passed to 
userspace to be
 handled.
 
 
+4.86 KVM_CAP_PPC_MULTITCE
+
+Capability: KVM_CAP_PPC_MULTITCE
+Architectures: ppc
+Type: vm
+
+This capability means the kernel is capable of handling hypercalls
+H_PUT_TCE_INDIRECT and H_STUFF_TCE without passing those into the user
+space. This significantly accelerates DMA operations for PPC KVM guests.
+The user space should expect that its handlers for these hypercalls
+are not going to be called.
+
+In order to enable H_PUT_TCE_INDIRECT and H_STUFF_TCE use in the guest,
+the user space might have to advertise it for the guest. For example,
+IBM pSeries guest starts using them if hcall-multi-tce is present in
+the ibm,hypertas-functions device-tree property.
+
+Without this capability, only H_PUT_TCE is handled by the kernel and
+therefore the use of H_PUT_TCE_INDIRECT and H_STUFF_TCE is not recommended
+unless the capability is present as passing hypercalls to the userspace
+slows operations a lot.
+
+Unlike other capabilities of this section, this one is always enabled.
+
+
 5. The kvm_run structure
 
 
diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index af326cd..3bf407b 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -609,6 +609,8 @@ struct kvm_vcpu_arch {
spinlock_t tbacct_lock;
u64 busy_stolen;
u64 busy_preempt;
+
+   unsigned long *tce_tmp;/* TCE cache for TCE_PUT_INDIRECT hcall */
 #endif
 };
 
diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
b/arch/powerpc/include/asm/kvm_ppc.h
index a5287fe..e852921b 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -133,8 +133,20 @@ extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
 
 extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
struct kvm_create_spapr_tce *args);
-extern long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
-unsigned long ioba, unsigned long tce);
+extern struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(
+   struct kvm_vcpu *vcpu

[PATCH 4/8] hashtable: add hash_for_each_possible_rcu_notrace()

2013-06-26 Thread Alexey Kardashevskiy
This adds hash_for_each_possible_rcu_notrace() which is basically
a notrace clone of hash_for_each_possible_rcu() which cannot be
used in real mode due to its tracing/debugging capability.

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
---
 include/linux/hashtable.h |   15 +++
 1 file changed, 15 insertions(+)

diff --git a/include/linux/hashtable.h b/include/linux/hashtable.h
index a9df51f..af8b169 100644
--- a/include/linux/hashtable.h
+++ b/include/linux/hashtable.h
@@ -174,6 +174,21 @@ static inline void hash_del_rcu(struct hlist_node *node)
member)
 
 /**
+ * hash_for_each_possible_rcu_notrace - iterate over all possible objects 
hashing
+ * to the same bucket in an rcu enabled hashtable
+ * @name: hashtable to iterate
+ * @obj: the type * to use as a loop cursor for each entry
+ * @member: the name of the hlist_node within the struct
+ * @key: the key of the objects to iterate over
+ *
+ * This is the same as hash_for_each_possible_rcu() except that it does
+ * not do any RCU debugging or tracing.
+ */
+#define hash_for_each_possible_rcu_notrace(name, obj, member, key) \
+   hlist_for_each_entry_rcu_notrace(obj, &name[hash_min(key, 
HASH_BITS(name))],\
+   member)
+
+/**
  * hash_for_each_possible_safe - iterate over all possible objects hashing to 
the
  * same bucket safe against removals
  * @name: hashtable to iterate
-- 
1.7.10.4

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
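A minimal usage sketch of the hash_for_each_possible_rcu_notrace()
iterator added above; it is a drop-in replacement for
hash_for_each_possible_rcu() at call sites where tracing must be avoided
(such as PPC real mode):

	#include <linux/hashtable.h>

	struct obj {
		struct hlist_node hash_node;
		unsigned long key;
	};

	static DEFINE_HASHTABLE(tbl, 6);	/* 2^6 = 64 buckets */

	/* RCU read-side lookup without RCU debugging/tracing */
	static struct obj *lookup(unsigned long key)
	{
		struct obj *o;

		hash_for_each_possible_rcu_notrace(tbl, o, hash_node, key)
			if (o->key == key)
				return o;

		return NULL;
	}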


[PATCH 1/8] KVM: PPC: reserve a capability number for multitce support

2013-06-26 Thread Alexey Kardashevskiy
Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
---
 include/uapi/linux/kvm.h |1 +
 1 file changed, 1 insertion(+)

diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index d88c8ee..970b1f5 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -666,6 +666,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_IRQ_MPIC 90
 #define KVM_CAP_PPC_RTAS 91
 #define KVM_CAP_IRQ_XICS 92
+#define KVM_CAP_SPAPR_MULTITCE 93
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
-- 
1.7.10.4

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/8] vfio: add external user support

2013-06-26 Thread Alexey Kardashevskiy
VFIO is designed to be used via ioctls on file descriptors
returned by VFIO.

However in some situations support for an external user is required.
The first user is KVM on PPC64 (SPAPR TCE protocol) which is going to
use the existing VFIO groups for exclusive access in real/virtual mode
in the host kernel to avoid passing map/unmap requests to the user
space, which would make things pretty slow.

The proposed protocol includes:

1. do normal VFIO init stuff such as opening a new container, attaching
group(s) to it, setting an IOMMU driver for a container. When IOMMU is
set for a container, all groups in it are considered ready to use by
an external user.

2. pass a fd of the group we want to accelerate to KVM. KVM calls
vfio_group_iommu_id_from_file() to verify if the group is initialized
and IOMMU is set for it. The current TCE IOMMU driver marks the whole
IOMMU table as busy when IOMMU is set for a container, which prevents
other DMA users from allocating from it, so it is safe to pass the group
to the user space.

3. KVM increases the container users counter via
vfio_group_add_external_user(). This prevents the VFIO group from
being disposed prior to exiting KVM.

4. When KVM is finished and doing cleanup, it releases the group file
and decrements the container users counter. Everything gets released.

5. KVM also keeps the group file as otherwise its fd might have been
closed at the moment of KVM finish so vfio_group_del_external_user()
call will not be possible.

The vfio: Limit group opens patch is also required for consistency.

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
---
 drivers/vfio/vfio.c |   53 +++
 1 file changed, 53 insertions(+)

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index c488da5..54192b2 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -1370,6 +1370,59 @@ static const struct file_operations vfio_device_fops = {
 };
 
 /**
+ * External user API, exported by symbols to be linked dynamically.
+ */
+
+/* Allows an external user (for example, KVM) to lock an IOMMU group */
+static int vfio_group_add_external_user(struct file *filep)
+{
+   struct vfio_group *group = filep->private_data;
+
+   if (filep->f_op != &vfio_group_fops)
+   return -EINVAL;
+
+   if (!atomic_inc_not_zero(&group->container_users))
+   return -EINVAL;
+
+   return 0;
+}
+EXPORT_SYMBOL_GPL(vfio_group_add_external_user);
+
+/* Allows an external user (for example, KVM) to unlock an IOMMU group */
+static void vfio_group_del_external_user(struct file *filep)
+{
+   struct vfio_group *group = filep->private_data;
+
+   BUG_ON(filep->f_op != &vfio_group_fops);
+
+   vfio_group_try_dissolve_container(group);
+}
+EXPORT_SYMBOL_GPL(vfio_group_del_external_user);
+
+/*
+ * Checks if a group for the specified file can be used by
+ * an external user and returns the IOMMU ID if external use is possible.
+ */
+static int vfio_group_iommu_id_from_file(struct file *filep)
+{
+   int ret;
+   struct vfio_group *group = filep->private_data;
+
+   if (WARN_ON(filep->f_op != &vfio_group_fops))
+   return -EINVAL;
+
+   if (0 == atomic_read(&group->container_users) ||
+   !group->container->iommu_driver ||
+   !vfio_group_viable(group))
+   return -EINVAL;
+
+   ret = iommu_group_id(group->iommu_group);
+
+   return ret;
+}
+EXPORT_SYMBOL_GPL(vfio_group_iommu_id_from_file);
+
+/**
  * Module/class support
  */
 static char *vfio_devnode(struct device *dev, umode_t *mode)
-- 
1.7.10.4

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/8] vfio: add external user support

2013-06-27 Thread Alexey Kardashevskiy
On 06/27/2013 07:42 PM, Benjamin Herrenschmidt wrote:
 On Thu, 2013-06-27 at 16:59 +1000, Stephen Rothwell wrote:
 +/* Allows an external user (for example, KVM) to unlock an IOMMU
 group */
 +static void vfio_group_del_external_user(struct file *filep)
 +{
 + struct vfio_group *group = filep->private_data;
 +
 + BUG_ON(filep->f_op != &vfio_group_fops);

 We usually reserve BUG_ON for situations where there is no way to
 continue running or continuing will corrupt the running kernel.  Maybe
 WARN_ON() and return?
 
 Not even that. This is a user space provided fd, we shouldn't oops the
 kernel because we passed a wrong argument, just return -EINVAL or
 something like that (add a return code).

I'll change to WARN_ON but...
This is going to be called on KVM exit on a file pointer previously
verified for correctness. If it is a wrong file*, then something went
terribly wrong.


-- 
Alexey
--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/8 v5] KVM: PPC: IOMMU in-kernel handling

2013-07-06 Thread Alexey Kardashevskiy
The changes are:
1. rebased on v3.10
2. added arch_spin_locks to protect TCE table in real mode
3. reworked VFIO external API
4. added missing bits for real mode handling of TCE requests on p7ioc

More details in the individual patch comments.

Depends on hashtable: add hash_for_each_possible_rcu_notrace(),
posted earlier today.

Alexey Kardashevskiy (8):
  KVM: PPC: reserve a capability number for multitce support
  KVM: PPC: reserve a capability and ioctl numbers for realmode VFIO
  vfio: add external user support
  powerpc: Prepare to support kernel handling of IOMMU map/unmap
  powerpc: add real mode support for dma operations on powernv
  KVM: PPC: Add support for multiple-TCE hcalls
  KVM: PPC: Add support for IOMMU in-kernel handling
  KVM: PPC: Add hugepage support for IOMMU in-kernel handling

 Documentation/virtual/kvm/api.txt |  51 +++
 arch/powerpc/include/asm/iommu.h  |   9 +-
 arch/powerpc/include/asm/kvm_host.h   |  37 ++
 arch/powerpc/include/asm/kvm_ppc.h|  18 +-
 arch/powerpc/include/asm/machdep.h|  12 +
 arch/powerpc/include/asm/pgtable-ppc64.h  |   4 +
 arch/powerpc/include/uapi/asm/kvm.h   |   7 +
 arch/powerpc/kernel/iommu.c   | 200 +++
 arch/powerpc/kvm/book3s_64_vio.c  | 541 +-
 arch/powerpc/kvm/book3s_64_vio_hv.c   | 404 --
 arch/powerpc/kvm/book3s_hv.c  |  41 ++-
 arch/powerpc/kvm/book3s_hv_rmhandlers.S   |   6 +
 arch/powerpc/kvm/book3s_pr_papr.c |  37 +-
 arch/powerpc/kvm/powerpc.c|  15 +
 arch/powerpc/mm/init_64.c |  78 -
 arch/powerpc/platforms/powernv/pci-ioda.c |  26 +-
 arch/powerpc/platforms/powernv/pci.c  |  38 ++-
 arch/powerpc/platforms/powernv/pci.h  |   2 +-
 drivers/vfio/vfio.c   |  35 ++
 include/linux/page-flags.h|   4 +-
 include/linux/vfio.h  |   7 +
 include/uapi/linux/kvm.h  |   3 +
 22 files changed, 1453 insertions(+), 122 deletions(-)

-- 
1.8.3.2

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 4/8] powerpc: Prepare to support kernel handling of IOMMU map/unmap

2013-07-06 Thread Alexey Kardashevskiy
The current VFIO-on-POWER implementation supports only user mode
driven mapping, i.e. QEMU is sending requests to map/unmap pages.
However this approach is really slow, so we want to move that to KVM.
Since H_PUT_TCE can be extremely performance sensitive (especially with
network adapters where each packet needs to be mapped/unmapped) we chose
to implement that as a fast hypercall directly in real
mode (processor still in the guest context but MMU off).

To be able to do that, we need to provide some facilities to
access the struct page count within that real mode environment as things
like the sparsemem vmemmap mappings aren't accessible.

This adds an API to increment/decrement page counter as
get_user_pages API used for user mode mapping does not work
in the real mode.

CONFIG_SPARSEMEM_VMEMMAP and CONFIG_FLATMEM are supported.

Reviewed-by: Paul Mackerras pau...@samba.org
Signed-off-by: Paul Mackerras pau...@samba.org
Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru

---

Changes:
2013/06/27:
* realmode_get_page() fixed to use get_page_unless_zero(). If it fails,
the call will be passed from real to virtual mode and safely handled.
* added comment to PageCompound() in include/linux/page-flags.h.

2013/05/20:
* PageTail() is replaced by PageCompound() in order to have the same checks
for whether the page is huge in realmode_get_page() and realmode_put_page()

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
---
 arch/powerpc/include/asm/pgtable-ppc64.h |  4 ++
 arch/powerpc/mm/init_64.c| 78 +++-
 include/linux/page-flags.h   |  4 +-
 3 files changed, 84 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h 
b/arch/powerpc/include/asm/pgtable-ppc64.h
index e3d55f6f..7b46e5f 100644
--- a/arch/powerpc/include/asm/pgtable-ppc64.h
+++ b/arch/powerpc/include/asm/pgtable-ppc64.h
@@ -376,6 +376,10 @@ static inline pte_t *find_linux_pte_or_hugepte(pgd_t 
*pgdir, unsigned long ea,
 }
 #endif /* !CONFIG_HUGETLB_PAGE */
 
+struct page *realmode_pfn_to_page(unsigned long pfn);
+int realmode_get_page(struct page *page);
+int realmode_put_page(struct page *page);
+
 #endif /* __ASSEMBLY__ */
 
 #endif /* _ASM_POWERPC_PGTABLE_PPC64_H_ */
diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index a90b9c4..7031be3 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -297,5 +297,81 @@ void vmemmap_free(unsigned long start, unsigned long end)
 {
 }
 
-#endif /* CONFIG_SPARSEMEM_VMEMMAP */
+/*
+ * We do not have access to the sparsemem vmemmap, so we fallback to
+ * walking the list of sparsemem blocks which we already maintain for
+ * the sake of crashdump. In the long run, we might want to maintain
+ * a tree if performance of that linear walk becomes a problem.
+ *
+ * Any of realmode_ functions can fail due to:
+ * 1) As real sparsemem blocks do not lie in RAM contiguously (they
+ * are in virtual address space which is not available in the real mode),
+ * the requested page struct can be split between blocks so get_page/put_page
+ * may fail.
+ * 2) When huge pages are used, the get_page/put_page API will fail
+ * in real mode as the linked addresses in the page struct are virtual
+ * too.
+ * When 1) or 2) takes place, the API returns an error code to cause
+ * an exit to kernel virtual mode where the operation will be completed.
+ */
+struct page *realmode_pfn_to_page(unsigned long pfn)
+{
+   struct vmemmap_backing *vmem_back;
+   struct page *page;
+   unsigned long page_size = 1 << mmu_psize_defs[mmu_vmemmap_psize].shift;
+   unsigned long pg_va = (unsigned long) pfn_to_page(pfn);
 
+   for (vmem_back = vmemmap_list; vmem_back; vmem_back = vmem_back->list) {
+   if (pg_va < vmem_back->virt_addr)
+   continue;
+
+   /* Check that page struct is not split between real pages */
+   if ((pg_va + sizeof(struct page)) >
+   (vmem_back->virt_addr + page_size))
+   return NULL;
+
+   page = (struct page *) (vmem_back->phys + pg_va -
+   vmem_back->virt_addr);
+   return page;
+   }
+
+   return NULL;
+}
+EXPORT_SYMBOL_GPL(realmode_pfn_to_page);
+
+#elif defined(CONFIG_FLATMEM)
+
+struct page *realmode_pfn_to_page(unsigned long pfn)
+{
+   struct page *page = pfn_to_page(pfn);
+   return page;
+}
+EXPORT_SYMBOL_GPL(realmode_pfn_to_page);
+
+#endif /* CONFIG_SPARSEMEM_VMEMMAP/CONFIG_FLATMEM */
+
+#if defined(CONFIG_SPARSEMEM_VMEMMAP) || defined(CONFIG_FLATMEM)
+int realmode_get_page(struct page *page)
+{
+   if (PageCompound(page))
+   return -EAGAIN;
+
+   if (!get_page_unless_zero(page))
+   return -EAGAIN;
+
+   return 0;
+}
+EXPORT_SYMBOL_GPL(realmode_get_page);
+
+int realmode_put_page(struct page *page)
+{
+   if (PageCompound(page))
+   return -EAGAIN

[PATCH 6/8] KVM: PPC: Add support for multiple-TCE hcalls

2013-07-06 Thread Alexey Kardashevskiy
This adds real mode handlers for the H_PUT_TCE_INDIRECT and
H_STUFF_TCE hypercalls for QEMU emulated devices such as IBMVIO
devices or emulated PCI.  These calls allow adding multiple entries
(up to 512) into the TCE table in one call which saves time on
transition to/from real mode.

This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs
(copied from user and verified) before writing the whole list into
the TCE table. This cache will be utilized more in the upcoming
VFIO/IOMMU support to continue TCE list processing in the virtual
mode in the case if the real mode handler failed for some reason.

This adds a guest physical to host real address converter
and calls the existing H_PUT_TCE handler. The converting function
is going to be fully utilized by upcoming VFIO supporting patches.

This also implements the KVM_CAP_PPC_MULTITCE capability,
so in order to support the functionality of this patch, QEMU
needs to query for this capability and set the hcall-multi-tce
hypertas property only if the capability is present, otherwise
there will be serious performance degradation.

Signed-off-by: Paul Mackerras pau...@samba.org
Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru

---
Changelog:
2013/07/06:
* fixed a number of wrong get_page()/put_page() calls

2013/06/27:
* fixed clear of BUSY bit in kvmppc_lookup_pte()
* H_PUT_TCE_INDIRECT does realmode_get_page() now
* KVM_CAP_SPAPR_MULTITCE now depends on CONFIG_PPC_BOOK3S_64
* updated doc

2013/06/05:
* fixed typo about IBMVIO in the commit message
* updated doc and moved it to another section
* changed capability number

2013/05/21:
* added kvm_vcpu_arch::tce_tmp
* removed cleanup if put_indirect failed; instead we do not even start
writing to the TCE table if we cannot get TCEs from the user or they are
invalid
* kvmppc_emulated_h_put_tce is split to kvmppc_emulated_put_tce
and kvmppc_emulated_validate_tce (for the previous item)
* fixed bug with fallthrough for H_IPI
* removed all get_user() from real mode handlers
* kvmppc_lookup_pte() added (instead of making lookup_linux_pte public)

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
---
 Documentation/virtual/kvm/api.txt   |  25 +++
 arch/powerpc/include/asm/kvm_host.h |   9 ++
 arch/powerpc/include/asm/kvm_ppc.h  |  16 +-
 arch/powerpc/kvm/book3s_64_vio.c| 154 ++-
 arch/powerpc/kvm/book3s_64_vio_hv.c | 260 
 arch/powerpc/kvm/book3s_hv.c|  41 -
 arch/powerpc/kvm/book3s_hv_rmhandlers.S |   6 +
 arch/powerpc/kvm/book3s_pr_papr.c   |  37 -
 arch/powerpc/kvm/powerpc.c  |   3 +
 9 files changed, 517 insertions(+), 34 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt 
b/Documentation/virtual/kvm/api.txt
index 6365fef..762c703 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -2362,6 +2362,31 @@ calls by the guest for that service will be passed to 
userspace to be
 handled.
 
 
+4.86 KVM_CAP_PPC_MULTITCE
+
+Capability: KVM_CAP_PPC_MULTITCE
+Architectures: ppc
+Type: vm
+
+This capability means the kernel is capable of handling hypercalls
+H_PUT_TCE_INDIRECT and H_STUFF_TCE without passing those into the user
+space. This significantly accelerates DMA operations for PPC KVM guests.
+The user space should expect that its handlers for these hypercalls
+are not going to be called.
+
+In order to enable H_PUT_TCE_INDIRECT and H_STUFF_TCE use in the guest,
+the user space might have to advertise it for the guest. For example,
+IBM pSeries guest starts using them if hcall-multi-tce is present in
+the ibm,hypertas-functions device-tree property.
+
+Without this capability, only H_PUT_TCE is handled by the kernel and
+therefore the use of H_PUT_TCE_INDIRECT and H_STUFF_TCE is not recommended
+unless the capability is present as passing hypercalls to the userspace
+slows operations a lot.
+
+Unlike other capabilities of this section, this one is always enabled.
+
+
 5. The kvm_run structure
 
 
diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index af326cd..20d04bd 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -180,6 +180,7 @@ struct kvmppc_spapr_tce_table {
struct kvm *kvm;
u64 liobn;
u32 window_size;
+   struct { struct { unsigned long put, indir, stuff; } rm, vm; } stat;
struct page *pages[0];
 };
 
@@ -609,6 +610,14 @@ struct kvm_vcpu_arch {
spinlock_t tbacct_lock;
u64 busy_stolen;
u64 busy_preempt;
+
+   unsigned long *tce_tmp_hpas;/* TCE cache for TCE_PUT_INDIRECT hcall 
*/
+   enum {
+   TCERM_NONE,
+   TCERM_GETPAGE,
+   TCERM_PUTTCE,
+   TCERM_PUTLIST,
+   } tce_rm_fail;  /* failed stage of request processing */
 #endif
 };
 
diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
b/arch/powerpc/include/asm

[PATCH 3/8] vfio: add external user support

2013-07-06 Thread Alexey Kardashevskiy
VFIO is designed to be used via ioctls on file descriptors
returned by VFIO.

However in some situations support for an external user is required.
The first user is KVM on PPC64 (SPAPR TCE protocol) which is going to
use the existing VFIO groups for exclusive access in real/virtual mode
on the host to avoid passing map/unmap requests to user space, which
would make things pretty slow.

The proposed protocol includes:

1. do normal VFIO init stuff such as opening a new container, attaching
group(s) to it, setting an IOMMU driver for a container. When IOMMU is
set for a container, all groups in it are considered ready to use by
an external user.

2. pass a fd of the group we want to accelerate to KVM. KVM calls
vfio_group_get_external_user() to verify that the group is initialized
and an IOMMU is set for it, and to increment the container user counter
to prevent the VFIO group from being disposed of prior to KVM exit.
The current TCE IOMMU driver marks the whole IOMMU table as busy when
an IOMMU is set for a container, which prevents other DMA users from
allocating from it, so it is safe to grant user space access to it.

3. KVM calls vfio_external_user_iommu_id() to obtain an IOMMU ID which
KVM uses to get an iommu_group struct for later use.

4. When KVM is finished, it calls vfio_group_put_external_user() to
release the VFIO group by decrementing the container user counter.
Everything gets released.

The "vfio: Limit group opens" patch is also required for consistency.
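
(A minimal sketch of the consumer side of this protocol, using only the
three symbols exported below; the surrounding function and error codes
are illustrative, not part of the patch.)

	struct vfio_group *grp;
	int iommu_id;

	/* step 2: take a reference via the group fd passed from user space */
	grp = vfio_group_get_external_user(group_filp);
	if (!grp)
		return -EINVAL;

	/* step 3: resolve the IOMMU group ID for a later iommu_group lookup */
	iommu_id = vfio_external_user_iommu_id(grp);
	/* ... e.g. iommu_group_get_by_id(iommu_id) and TCE table setup ... */

	/* step 4: drop the reference when finished */
	vfio_group_put_external_user(grp);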

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
---
diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index c488da5..57aa191 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -1370,6 +1370,62 @@ static const struct file_operations vfio_device_fops = {
 };
 
 /**
+ * External user API, exported by symbols to be linked dynamically.
+ *
+ * The protocol includes:
+ *  1. do normal VFIO init operation:
+ * - opening a new container;
+ * - attaching group(s) to it;
+ * - setting an IOMMU driver for a container.
+ * When IOMMU is set for a container, all groups in it are
+ * considered ready to use by an external user.
+ *
+ * 2. The user space passed a group fd which we want to accelerate in
+ * KVM. KVM uses vfio_group_get_external_user() to verify that:
+ * - the group is initialized;
+ * - IOMMU is set for it.
+ * Then vfio_group_get_external_user() increments the container user
+ * counter to prevent the VFIO group from disposal prior to KVM exit.
+ *
+ * 3. KVM calls vfio_external_user_iommu_id() to know an IOMMU ID which
+ * KVM uses to get an iommu_group struct for later use.
+ *
+ * 4. When KVM is finished, it calls vfio_group_put_external_user() to
+ * release the VFIO group by decrementing the container user counter.
+ */
+struct vfio_group *vfio_group_get_external_user(struct file *filep)
+{
+	struct vfio_group *group = filep->private_data;
+
+	if (filep->f_op != &vfio_group_fops)
+		return NULL;
+
+	if (!atomic_inc_not_zero(&group->container_users))
+		return NULL;
+
+	if (!group->container->iommu_driver ||
+			!vfio_group_viable(group)) {
+		atomic_dec(&group->container_users);
+		return NULL;
+	}
+
+	return group;
+}
+EXPORT_SYMBOL_GPL(vfio_group_get_external_user);
+
+void vfio_group_put_external_user(struct vfio_group *group)
+{
+	vfio_group_try_dissolve_container(group);
+}
+EXPORT_SYMBOL_GPL(vfio_group_put_external_user);
+
+int vfio_external_user_iommu_id(struct vfio_group *group)
+{
+	return iommu_group_id(group->iommu_group);
+}
+EXPORT_SYMBOL_GPL(vfio_external_user_iommu_id);
+
+/**
  * Module/class support
  */
 static char *vfio_devnode(struct device *dev, umode_t *mode)
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index ac8d488..24579a0 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -90,4 +90,11 @@ extern void vfio_unregister_iommu_driver(
TYPE tmp;   \
offsetof(TYPE, MEMBER) + sizeof(tmp.MEMBER); }) \
 
+/*
+ * External user API
+ */
+extern struct vfio_group *vfio_group_get_external_user(struct file *filep);
+extern void vfio_group_put_external_user(struct vfio_group *group);
+extern int vfio_external_user_iommu_id(struct vfio_group *group);
+
 #endif /* VFIO_H */
-- 
1.8.3.2



[PATCH 2/8] KVM: PPC: reserve a capability and ioctl numbers for realmode VFIO

2013-07-06 Thread Alexey Kardashevskiy
This is to reserve a capability number for upcoming support
of VFIO-IOMMU DMA operations in real mode.

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
---
 include/uapi/linux/kvm.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 970b1f5..0865c01 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -667,6 +667,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_PPC_RTAS 91
 #define KVM_CAP_IRQ_XICS 92
 #define KVM_CAP_SPAPR_MULTITCE 93
+#define KVM_CAP_SPAPR_TCE_IOMMU 94
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -923,6 +924,7 @@ struct kvm_s390_ucas_mapping {
 /* Available with KVM_CAP_PPC_ALLOC_HTAB */
 #define KVM_PPC_ALLOCATE_HTAB_IOWR(KVMIO, 0xa7, __u32)
 #define KVM_CREATE_SPAPR_TCE _IOW(KVMIO,  0xa8, struct 
kvm_create_spapr_tce)
+#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO,  0xaf, struct 
kvm_create_spapr_tce_iommu)
 /* Available with KVM_CAP_RMA */
 #define KVM_ALLOCATE_RMA _IOR(KVMIO,  0xa9, struct kvm_allocate_rma)
 /* Available with KVM_CAP_PPC_HTAB_FD */
-- 
1.8.3.2



[PATCH 7/8] KVM: PPC: Add support for IOMMU in-kernel handling

2013-07-06 Thread Alexey Kardashevskiy
This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
and H_STUFF_TCE requests without passing them to QEMU, which saves time
on switching to QEMU and back.

Both real and virtual modes are supported. First the kernel tries to
handle a TCE request in real mode; if that fails, it passes the request
to virtual mode to complete the operation. If the virtual mode handler
also fails, the request is passed to user space.

This adds a new KVM_CREATE_SPAPR_TCE_IOMMU ioctl (advertised by the
KVM_CAP_SPAPR_TCE_IOMMU capability) to associate
a virtual PCI bus ID (LIOBN) with an IOMMU group, which enables
in-kernel handling of IOMMU map/unmap. The external user API support
in VFIO is required.

Tests show that this patch increases transmission speed from 220MB/s
to 750..1020MB/s on a 10Gb network (Chelsio CXGB3 10Gb Ethernet card).
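
(A hedged user space sketch of the new ioctl, with the struct layout
taken from this patch; vm_fd, liobn and iommu_id come from the usual
KVM and VFIO setup and are placeholders here.)

	struct kvm_create_spapr_tce_iommu args = {
		.liobn = liobn,		/* virtual PCI bus ID */
		.iommu_id = iommu_id,	/* from vfio_external_user_iommu_id() */
		.flags = 0,		/* no flags defined yet */
	};

	if (ioctl(vm_fd, KVM_CREATE_SPAPR_TCE_IOMMU, &args) < 0)
		perror("KVM_CREATE_SPAPR_TCE_IOMMU");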

Signed-off-by: Paul Mackerras pau...@samba.org
Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru

---

Changes:
2013/07/06:
* added realmode arch_spin_lock to protect TCE table from races
in real and virtual modes
* POWERPC IOMMU API is changed to support real mode
* iommu_take_ownership and iommu_release_ownership are protected by
iommu_table's locks
* VFIO external user API use rewritten
* multiple small fixes

2013/06/27:
* tce_list page is referenced now in order to protect it from accident
invalidation during H_PUT_TCE_INDIRECT execution
* added use of the external user VFIO API

2013/06/05:
* changed capability number
* changed ioctl number
* update the doc article number

2013/05/20:
* removed get_user() from real mode handlers
* kvm_vcpu_arch::tce_tmp usage extended. Now real mode handler puts there
translated TCEs, tries realmode_get_page() on those and if it fails, it
passes control over the virtual mode handler which tries to finish
the request handling
* kvmppc_lookup_pte() now does realmode_get_page() protected by BUSY bit
on a page
* The only reason to pass the request to user mode now is when the user mode
did not register TCE table in the kernel, in all other cases the virtual mode
handler is expected to do the job

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
---
 Documentation/virtual/kvm/api.txt   |  26 
 arch/powerpc/include/asm/iommu.h|   9 +-
 arch/powerpc/include/asm/kvm_host.h |   3 +
 arch/powerpc/include/asm/kvm_ppc.h  |   2 +
 arch/powerpc/include/uapi/asm/kvm.h |   7 +
 arch/powerpc/kernel/iommu.c | 196 +++
 arch/powerpc/kvm/book3s_64_vio.c| 299 +++-
 arch/powerpc/kvm/book3s_64_vio_hv.c | 129 
 arch/powerpc/kvm/powerpc.c  |  12 ++
 9 files changed, 609 insertions(+), 74 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt 
b/Documentation/virtual/kvm/api.txt
index 762c703..01b0dc2 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -2387,6 +2387,32 @@ slows operations a lot.
 Unlike other capabilities of this section, this one is always enabled.
 
 
+4.87 KVM_CREATE_SPAPR_TCE_IOMMU
+
+Capability: KVM_CAP_SPAPR_TCE_IOMMU
+Architectures: powerpc
+Type: vm ioctl
+Parameters: struct kvm_create_spapr_tce_iommu (in)
+Returns: 0 on success, -1 on error
+
+struct kvm_create_spapr_tce_iommu {
+   __u64 liobn;
+   __u32 iommu_id;
+   __u32 flags;
+};
+
+This creates a link between an IOMMU group and a hardware TCE (translation
+control entry) table. This link lets the host kernel know which IOMMU
+group (i.e. TCE table) to use for the LIOBN number passed with
+the H_PUT_TCE, H_PUT_TCE_INDIRECT and H_STUFF_TCE hypercalls.
+
+In response to a TCE hypercall, the kernel looks for a TCE table descriptor
+in the list and handles the hypercall in real or virtual mode if
+the descriptor is found. Otherwise the hypercall is passed to user space.
+
+No flags are supported at the moment.
+
+
 5. The kvm_run structure
 
 
diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 98d1422..0845505 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -78,6 +78,7 @@ struct iommu_table {
unsigned long *it_map;   /* A simple allocation bitmap for now */
 #ifdef CONFIG_IOMMU_API
struct iommu_group *it_group;
+   arch_spinlock_t it_rm_lock;
 #endif
 };
 
@@ -159,9 +160,9 @@ extern int iommu_tce_clear_param_check(struct iommu_table 
*tbl,
 extern int iommu_tce_put_param_check(struct iommu_table *tbl,
unsigned long ioba, unsigned long tce);
 extern int iommu_tce_build(struct iommu_table *tbl, unsigned long entry,
-   unsigned long hwaddr, enum dma_data_direction direction);
-extern unsigned long iommu_clear_tce(struct iommu_table *tbl,
-   unsigned long entry);
+   unsigned long *hpas, unsigned long npages, bool rm);
+extern int iommu_free_tces(struct iommu_table *tbl, unsigned long entry,
+   unsigned long npages, bool rm);
 extern int iommu_clear_tces_and_put_pages(struct iommu_table *tbl,
unsigned long

[PATCH 1/8] KVM: PPC: reserve a capability number for multitce support

2013-07-06 Thread Alexey Kardashevskiy
This is to reserve a capability number for upcoming support
of the H_PUT_TCE_INDIRECT and H_STUFF_TCE pseries hypercalls
which support multiple DMA map/unmap operations per call.

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
---
 include/uapi/linux/kvm.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index d88c8ee..970b1f5 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -666,6 +666,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_IRQ_MPIC 90
 #define KVM_CAP_PPC_RTAS 91
 #define KVM_CAP_IRQ_XICS 92
+#define KVM_CAP_SPAPR_MULTITCE 93
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
-- 
1.8.3.2



[PATCH 8/8] KVM: PPC: Add hugepage support for IOMMU in-kernel handling

2013-07-06 Thread Alexey Kardashevskiy
This adds special support for huge pages (16MB).  The reference
counting cannot be easily done for such pages in real mode (when
the MMU is off) so we added a list of huge pages.  It is populated in
virtual mode and get_page is called just once per huge page.
Real mode handlers check whether the requested page is huge and in the
list; if so, no reference counting is done, otherwise an exit to virtual
mode happens.  The list is released at KVM exit.  At the moment the
fastest card available for tests uses up to 9 huge pages so walking
through this list is not very expensive.  However this can change and
we may want to optimize this.
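
(A sketch of the real mode lookup described above, assuming the
hashtable this patch declares and the hash_for_each_possible_rcu_notrace()
helper the series depends on; ERROR_ADDR is a placeholder for the
"bail out to virtual mode" return value.)

static unsigned long kvmppc_rm_hugepage_gpa_to_hpa(
		struct kvmppc_spapr_tce_table *tt, unsigned long gpa)
{
	struct kvmppc_spapr_iommu_hugepage *hp;

	hash_for_each_possible_rcu_notrace(tt->hash_tab, hp, hash_node,
			KVMPPC_SPAPR_HUGEPAGE_HASH(gpa)) {
		if ((gpa < hp->gpa) || (gpa >= hp->gpa + hp->size))
			continue;
		/* found and already referenced: no refcounting in real mode */
		return hp->hpa + (gpa - hp->gpa);
	}

	return ERROR_ADDR;	/* not cached yet, retry in virtual mode */
}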

Signed-off-by: Paul Mackerras pau...@samba.org
Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru

---

Changes:
2013/06/27:
* list of huge pages replaced with a hashtable for better performance
* spinlock removed from real mode; it now only protects insertion of new
huge page descriptors into the hashtable

2013/06/05:
* fixed compile error when CONFIG_IOMMU_API=n

2013/05/20:
* the real mode handler now searches for a huge page by gpa (used to be pte)
* the virtual mode handler prints warning if it is called twice for the same
huge page as the real mode handler is expected to fail just once - when a huge
page is not in the list yet.
* the huge page is refcounted twice - when added to the hugepage list and
when used in the virtual mode hcall handler (can be optimized but it will
make the patch less nice).

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
---
 arch/powerpc/include/asm/kvm_host.h |  25 +
 arch/powerpc/kernel/iommu.c |   6 ++-
 arch/powerpc/kvm/book3s_64_vio.c| 104 +---
 arch/powerpc/kvm/book3s_64_vio_hv.c |  21 ++--
 4 files changed, 146 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index 53e61b2..a7508cf 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -30,6 +30,7 @@
 #include linux/kvm_para.h
 #include linux/list.h
 #include linux/atomic.h
+#include linux/hashtable.h
 #include asm/kvm_asm.h
 #include asm/processor.h
 #include asm/page.h
@@ -182,10 +183,34 @@ struct kvmppc_spapr_tce_table {
u32 window_size;
struct iommu_group *grp;/* used for IOMMU groups */
struct vfio_group *vfio_grp;/* used for IOMMU groups */
+   DECLARE_HASHTABLE(hash_tab, ilog2(64)); /* used for IOMMU groups */
+   spinlock_t hugepages_write_lock;/* used for IOMMU groups */
struct { struct { unsigned long put, indir, stuff; } rm, vm; } stat;
struct page *pages[0];
 };
 
+/*
+ * The KVM guest can be backed with 16MB pages.
+ * In this case, we cannot do page counting from the real mode
+ * as the compound pages are used - they are linked in a list
+ * with pointers as virtual addresses which are inaccessible
+ * in real mode.
+ *
+ * The code below keeps a 16MB pages list and uses page struct
+ * in real mode if it is already locked in RAM and inserted into
+ * the list or switches to the virtual mode where it can be
+ * handled in a usual manner.
+ */
+#define KVMPPC_SPAPR_HUGEPAGE_HASH(gpa)	hash_32(gpa >> 24, 32)
+
+struct kvmppc_spapr_iommu_hugepage {
+   struct hlist_node hash_node;
+   unsigned long gpa;  /* Guest physical address */
+   unsigned long hpa;  /* Host physical address */
+   struct page *page;  /* page struct of the very first subpage */
+   unsigned long size; /* Huge page size (always 16MB at the moment) */
+};
+
 struct kvmppc_linear_info {
void*base_virt;
unsigned longbase_pfn;
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 51678ec..e0b6eca 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -999,7 +999,8 @@ int iommu_free_tces(struct iommu_table *tbl, unsigned long 
entry,
if (!pg) {
ret = -EAGAIN;
} else if (PageCompound(pg)) {
-   ret = -EAGAIN;
+   /* Hugepages will be released at KVM exit */
+   ret = 0;
} else {
			if (oldtce & TCE_PCI_WRITE)
SetPageDirty(pg);
@@ -1009,6 +1010,9 @@ int iommu_free_tces(struct iommu_table *tbl, unsigned 
long entry,
		struct page *pg = pfn_to_page(oldtce >> PAGE_SHIFT);
if (!pg) {
ret = -EAGAIN;
+   } else if (PageCompound(pg)) {
+   /* Hugepages will be released at KVM exit */
+   ret = 0;
} else {
			if (oldtce & TCE_PCI_WRITE)
SetPageDirty(pg);
diff --git a/arch

[PATCH 5/8] powerpc: add real mode support for dma operations on powernv

2013-07-06 Thread Alexey Kardashevskiy
The existing TCE machine calls (tce_build and tce_free) only support
virtual mode as they call __raw_writeq for TCE invalidation, which
fails in real mode.

This introduces tce_build_rm and tce_free_rm real mode versions
which do mostly the same but use the Store Doubleword Caching Inhibited
Indexed (stdcix) instruction for TCE invalidation.

This new feature is going to be utilized by the real mode support of VFIO.

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
---
 arch/powerpc/include/asm/machdep.h| 12 ++
 arch/powerpc/platforms/powernv/pci-ioda.c | 26 +++--
 arch/powerpc/platforms/powernv/pci.c  | 38 ++-
 arch/powerpc/platforms/powernv/pci.h  |  2 +-
 4 files changed, 64 insertions(+), 14 deletions(-)

diff --git a/arch/powerpc/include/asm/machdep.h 
b/arch/powerpc/include/asm/machdep.h
index 92386fc..0c19eef 100644
--- a/arch/powerpc/include/asm/machdep.h
+++ b/arch/powerpc/include/asm/machdep.h
@@ -75,6 +75,18 @@ struct machdep_calls {
long index);
void(*tce_flush)(struct iommu_table *tbl);
 
+   /* _rm versions are for real mode use only */
+   int (*tce_build_rm)(struct iommu_table *tbl,
+long index,
+long npages,
+unsigned long uaddr,
+enum dma_data_direction direction,
+struct dma_attrs *attrs);
+   void(*tce_free_rm)(struct iommu_table *tbl,
+   long index,
+   long npages);
+   void(*tce_flush_rm)(struct iommu_table *tbl);
+
void __iomem *  (*ioremap)(phys_addr_t addr, unsigned long size,
   unsigned long flags, void *caller);
void(*iounmap)(volatile void __iomem *token);
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index 2931d97..2797dec 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -68,6 +68,12 @@ define_pe_printk_level(pe_err, KERN_ERR);
 define_pe_printk_level(pe_warn, KERN_WARNING);
 define_pe_printk_level(pe_info, KERN_INFO);
 
+static inline void rm_writed(unsigned long paddr, u64 val)
+{
+	__asm__ __volatile__("sync; stdcix %0,0,%1"
+		: : "r" (val), "r" (paddr) : "memory");
+}
+
 static int pnv_ioda_alloc_pe(struct pnv_phb *phb)
 {
unsigned long pe;
@@ -442,7 +448,7 @@ static void pnv_pci_ioda_dma_dev_setup(struct pnv_phb *phb, 
struct pci_dev *pdev
 }
 
 static void pnv_pci_ioda1_tce_invalidate(struct iommu_table *tbl,
-u64 *startp, u64 *endp)
+u64 *startp, u64 *endp, bool rm)
 {
	u64 __iomem *invalidate = (u64 __iomem *)tbl->it_index;
unsigned long start, end, inc;
@@ -471,7 +477,10 @@ static void pnv_pci_ioda1_tce_invalidate(struct 
iommu_table *tbl,
 
 mb(); /* Ensure above stores are visible */
	while (start <= end) {
-__raw_writeq(start, invalidate);
+   if (rm)
+   rm_writed((unsigned long) invalidate, start);
+   else
+   __raw_writeq(start, invalidate);
 start += inc;
 }
 
@@ -483,7 +492,7 @@ static void pnv_pci_ioda1_tce_invalidate(struct iommu_table 
*tbl,
 
 static void pnv_pci_ioda2_tce_invalidate(struct pnv_ioda_pe *pe,
 struct iommu_table *tbl,
-u64 *startp, u64 *endp)
+u64 *startp, u64 *endp, bool rm)
 {
unsigned long start, end, inc;
	u64 __iomem *invalidate = (u64 __iomem *)tbl->it_index;
@@ -502,22 +511,25 @@ static void pnv_pci_ioda2_tce_invalidate(struct 
pnv_ioda_pe *pe,
mb();
 
	while (start <= end) {
-   __raw_writeq(start, invalidate);
+   if (rm)
+   rm_writed((unsigned long) invalidate, start);
+   else
+   __raw_writeq(start, invalidate);
start += inc;
}
 }
 
 void pnv_pci_ioda_tce_invalidate(struct iommu_table *tbl,
-u64 *startp, u64 *endp)
+u64 *startp, u64 *endp, bool rm)
 {
struct pnv_ioda_pe *pe = container_of(tbl, struct pnv_ioda_pe,
  tce32_table);
	struct pnv_phb *phb = pe->phb;

	if (phb->type == PNV_PHB_IODA1)
-   pnv_pci_ioda1_tce_invalidate(tbl, startp, endp);
+   pnv_pci_ioda1_tce_invalidate(tbl, startp, endp, rm);
else
-   pnv_pci_ioda2_tce_invalidate(pe, tbl, startp, endp);
+   pnv_pci_ioda2_tce_invalidate(pe, tbl

Re: [PATCH 3/8] vfio: add external user support

2013-07-08 Thread Alexey Kardashevskiy
On 07/09/2013 07:52 AM, Alex Williamson wrote:
 On Sun, 2013-07-07 at 01:07 +1000, Alexey Kardashevskiy wrote:
 VFIO is designed to be used via ioctls on file descriptors
 returned by VFIO.

 However in some situations support for an external user is required.
 The first user is KVM on PPC64 (SPAPR TCE protocol) which is going to
 use the existing VFIO groups for exclusive access in real/virtual mode
 on a host to avoid passing map/unmap requests to the user space which
 would made things pretty slow.

 The proposed protocol includes:

 1. do normal VFIO init stuff such as opening a new container, attaching
 group(s) to it, setting an IOMMU driver for a container. When IOMMU is
 set for a container, all groups in it are considered ready to use by
 an external user.

 2. pass a fd of the group we want to accelerate to KVM. KVM calls
 vfio_group_get_external_user() to verify if the group is initialized,
 IOMMU is set for it and increment the container user counter to prevent
 the VFIO group from disposal prior to KVM exit.
 The current TCE IOMMU driver marks the whole IOMMU table as busy when
 IOMMU is set for a container what prevents other DMA users from
 allocating from it so it is safe to grant user space access to it.

 3. KVM calls vfio_external_user_iommu_id() to obtian an IOMMU ID which
 KVM uses to get an iommu_group struct for later use.

 4. When KVM is finished, it calls vfio_group_put_external_user() to
 release the VFIO group by decrementing the container user counter.
 Everything gets released.

 The vfio: Limit group opens patch is also required for the consistency.

 Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
 ---
 diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
 index c488da5..57aa191 100644
 --- a/drivers/vfio/vfio.c
 +++ b/drivers/vfio/vfio.c
 @@ -1370,6 +1370,62 @@ static const struct file_operations vfio_device_fops 
 = {
  };
  
  /**
 + * External user API, exported by symbols to be linked dynamically.
 + *
 + * The protocol includes:
 + *  1. do normal VFIO init operation:
 + *  - opening a new container;
 + *  - attaching group(s) to it;
 + *  - setting an IOMMU driver for a container.
 + * When IOMMU is set for a container, all groups in it are
 + * considered ready to use by an external user.
 + *
 + * 2. The user space passed a group fd which we want to accelerate in
 + * KVM. KVM uses vfio_group_get_external_user() to verify that:
 + *  - the group is initialized;
 + *  - IOMMU is set for it.
 + * Then vfio_group_get_external_user() increments the container user
 + * counter to prevent the VFIO group from disposal prior to KVM exit.
 + *
 + * 3. KVM calls vfio_external_user_iommu_id() to know an IOMMU ID which
 + * KVM uses to get an iommu_group struct for later use.
 + *
 + * 4. When KVM is finished, it calls vfio_group_put_external_user() to
 + * release the VFIO group by decrementing the container user counter.
 
 nit, the interface is for any external user, not just kvm.

s/KVM/An external user/ ?
Or add "the description below uses KVM just as an example of an external user"?


 + */
 +struct vfio_group *vfio_group_get_external_user(struct file *filep)
 +{
 +struct vfio_group *group = filep-private_data;
 +
 +if (filep-f_op != vfio_group_fops)
 +return NULL;
 
 ERR_PTR(-EINVAL)
 
 There also needs to be a vfio_group_get(group) here and put in error
 cases.


Is that because I do not hold a reference to the file anymore?


 +
 +if (!atomic_inc_not_zero(group-container_users))
 +return NULL;
 
 ERR_PTR(-EINVAL)
 
 +
 +if (!group-container-iommu_driver ||
 +!vfio_group_viable(group)) {
 +atomic_dec(group-container_users);
 +return NULL;
 
 ERR_PTR(-EINVAL)
 
 +}
 +
 +return group;
 +}
 +EXPORT_SYMBOL_GPL(vfio_group_get_external_user);
 +
 +void vfio_group_put_external_user(struct vfio_group *group)
 +{
 +vfio_group_try_dissolve_container(group);
 
 And a vfio_group_put(group) here
 
 +}
 +EXPORT_SYMBOL_GPL(vfio_group_put_external_user);
 +
 +int vfio_external_user_iommu_id(struct vfio_group *group)
 +{
 +return iommu_group_id(group-iommu_group);
 +}
 +EXPORT_SYMBOL_GPL(vfio_external_user_iommu_id);
 +
 +/**
   * Module/class support
   */
  static char *vfio_devnode(struct device *dev, umode_t *mode)
 diff --git a/include/linux/vfio.h b/include/linux/vfio.h
 index ac8d488..24579a0 100644
 --- a/include/linux/vfio.h
 +++ b/include/linux/vfio.h
 @@ -90,4 +90,11 @@ extern void vfio_unregister_iommu_driver(
  TYPE tmp;   \
  offsetof(TYPE, MEMBER) + sizeof(tmp.MEMBER); }) \
  
 +/*
 + * External user API
 + */
 +extern struct vfio_group *vfio_group_get_external_user(struct file *filep);
 +extern void vfio_group_put_external_user(struct vfio_group *group);
 +extern int vfio_external_user_iommu_id(struct vfio_group *group);
 +
  #endif /* VFIO_H */
 
 
 


-- 
Alexey

Re: [PATCH 8/8] KVM: PPC: Add hugepage support for IOMMU in-kernel handling

2013-07-09 Thread Alexey Kardashevskiy
On 07/10/2013 03:32 AM, Alexander Graf wrote:
 On 07/06/2013 05:07 PM, Alexey Kardashevskiy wrote:
 This adds special support for huge pages (16MB).  The reference
 counting cannot be easily done for such pages in real mode (when
 MMU is off) so we added a list of huge pages.  It is populated in
 virtual mode and get_page is called just once per a huge page.
 Real mode handlers check if the requested page is huge and in the list,
 then no reference counting is done, otherwise an exit to virtual mode
 happens.  The list is released at KVM exit.  At the moment the fastest
 card available for tests uses up to 9 huge pages so walking through this
 list is not very expensive.  However this can change and we may want
 to optimize this.

 Signed-off-by: Paul Mackerraspau...@samba.org
 Signed-off-by: Alexey Kardashevskiya...@ozlabs.ru

 ---

 Changes:
 2013/06/27:
 * list of huge pages replaces with hashtable for better performance
 
 So the only thing your patch description really talks about is not true
 anymore?
 
 * spinlock removed from real mode and only protects insertion of new
 huge [ages descriptors into the hashtable

 2013/06/05:
 * fixed compile error when CONFIG_IOMMU_API=n

 2013/05/20:
 * the real mode handler now searches for a huge page by gpa (used to be pte)
 * the virtual mode handler prints warning if it is called twice for the same
 huge page as the real mode handler is expected to fail just once - when a
 huge
 page is not in the list yet.
 * the huge page is refcounted twice - when added to the hugepage list and
 when used in the virtual mode hcall handler (can be optimized but it will
 make the patch less nice).

 Signed-off-by: Alexey Kardashevskiya...@ozlabs.ru
 ---
   arch/powerpc/include/asm/kvm_host.h |  25 +
   arch/powerpc/kernel/iommu.c |   6 ++-
   arch/powerpc/kvm/book3s_64_vio.c| 104
 +---
   arch/powerpc/kvm/book3s_64_vio_hv.c |  21 ++--
   4 files changed, 146 insertions(+), 10 deletions(-)

 diff --git a/arch/powerpc/include/asm/kvm_host.h
 b/arch/powerpc/include/asm/kvm_host.h
 index 53e61b2..a7508cf 100644
 --- a/arch/powerpc/include/asm/kvm_host.h
 +++ b/arch/powerpc/include/asm/kvm_host.h
 @@ -30,6 +30,7 @@
   #includelinux/kvm_para.h
   #includelinux/list.h
   #includelinux/atomic.h
 +#includelinux/hashtable.h
   #includeasm/kvm_asm.h
   #includeasm/processor.h
   #includeasm/page.h
 @@ -182,10 +183,34 @@ struct kvmppc_spapr_tce_table {
   u32 window_size;
   struct iommu_group *grp;/* used for IOMMU groups */
   struct vfio_group *vfio_grp;/* used for IOMMU groups */
 +DECLARE_HASHTABLE(hash_tab, ilog2(64));/* used for IOMMU groups */
 +spinlock_t hugepages_write_lock;/* used for IOMMU groups */
   struct { struct { unsigned long put, indir, stuff; } rm, vm; } stat;
   struct page *pages[0];
   };

 +/*
 + * The KVM guest can be backed with 16MB pages.
 + * In this case, we cannot do page counting from the real mode
 + * as the compound pages are used - they are linked in a list
 + * with pointers as virtual addresses which are inaccessible
 + * in real mode.
 + *
 + * The code below keeps a 16MB pages list and uses page struct
 + * in real mode if it is already locked in RAM and inserted into
 + * the list or switches to the virtual mode where it can be
 + * handled in a usual manner.
 + */
 +#define KVMPPC_SPAPR_HUGEPAGE_HASH(gpa)hash_32(gpa  24, 32)
 +
 +struct kvmppc_spapr_iommu_hugepage {
 +struct hlist_node hash_node;
 +unsigned long gpa;/* Guest physical address */
 +unsigned long hpa;/* Host physical address */
 +struct page *page;/* page struct of the very first subpage */
 +unsigned long size;/* Huge page size (always 16MB at the moment) */
 +};
 +
   struct kvmppc_linear_info {
   void*base_virt;
   unsigned long base_pfn;
 diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
 index 51678ec..e0b6eca 100644
 --- a/arch/powerpc/kernel/iommu.c
 +++ b/arch/powerpc/kernel/iommu.c
 @@ -999,7 +999,8 @@ int iommu_free_tces(struct iommu_table *tbl, unsigned
 long entry,
   if (!pg) {
   ret = -EAGAIN;
   } else if (PageCompound(pg)) {
 -ret = -EAGAIN;
 +/* Hugepages will be released at KVM exit */
 +ret = 0;
   } else {
   if (oldtce  TCE_PCI_WRITE)
   SetPageDirty(pg);
 @@ -1009,6 +1010,9 @@ int iommu_free_tces(struct iommu_table *tbl,
 unsigned long entry,
   struct page *pg = pfn_to_page(oldtce  PAGE_SHIFT);
   if (!pg) {
   ret = -EAGAIN;
 +} else if (PageCompound(pg)) {
 +/* Hugepages will be released at KVM exit */
 +ret = 0;
   } else {
   if (oldtce  TCE_PCI_WRITE)
   SetPageDirty(pg);
 diff --git a/arch/powerpc

Re: [PATCH 2/8] KVM: PPC: reserve a capability and ioctl numbers for realmode VFIO

2013-07-09 Thread Alexey Kardashevskiy
On 07/10/2013 01:35 AM, Alexander Graf wrote:
 On 06/27/2013 07:02 AM, Alexey Kardashevskiy wrote:
 Signed-off-by: Alexey Kardashevskiya...@ozlabs.ru
 ---
   include/uapi/linux/kvm.h |2 ++
   1 file changed, 2 insertions(+)

 diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
 index 970b1f5..0865c01 100644
 --- a/include/uapi/linux/kvm.h
 +++ b/include/uapi/linux/kvm.h
 @@ -667,6 +667,7 @@ struct kvm_ppc_smmu_info {
   #define KVM_CAP_PPC_RTAS 91
   #define KVM_CAP_IRQ_XICS 92
   #define KVM_CAP_SPAPR_MULTITCE 93
 +#define KVM_CAP_SPAPR_TCE_IOMMU 94

   #ifdef KVM_CAP_IRQ_ROUTING

 @@ -923,6 +924,7 @@ struct kvm_s390_ucas_mapping {
   /* Available with KVM_CAP_PPC_ALLOC_HTAB */
   #define KVM_PPC_ALLOCATE_HTAB  _IOWR(KVMIO, 0xa7, __u32)
   #define KVM_CREATE_SPAPR_TCE  _IOW(KVMIO,  0xa8, struct
 kvm_create_spapr_tce)
 +#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO,  0xaf, struct
 kvm_create_spapr_tce_iommu)
 
 Please order them by number.

Oh. Again :( We have had this discussion with Scott Wood here already.
Where _exactly_ do you want me to put it? Many sections, not really
ordered. Thank you.



-- 
Alexey


Re: [PATCH 6/8] KVM: PPC: Add support for multiple-TCE hcalls

2013-07-09 Thread Alexey Kardashevskiy
On 07/10/2013 03:02 AM, Alexander Graf wrote:
 On 07/06/2013 05:07 PM, Alexey Kardashevskiy wrote:
 This adds real mode handlers for the H_PUT_TCE_INDIRECT and
 H_STUFF_TCE hypercalls for QEMU emulated devices such as IBMVIO
 devices or emulated PCI.  These calls allow adding multiple entries
 (up to 512) into the TCE table in one call which saves time on
 transition to/from real mode.
 
 We don't mention QEMU explicitly in KVM code usually.
 
 This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs
 (copied from user and verified) before writing the whole list into
 the TCE table. This cache will be utilized more in the upcoming
 VFIO/IOMMU support to continue TCE list processing in the virtual
 mode in the case if the real mode handler failed for some reason.

 This adds a guest physical to host real address converter
 and calls the existing H_PUT_TCE handler. The converting function
 is going to be fully utilized by upcoming VFIO supporting patches.

 This also implements the KVM_CAP_PPC_MULTITCE capability,
 so in order to support the functionality of this patch, QEMU
 needs to query for this capability and set the hcall-multi-tce
 hypertas property only if the capability is present, otherwise
 there will be serious performance degradation.
 
 Same as above. But really you're only giving recommendations here. What's
 the point? Please describe what the benefit of this patch is, not what some
 other random subsystem might do with the benefits it brings.
 

 Signed-off-by: Paul Mackerraspau...@samba.org
 Signed-off-by: Alexey Kardashevskiya...@ozlabs.ru

 ---
 Changelog:
 2013/07/06:
 * fixed number of wrong get_page()/put_page() calls

 2013/06/27:
 * fixed clear of BUSY bit in kvmppc_lookup_pte()
 * H_PUT_TCE_INDIRECT does realmode_get_page() now
 * KVM_CAP_SPAPR_MULTITCE now depends on CONFIG_PPC_BOOK3S_64
 * updated doc

 2013/06/05:
 * fixed mistype about IBMVIO in the commit message
 * updated doc and moved it to another section
 * changed capability number

 2013/05/21:
 * added kvm_vcpu_arch::tce_tmp
 * removed cleanup if put_indirect failed, instead we do not even start
 writing to TCE table if we cannot get TCEs from the user and they are
 invalid
 * kvmppc_emulated_h_put_tce is split to kvmppc_emulated_put_tce
 and kvmppc_emulated_validate_tce (for the previous item)
 * fixed bug with failthrough for H_IPI
 * removed all get_user() from real mode handlers
 * kvmppc_lookup_pte() added (instead of making lookup_linux_pte public)

 Signed-off-by: Alexey Kardashevskiya...@ozlabs.ru
 ---
   Documentation/virtual/kvm/api.txt   |  25 +++
   arch/powerpc/include/asm/kvm_host.h |   9 ++
   arch/powerpc/include/asm/kvm_ppc.h  |  16 +-
   arch/powerpc/kvm/book3s_64_vio.c| 154 ++-
   arch/powerpc/kvm/book3s_64_vio_hv.c | 260
 
   arch/powerpc/kvm/book3s_hv.c|  41 -
   arch/powerpc/kvm/book3s_hv_rmhandlers.S |   6 +
   arch/powerpc/kvm/book3s_pr_papr.c   |  37 -
   arch/powerpc/kvm/powerpc.c  |   3 +
   9 files changed, 517 insertions(+), 34 deletions(-)

 diff --git a/Documentation/virtual/kvm/api.txt
 b/Documentation/virtual/kvm/api.txt
 index 6365fef..762c703 100644
 --- a/Documentation/virtual/kvm/api.txt
 +++ b/Documentation/virtual/kvm/api.txt
 @@ -2362,6 +2362,31 @@ calls by the guest for that service will be passed
 to userspace to be
   handled.


 +4.86 KVM_CAP_PPC_MULTITCE
 +
 +Capability: KVM_CAP_PPC_MULTITCE
 +Architectures: ppc
 +Type: vm
 +
 +This capability means the kernel is capable of handling hypercalls
 +H_PUT_TCE_INDIRECT and H_STUFF_TCE without passing those into the user
 +space. This significanly accelerates DMA operations for PPC KVM guests.
 
 significanly? Please run this through a spell checker.
 
 +The user space should expect that its handlers for these hypercalls
 
 s/The//
 
 +are not going to be called.
 
 Is user space guaranteed they will not be called? Or can it still happen?

... if user space previously registered LIOBN in KVM (via
KVM_CREATE_SPAPR_TCE or similar calls).

ok?

There is also KVM_CREATE_SPAPR_TCE_IOMMU but it is not in the kernel yet
and may never get there.


 +In order to enable H_PUT_TCE_INDIRECT and H_STUFF_TCE use in the guest,
 +the user space might have to advertise it for the guest. For example,
 +IBM pSeries guest starts using them if hcall-multi-tce is present in
 +the ibm,hypertas-functions device-tree property.
 
 This paragraph describes sPAPR. That's fine, but please document it as
 such. Also please check your grammar.

 +
 +Without this capability, only H_PUT_TCE is handled by the kernel and
 +therefore the use of H_PUT_TCE_INDIRECT and H_STUFF_TCE is not recommended
 +unless the capability is present as passing hypercalls to the userspace
 +slows operations a lot.
 +
 +Unlike other capabilities of this section, this one is always enabled.
 
 Why? Wouldn't that confuse older user space?


How? Old user space won't check

Re: [PATCH 2/8] KVM: PPC: reserve a capability and ioctl numbers for realmode VFIO

2013-07-10 Thread Alexey Kardashevskiy
On 07/10/2013 08:27 PM, Alexander Graf wrote:
 
 On 10.07.2013, at 01:35, Alexey Kardashevskiy wrote:
 
 On 07/10/2013 01:35 AM, Alexander Graf wrote:
 On 06/27/2013 07:02 AM, Alexey Kardashevskiy wrote:
 Signed-off-by: Alexey Kardashevskiya...@ozlabs.ru
 ---
  include/uapi/linux/kvm.h |2 ++
  1 file changed, 2 insertions(+)

 diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
 index 970b1f5..0865c01 100644
 --- a/include/uapi/linux/kvm.h
 +++ b/include/uapi/linux/kvm.h
 @@ -667,6 +667,7 @@ struct kvm_ppc_smmu_info {
  #define KVM_CAP_PPC_RTAS 91
  #define KVM_CAP_IRQ_XICS 92
  #define KVM_CAP_SPAPR_MULTITCE 93
 +#define KVM_CAP_SPAPR_TCE_IOMMU 94

  #ifdef KVM_CAP_IRQ_ROUTING

 @@ -923,6 +924,7 @@ struct kvm_s390_ucas_mapping {
  /* Available with KVM_CAP_PPC_ALLOC_HTAB */
  #define KVM_PPC_ALLOCATE_HTAB  _IOWR(KVMIO, 0xa7, __u32)
  #define KVM_CREATE_SPAPR_TCE  _IOW(KVMIO,  0xa8, struct
 kvm_create_spapr_tce)
 +#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO,  0xaf, struct
 kvm_create_spapr_tce_iommu)

 Please order them by number.

 Oh. Again :( We have had this discussion with Scott Wood here already.
 Where _exactly_ do you want me to put it?
 
 8 lines further down. With a comment saying when it's available. Also why is 
 it af, not ad?


0xad and 0xae are taken.
Where should I have commented this? In the commit message? Or in the patch
itself?


 Many sections, not really ordered. Thank you.
 
 They should all be ordered inside of their own categories.


-- 
Alexey


Re: [PATCH 6/8] KVM: PPC: Add support for multiple-TCE hcalls

2013-07-10 Thread Alexey Kardashevskiy
On 07/10/2013 08:05 PM, Alexander Graf wrote:
 
 On 10.07.2013, at 07:00, Alexey Kardashevskiy wrote:
 
 On 07/10/2013 03:02 AM, Alexander Graf wrote:
 On 07/06/2013 05:07 PM, Alexey Kardashevskiy wrote:
 This adds real mode handlers for the H_PUT_TCE_INDIRECT and
 H_STUFF_TCE hypercalls for QEMU emulated devices such as IBMVIO
 devices or emulated PCI.  These calls allow adding multiple entries
 (up to 512) into the TCE table in one call which saves time on
 transition to/from real mode.

 We don't mention QEMU explicitly in KVM code usually.

 This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs
 (copied from user and verified) before writing the whole list into
 the TCE table. This cache will be utilized more in the upcoming
 VFIO/IOMMU support to continue TCE list processing in the virtual
 mode in the case if the real mode handler failed for some reason.

 This adds a guest physical to host real address converter
 and calls the existing H_PUT_TCE handler. The converting function
 is going to be fully utilized by upcoming VFIO supporting patches.

 This also implements the KVM_CAP_PPC_MULTITCE capability,
 so in order to support the functionality of this patch, QEMU
 needs to query for this capability and set the hcall-multi-tce
 hypertas property only if the capability is present, otherwise
 there will be serious performance degradation.

 Same as above. But really you're only giving recommendations here. What's
 the point? Please describe what the benefit of this patch is, not what some
 other random subsystem might do with the benefits it brings.


 Signed-off-by: Paul Mackerraspau...@samba.org
 Signed-off-by: Alexey Kardashevskiya...@ozlabs.ru

 ---
 Changelog:
 2013/07/06:
 * fixed number of wrong get_page()/put_page() calls

 2013/06/27:
 * fixed clear of BUSY bit in kvmppc_lookup_pte()
 * H_PUT_TCE_INDIRECT does realmode_get_page() now
 * KVM_CAP_SPAPR_MULTITCE now depends on CONFIG_PPC_BOOK3S_64
 * updated doc

 2013/06/05:
 * fixed mistype about IBMVIO in the commit message
 * updated doc and moved it to another section
 * changed capability number

 2013/05/21:
 * added kvm_vcpu_arch::tce_tmp
 * removed cleanup if put_indirect failed, instead we do not even start
 writing to TCE table if we cannot get TCEs from the user and they are
 invalid
 * kvmppc_emulated_h_put_tce is split to kvmppc_emulated_put_tce
 and kvmppc_emulated_validate_tce (for the previous item)
 * fixed bug with failthrough for H_IPI
 * removed all get_user() from real mode handlers
 * kvmppc_lookup_pte() added (instead of making lookup_linux_pte public)

 Signed-off-by: Alexey Kardashevskiya...@ozlabs.ru
 ---
  Documentation/virtual/kvm/api.txt   |  25 +++
  arch/powerpc/include/asm/kvm_host.h |   9 ++
  arch/powerpc/include/asm/kvm_ppc.h  |  16 +-
  arch/powerpc/kvm/book3s_64_vio.c| 154 ++-
  arch/powerpc/kvm/book3s_64_vio_hv.c | 260
 
  arch/powerpc/kvm/book3s_hv.c|  41 -
  arch/powerpc/kvm/book3s_hv_rmhandlers.S |   6 +
  arch/powerpc/kvm/book3s_pr_papr.c   |  37 -
  arch/powerpc/kvm/powerpc.c  |   3 +
  9 files changed, 517 insertions(+), 34 deletions(-)

 diff --git a/Documentation/virtual/kvm/api.txt
 b/Documentation/virtual/kvm/api.txt
 index 6365fef..762c703 100644
 --- a/Documentation/virtual/kvm/api.txt
 +++ b/Documentation/virtual/kvm/api.txt
 @@ -2362,6 +2362,31 @@ calls by the guest for that service will be passed
 to userspace to be
  handled.


 +4.86 KVM_CAP_PPC_MULTITCE
 +
 +Capability: KVM_CAP_PPC_MULTITCE
 +Architectures: ppc
 +Type: vm
 +
 +This capability means the kernel is capable of handling hypercalls
 +H_PUT_TCE_INDIRECT and H_STUFF_TCE without passing those into the user
 +space. This significanly accelerates DMA operations for PPC KVM guests.

 significanly? Please run this through a spell checker.

 +The user space should expect that its handlers for these hypercalls

 s/The//

 +are not going to be called.

 Is user space guaranteed they will not be called? Or can it still happen?

 ... if user space previously registered LIOBN in KVM (via
 KVM_CREATE_SPAPR_TCE or similar calls).

 ok?
 
 How about this?
 
 The hypercalls mentioned above may or may not be processed successfully in 
 the kernel based fast path. If they can not be handled by the kernel, they 
 will get passed on to user space. So user space still has to have an 
 implementation for these despite the in kernel acceleration.
 
 ---
 
 The target audience for this documentation is user space KVM API users. 
 Someone developing kvm tool for example. They want to know implications 
 specific CAPs have.
 

 There is also KVM_CREATE_SPAPR_TCE_IOMMU but it is not in the kernel yet
 and may never get there.


 +In order to enable H_PUT_TCE_INDIRECT and H_STUFF_TCE use in the guest,
 +the user space might have to advertise it for the guest. For example,
 +IBM pSeries guest starts using them if hcall-multi-tce is present

Re: [PATCH 8/8] KVM: PPC: Add hugepage support for IOMMU in-kernel handling

2013-07-11 Thread Alexey Kardashevskiy
On 07/10/2013 03:32 AM, Alexander Graf wrote:
 On 07/06/2013 05:07 PM, Alexey Kardashevskiy wrote:
 This adds special support for huge pages (16MB).  The reference
 counting cannot be easily done for such pages in real mode (when
 MMU is off) so we added a list of huge pages.  It is populated in
 virtual mode and get_page is called just once per a huge page.
 Real mode handlers check if the requested page is huge and in the list,
 then no reference counting is done, otherwise an exit to virtual mode
 happens.  The list is released at KVM exit.  At the moment the fastest
 card available for tests uses up to 9 huge pages so walking through this
 list is not very expensive.  However this can change and we may want
 to optimize this.

 Signed-off-by: Paul Mackerraspau...@samba.org
 Signed-off-by: Alexey Kardashevskiya...@ozlabs.ru

 ---

 Changes:
 2013/06/27:
 * list of huge pages replaces with hashtable for better performance
 
 So the only thing your patch description really talks about is not true
 anymore?
 
 * spinlock removed from real mode and only protects insertion of new
 huge [ages descriptors into the hashtable

 2013/06/05:
 * fixed compile error when CONFIG_IOMMU_API=n

 2013/05/20:
 * the real mode handler now searches for a huge page by gpa (used to be pte)
 * the virtual mode handler prints warning if it is called twice for the same
 huge page as the real mode handler is expected to fail just once - when a
 huge
 page is not in the list yet.
 * the huge page is refcounted twice - when added to the hugepage list and
 when used in the virtual mode hcall handler (can be optimized but it will
 make the patch less nice).

 Signed-off-by: Alexey Kardashevskiya...@ozlabs.ru
 ---
   arch/powerpc/include/asm/kvm_host.h |  25 +
   arch/powerpc/kernel/iommu.c |   6 ++-
   arch/powerpc/kvm/book3s_64_vio.c| 104
 +---
   arch/powerpc/kvm/book3s_64_vio_hv.c |  21 ++--
   4 files changed, 146 insertions(+), 10 deletions(-)

 diff --git a/arch/powerpc/include/asm/kvm_host.h
 b/arch/powerpc/include/asm/kvm_host.h
 index 53e61b2..a7508cf 100644
 --- a/arch/powerpc/include/asm/kvm_host.h
 +++ b/arch/powerpc/include/asm/kvm_host.h
 @@ -30,6 +30,7 @@
   #includelinux/kvm_para.h
   #includelinux/list.h
   #includelinux/atomic.h
 +#includelinux/hashtable.h
   #includeasm/kvm_asm.h
   #includeasm/processor.h
   #includeasm/page.h
 @@ -182,10 +183,34 @@ struct kvmppc_spapr_tce_table {
   u32 window_size;
   struct iommu_group *grp;/* used for IOMMU groups */
   struct vfio_group *vfio_grp;/* used for IOMMU groups */
 +DECLARE_HASHTABLE(hash_tab, ilog2(64));/* used for IOMMU groups */
 +spinlock_t hugepages_write_lock;/* used for IOMMU groups */
   struct { struct { unsigned long put, indir, stuff; } rm, vm; } stat;
   struct page *pages[0];
   };

 +/*
 + * The KVM guest can be backed with 16MB pages.
 + * In this case, we cannot do page counting from the real mode
 + * as the compound pages are used - they are linked in a list
 + * with pointers as virtual addresses which are inaccessible
 + * in real mode.
 + *
 + * The code below keeps a 16MB pages list and uses page struct
 + * in real mode if it is already locked in RAM and inserted into
 + * the list or switches to the virtual mode where it can be
 + * handled in a usual manner.
 + */
 +#define KVMPPC_SPAPR_HUGEPAGE_HASH(gpa)hash_32(gpa  24, 32)
 +
 +struct kvmppc_spapr_iommu_hugepage {
 +struct hlist_node hash_node;
 +unsigned long gpa;/* Guest physical address */
 +unsigned long hpa;/* Host physical address */
 +struct page *page;/* page struct of the very first subpage */
 +unsigned long size;/* Huge page size (always 16MB at the moment) */
 +};
 +
   struct kvmppc_linear_info {
   void*base_virt;
   unsigned long base_pfn;
 diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
 index 51678ec..e0b6eca 100644
 --- a/arch/powerpc/kernel/iommu.c
 +++ b/arch/powerpc/kernel/iommu.c
 @@ -999,7 +999,8 @@ int iommu_free_tces(struct iommu_table *tbl, unsigned
 long entry,
   if (!pg) {
   ret = -EAGAIN;
   } else if (PageCompound(pg)) {
 -ret = -EAGAIN;
 +/* Hugepages will be released at KVM exit */
 +ret = 0;
   } else {
   if (oldtce  TCE_PCI_WRITE)
   SetPageDirty(pg);
 @@ -1009,6 +1010,9 @@ int iommu_free_tces(struct iommu_table *tbl,
 unsigned long entry,
   struct page *pg = pfn_to_page(oldtce  PAGE_SHIFT);
   if (!pg) {
   ret = -EAGAIN;
 +} else if (PageCompound(pg)) {
 +/* Hugepages will be released at KVM exit */
 +ret = 0;
   } else {
   if (oldtce  TCE_PCI_WRITE)
   SetPageDirty(pg);
 diff --git a/arch/powerpc

Re: [PATCH 6/8] KVM: PPC: Add support for multiple-TCE hcalls

2013-07-11 Thread Alexey Kardashevskiy
On 07/11/2013 10:51 PM, Alexander Graf wrote:
 
 On 11.07.2013, at 14:39, Benjamin Herrenschmidt wrote:
 
 On Thu, 2013-07-11 at 13:15 +0200, Alexander Graf wrote:
 And that's bad. Jeez, seriously. Don't argue this case. We enable new
 features individually unless we're 100% sure we can keep everything
 working. In this case an ENABLE_CAP doesn't hurt at all, because user
 space still needs to handle the hypercalls if it wants them anyways.
 But you get debugging for free for example.

 An ENABLE_CAP is utterly pointless. More bloat. But you seem to like
 it :-)
 

 I don't like bloat usually. But Alexey even had an #ifdef DEBUG in there
 to selectively disable in-kernel handling of multi-TCE. Not calling
 ENABLE_CAP would give him exactly that without ugly #ifdefs in the
 kernel.


No, it would not give me anything. My ugly debug was to disable real mode
only and still leave virtual mode on, not to disable both real and virtual
modes. It is a lot easier to disable in-kernel handling in QEMU.



-- 
Alexey


Re: [PATCH 6/8] KVM: PPC: Add support for multiple-TCE hcalls

2013-07-11 Thread Alexey Kardashevskiy
On 07/11/2013 10:58 PM, Benjamin Herrenschmidt wrote:
 On Thu, 2013-07-11 at 14:51 +0200, Alexander Graf wrote:
 I don't like bloat usually. But Alexey even had an #ifdef DEBUG in
 there to selectively disable in-kernel handling of multi-TCE. Not
 calling ENABLE_CAP would give him exactly that without ugly #ifdefs in
 the kernel.
 
 I don't see much point in disabling it... but ok, if that's a valuable
 feature, then shoot some VM level ENABLE_CAP (please don't iterate all
 VCPUs, that's gross).

No use for me whatsoever as I only want to disable the real mode handlers and
keep the virtual mode handlers enabled (sometimes, for debug only), and this
capability is not about that - I can easily just not enable it in QEMU with
exactly the same effect.

So please, fellas, decide whether I should iterate vcpu's or add ENABLE_CAP
per KVM. Thanks.


-- 
Alexey


Re: [PATCH 8/8] KVM: PPC: Add hugepage support for IOMMU in-kernel handling

2013-07-11 Thread Alexey Kardashevskiy
On 07/11/2013 11:41 PM, chandrashekar shastri wrote:
 Hi All,
 
 I compiled the latest kernel 3.10.0+ pulled from git on top of
 3.10.0-rc5+ by enabling the new Virtualization features. The compilation
 was successful, but when I rebooted the machine it failed to boot with the
 error: systemd [1]: Failed to mount /dev: no such device.
 
 Is it problem with the KVM module?


Wrong thread actually, it would be better if you started a new one.

And you may want to try this - http://patchwork.ozlabs.org/patch/256027/


-- 
Alexey


[PATCH 08/10] powerpc/iommu: rework to support realmode

2013-07-15 Thread Alexey Kardashevskiy
The TCE table handling may differ between real and virtual modes, so
additional ppc_md.tce_build_rm/ppc_md.tce_free_rm/ppc_md.tce_flush_rm
handlers were introduced earlier.

So this adds the following:
1. support for the new ppc_md calls;
2. the ability of iommu_tce_build to process multiple entries per
call;
3. an arch_spin_lock to protect the TCE table from races in both real and
virtual modes (see the sketch after this list);
4. proper TCE table protection from races with the existing IOMMU code
in iommu_take_ownership/iommu_release_ownership;
5. the hwaddr variable renamed to hpa as it better describes what it
actually represents;
6. iommu_tce_direction made static as it is not called from anywhere else.

This will be used by the upcoming real mode support of VFIO on POWER.
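
(A hedged sketch of how points 1 and 3 combine; the ppc_md hook
signatures come from the machdep patch earlier in the series, while
tce_update_one() itself is a hypothetical caller.)

static int tce_update_one(struct iommu_table *tbl, long entry,
		unsigned long uaddr, bool rm)
{
	int ret;

	/* a raw lock has no lockdep/irq tracing, so it works with the MMU off */
	arch_spin_lock(&tbl->it_rm_lock);

	if (rm)
		ret = ppc_md.tce_build_rm(tbl, entry, 1, uaddr,
				DMA_BIDIRECTIONAL, NULL);
	else
		ret = ppc_md.tce_build(tbl, entry, 1, uaddr,
				DMA_BIDIRECTIONAL, NULL);

	arch_spin_unlock(&tbl->it_rm_lock);
	return ret;
}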

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
---
 arch/powerpc/include/asm/iommu.h |   9 +-
 arch/powerpc/kernel/iommu.c  | 197 ++-
 2 files changed, 135 insertions(+), 71 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index c34656a..b01bde1 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -78,6 +78,7 @@ struct iommu_table {
unsigned long *it_map;   /* A simple allocation bitmap for now */
 #ifdef CONFIG_IOMMU_API
struct iommu_group *it_group;
+   arch_spinlock_t it_rm_lock;
 #endif
 };
 
@@ -152,9 +153,9 @@ extern int iommu_tce_clear_param_check(struct iommu_table 
*tbl,
 extern int iommu_tce_put_param_check(struct iommu_table *tbl,
unsigned long ioba, unsigned long tce);
 extern int iommu_tce_build(struct iommu_table *tbl, unsigned long entry,
-   unsigned long hwaddr, enum dma_data_direction direction);
-extern unsigned long iommu_clear_tce(struct iommu_table *tbl,
-   unsigned long entry);
+   unsigned long *hpas, unsigned long npages, bool rm);
+extern int iommu_free_tces(struct iommu_table *tbl, unsigned long entry,
+   unsigned long npages, bool rm);
 extern int iommu_clear_tces_and_put_pages(struct iommu_table *tbl,
unsigned long entry, unsigned long pages);
 extern int iommu_put_tce_user_mode(struct iommu_table *tbl,
@@ -164,7 +165,5 @@ extern void iommu_flush_tce(struct iommu_table *tbl);
 extern int iommu_take_ownership(struct iommu_table *tbl);
 extern void iommu_release_ownership(struct iommu_table *tbl);
 
-extern enum dma_data_direction iommu_tce_direction(unsigned long tce);
-
 #endif /* __KERNEL__ */
 #endif /* _ASM_IOMMU_H */
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index b20ff17..0f56cac 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -903,7 +903,7 @@ void iommu_register_group(struct iommu_table *tbl,
kfree(name);
 }
 
-enum dma_data_direction iommu_tce_direction(unsigned long tce)
+static enum dma_data_direction iommu_tce_direction(unsigned long tce)
 {
	if ((tce & TCE_PCI_READ) && (tce & TCE_PCI_WRITE))
return DMA_BIDIRECTIONAL;
@@ -914,7 +914,6 @@ enum dma_data_direction iommu_tce_direction(unsigned long 
tce)
else
return DMA_NONE;
 }
-EXPORT_SYMBOL_GPL(iommu_tce_direction);
 
 void iommu_flush_tce(struct iommu_table *tbl)
 {
@@ -972,73 +971,116 @@ int iommu_tce_put_param_check(struct iommu_table *tbl,
 }
 EXPORT_SYMBOL_GPL(iommu_tce_put_param_check);
 
-unsigned long iommu_clear_tce(struct iommu_table *tbl, unsigned long entry)
-{
-	unsigned long oldtce;
-	struct iommu_pool *pool = get_pool(tbl, entry);
-
-	spin_lock(&(pool->lock));
-
-	oldtce = ppc_md.tce_get(tbl, entry);
-	if (oldtce & (TCE_PCI_WRITE | TCE_PCI_READ))
-		ppc_md.tce_free(tbl, entry, 1);
-	else
-		oldtce = 0;
-
-	spin_unlock(&(pool->lock));
-
-	return oldtce;
-}
-EXPORT_SYMBOL_GPL(iommu_clear_tce);
-
 int iommu_clear_tces_and_put_pages(struct iommu_table *tbl,
unsigned long entry, unsigned long pages)
 {
-   unsigned long oldtce;
-   struct page *page;
-
-   for ( ; pages; --pages, ++entry) {
-   oldtce = iommu_clear_tce(tbl, entry);
-   if (!oldtce)
-   continue;
-
-		page = pfn_to_page(oldtce >> PAGE_SHIFT);
-   WARN_ON(!page);
-   if (page) {
-			if (oldtce & TCE_PCI_WRITE)
-   SetPageDirty(page);
-   put_page(page);
-   }
-   }
-
-   return 0;
+   return iommu_free_tces(tbl, entry, pages, false);
 }
 EXPORT_SYMBOL_GPL(iommu_clear_tces_and_put_pages);
 
-/*
- * hwaddr is a kernel virtual address here (0xc... bazillion),
- * tce_build converts it to a physical address.
- */
+int iommu_free_tces(struct iommu_table *tbl, unsigned long entry,
+   unsigned long npages, bool rm)
+{
+   int i, ret = 0, to_free = 0;
+
+	if (rm && !ppc_md.tce_free_rm)
+   return

Re: [PATCH 00/10 v6] KVM: PPC: IOMMU in-kernel handling

2013-07-18 Thread Alexey Kardashevskiy
On 07/16/2013 10:53 AM, Alexey Kardashevskiy wrote:
 The changes are:
 1. rebased on v3.11-rc1 so the capability numbers changed again
 2. fixed multiple comments from maintainers
 3. KVM: PPC: Add support for IOMMU in-kernel handling is split into
 2 patches, the new one is powerpc/iommu: rework to support realmode.
 4. IOMMU_API is now always enabled for KVM_BOOK3S_64.
 
 More details in the individual patch comments.
 
 Depends on "hashtable: add hash_for_each_possible_rcu_notrace()",
 posted a while ago.
 
 
 Alexey Kardashevskiy (10):
   KVM: PPC: reserve a capability number for multitce support
   KVM: PPC: reserve a capability and ioctl numbers for realmode VFIO



Alex, could you please pull these 2 patches or tell me what is wrong with
them? Having them sooner in the kernel would let me ask for a headers
update for QEMU, and then I would try pushing multiple TCE and VFIO support
in QEMU. Thanks.




-- 
Alexey


Re: [PATCH 04/10] powerpc: Prepare to support kernel handling of IOMMU map/unmap

2013-07-22 Thread Alexey Kardashevskiy
Ping, anyone, please?

Ben needs an ack from any of the MM people before proceeding with this patch. Thanks!


On 07/16/2013 10:53 AM, Alexey Kardashevskiy wrote:
 The current VFIO-on-POWER implementation supports only user mode
 driven mapping, i.e. QEMU is sending requests to map/unmap pages.
 However this approach is really slow, so we want to move that to KVM.
 Since H_PUT_TCE can be extremely performance sensitive (especially with
 network adapters where each packet needs to be mapped/unmapped) we chose
 to implement that as a fast hypercall directly in real
 mode (processor still in the guest context but MMU off).
 
 To be able to do that, we need to provide some facilities to
 access the struct page count within that real mode environment as things
 like the sparsemem vmemmap mappings aren't accessible.
 
 This adds an API to increment/decrement the page counter as the
 get_user_pages API used for user mode mapping does not work
 in real mode.
 
 CONFIG_SPARSEMEM_VMEMMAP and CONFIG_FLATMEM are supported.
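 
 For illustration only (this helper is hypothetical, not part of the
 patch), a real mode caller that already knows the host physical
 address would use the API roughly like this:
 
 	static long example_rm_ref_page(unsigned long hpa)
 	{
 		struct page *pg = realmode_pfn_to_page(hpa >> PAGE_SHIFT);
 
 		/* NULL means the page struct is unreachable in real mode */
 		if (!pg || realmode_get_page(pg))
 			return H_TOO_HARD;	/* retry in virtual mode */
 
 		/* ... use the page, then drop it with realmode_put_page(pg) ... */
 		return H_SUCCESS;
 	}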
 
 Cc: linux...@kvack.org
 Reviewed-by: Paul Mackerras pau...@samba.org
 Signed-off-by: Paul Mackerras pau...@samba.org
 Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
 
 ---
 
 Changes:
 2013/07/10:
 * adjusted comment (removed sentence about virtual mode)
 * get_page_unless_zero replaced with atomic_inc_not_zero to minimize
 effect of a possible get_page_unless_zero() rework (if it ever happens).
 
 2013/06/27:
 * realmode_get_page() fixed to use get_page_unless_zero(). If that fails,
 the call will be passed from real to virtual mode and safely handled.
 * added comment to PageCompound() in include/linux/page-flags.h.
 
 2013/05/20:
 * PageTail() is replaced by PageCompound() in order to have the same checks
 for whether the page is huge in realmode_get_page() and realmode_put_page()
 
 Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
 ---
  arch/powerpc/include/asm/pgtable-ppc64.h |  4 ++
  arch/powerpc/mm/init_64.c| 76 +++-
  include/linux/page-flags.h   |  4 +-
  3 files changed, 82 insertions(+), 2 deletions(-)
 
 diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h 
 b/arch/powerpc/include/asm/pgtable-ppc64.h
 index 46db094..aa7b169 100644
 --- a/arch/powerpc/include/asm/pgtable-ppc64.h
 +++ b/arch/powerpc/include/asm/pgtable-ppc64.h
 @@ -394,6 +394,10 @@ static inline void mark_hpte_slot_valid(unsigned char *hpte_slot_array,
   hpte_slot_array[index] = hidx << 4 | 0x1 << 3;
  }
  
 +struct page *realmode_pfn_to_page(unsigned long pfn);
 +int realmode_get_page(struct page *page);
 +int realmode_put_page(struct page *page);
 +
  static inline char *get_hpte_slot_array(pmd_t *pmdp)
  {
   /*
 diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
 index d0cd9e4..dcbb806 100644
 --- a/arch/powerpc/mm/init_64.c
 +++ b/arch/powerpc/mm/init_64.c
 @@ -300,5 +300,79 @@ void vmemmap_free(unsigned long start, unsigned long end)
  {
  }
  
 -#endif /* CONFIG_SPARSEMEM_VMEMMAP */
 +/*
 + * We do not have access to the sparsemem vmemmap, so we fall back to
 + * walking the list of sparsemem blocks which we already maintain for
 + * the sake of crashdump. In the long run, we might want to maintain
 + * a tree if performance of that linear walk becomes a problem.
 + *
 + * Any of realmode_ functions can fail due to:
 + * 1) As real sparsemem blocks are not laid out contiguously in RAM (they
 + * are in a virtual address space which is not available in real mode),
 + * the requested page struct can be split between blocks, so get_page/put_page
 + * may fail.
 + * 2) When huge pages are used, the get_page/put_page API will fail
 + * in real mode as the linked addresses in the page struct are virtual
 + * too.
 + */
 +struct page *realmode_pfn_to_page(unsigned long pfn)
 +{
 + struct vmemmap_backing *vmem_back;
 + struct page *page;
 + unsigned long page_size = 1 << mmu_psize_defs[mmu_vmemmap_psize].shift;
 + unsigned long pg_va = (unsigned long) pfn_to_page(pfn);
  
 + for (vmem_back = vmemmap_list; vmem_back; vmem_back = vmem_back->list) {
 + if (pg_va < vmem_back->virt_addr)
 + continue;
 +
 + /* Check that page struct is not split between real pages */
 + if ((pg_va + sizeof(struct page)) >
 + (vmem_back->virt_addr + page_size))
 + return NULL;
 +
 + page = (struct page *) (vmem_back->phys + pg_va -
 + vmem_back->virt_addr);
 + return page;
 + }
 +
 + return NULL;
 +}
 +EXPORT_SYMBOL_GPL(realmode_pfn_to_page);
 +
 +#elif defined(CONFIG_FLATMEM)
 +
 +struct page *realmode_pfn_to_page(unsigned long pfn)
 +{
 + struct page *page = pfn_to_page(pfn);
 + return page;
 +}
 +EXPORT_SYMBOL_GPL(realmode_pfn_to_page);
 +
 +#endif /* CONFIG_SPARSEMEM_VMEMMAP/CONFIG_FLATMEM */
 +
 +#if defined(CONFIG_SPARSEMEM_VMEMMAP) || defined(CONFIG_FLATMEM)
 +int realmode_get_page

Re: [PATCH 03/10] vfio: add external user support

2013-07-23 Thread Alexey Kardashevskiy
On 07/23/2013 12:23 PM, Alex Williamson wrote:
 On Tue, 2013-07-16 at 10:53 +1000, Alexey Kardashevskiy wrote:
 VFIO is designed to be used via ioctls on file descriptors
 returned by VFIO.

 However in some situations support for an external user is required.
 The first user is KVM on PPC64 (SPAPR TCE protocol) which is going to
 use the existing VFIO groups for exclusive access in real/virtual mode
 on a host to avoid passing map/unmap requests to the user space which
 would made things pretty slow.

 The protocol includes:

 1. do normal VFIO init operation:
  - opening a new container;
  - attaching group(s) to it;
  - setting an IOMMU driver for a container.
 When IOMMU is set for a container, all groups in it are
 considered ready to use by an external user.

 2. User space passes a group fd to an external user.
 The external user calls vfio_group_get_external_user()
 to verify that:
  - the group is initialized;
  - IOMMU is set for it.
 If both checks pass, vfio_group_get_external_user()
 increments the container user counter to prevent
 the VFIO group from being disposed of before KVM exits.

 3. The external user calls vfio_external_user_iommu_id()
 to know an IOMMU ID. PPC64 KVM uses it to link logical bus
 number (LIOBN) with IOMMU ID.

 4. When the external KVM finishes, it calls
 vfio_group_put_external_user() to release the VFIO group.
 This call decrements the container user counter.
 Everything gets released.

 The "vfio: Limit group opens" patch is also required for consistency.
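 
 A minimal sketch of the consumer (KVM) side, using only the three
 calls named above; fd lookup and error handling trimmed:
 
 	struct vfio_group *grp = vfio_group_get_external_user(filep);
 	int iommu_id;
 
 	if (IS_ERR(grp))
 		return PTR_ERR(grp);
 	iommu_id = vfio_external_user_iommu_id(grp);
 	/* ... link the LIOBN to the group identified by iommu_id ... */
 
 	/* and when KVM exits: */
 	vfio_group_put_external_user(grp);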

 Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
 
 This looks fine to me.  Is the plan to add this through the ppc tree
 again?  Thanks,


Nope, better to add this through your tree. And faster for sure :) Thanks!



-- 
Alexey


[PATCH 00/10 v7] KVM: PPC: IOMMU in-kernel handling

2013-07-31 Thread Alexey Kardashevskiy
This accelerates VFIO DMA operations on POWER by moving them
into the kernel.

The changes in this series are:
1. rebased on v3.11-rc3.
2. VFIO external user API will go through VFIO tree so it is
excluded from this series.
3. As nobody ever reacted to "hashtable: add hash_for_each_possible_rcu_notrace()",
Ben suggested to push it via his tree, so I included it in the series.
4. realmode_(get|put)_page is reworked.

More details in the individual patch comments.

Alexey Kardashevskiy (10):
  hashtable: add hash_for_each_possible_rcu_notrace()
  KVM: PPC: reserve a capability number for multitce support
  KVM: PPC: reserve a capability and ioctl numbers for realmode VFIO
  powerpc: Prepare to support kernel handling of IOMMU map/unmap
  powerpc: add real mode support for dma operations on powernv
  KVM: PPC: enable IOMMU_API for KVM_BOOK3S_64 permanently
  KVM: PPC: Add support for multiple-TCE hcalls
  powerpc/iommu: rework to support realmode
  KVM: PPC: Add support for IOMMU in-kernel handling
  KVM: PPC: Add hugepage support for IOMMU in-kernel handling

 Documentation/virtual/kvm/api.txt |  52 +++
 arch/powerpc/include/asm/iommu.h  |   9 +-
 arch/powerpc/include/asm/kvm_host.h   |  37 +++
 arch/powerpc/include/asm/kvm_ppc.h|  18 +-
 arch/powerpc/include/asm/machdep.h|  12 +
 arch/powerpc/include/asm/pgtable-ppc64.h  |   2 +
 arch/powerpc/include/uapi/asm/kvm.h   |   7 +
 arch/powerpc/kernel/iommu.c   | 202 +++
 arch/powerpc/kvm/Kconfig  |   1 +
 arch/powerpc/kvm/book3s_64_vio.c  | 533 +-
 arch/powerpc/kvm/book3s_64_vio_hv.c   | 405 +--
 arch/powerpc/kvm/book3s_hv.c  |  41 ++-
 arch/powerpc/kvm/book3s_hv_rmhandlers.S   |   8 +-
 arch/powerpc/kvm/book3s_pr_papr.c |  35 ++
 arch/powerpc/kvm/powerpc.c|  15 +
 arch/powerpc/mm/init_64.c |  50 ++-
 arch/powerpc/platforms/powernv/pci-ioda.c |  47 ++-
 arch/powerpc/platforms/powernv/pci.c  |  38 ++-
 arch/powerpc/platforms/powernv/pci.h  |   3 +-
 include/linux/hashtable.h |  15 +
 include/linux/mm.h|  14 +
 include/linux/page-flags.h|   4 +-
 include/uapi/linux/kvm.h  |   5 +
 23 files changed, 1430 insertions(+), 123 deletions(-)

-- 
1.8.3.2



[PATCH 10/10] KVM: PPC: Add hugepage support for IOMMU in-kernel handling

2013-07-31 Thread Alexey Kardashevskiy
This adds special support for huge pages (16MB) in real mode.

The reference counting cannot be easily done for such pages in real
mode (when MMU is off) so we added a hash table of huge pages.
It is populated in virtual mode and get_page is called just once
per huge page. Real mode handlers check whether the requested page is
in the hash table; if it is, no reference counting is done, otherwise
an exit to virtual mode happens. The hash table is released at KVM
exit.

At the moment the fastest card available for tests uses up to 9 huge
pages so walking through this hash table does not cost much.
However this can change and we may want to optimize this.
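
For illustration, the real mode lookup over that hash table boils down
to something like this (hypothetical helper, matching the structures
added below):

	static unsigned long example_rm_hugepage_hpa(
			struct kvmppc_spapr_tce_table *tt, unsigned long gpa)
	{
		struct kvmppc_spapr_iommu_hugepage *hp;

		hash_for_each_possible_rcu_notrace(tt->hash_tab, hp, hash_node,
				KVMPPC_SPAPR_HUGEPAGE_HASH(gpa)) {
			if ((gpa < hp->gpa) || (gpa >= hp->gpa + hp->size))
				continue;
			/* found: no reference counting needed in real mode */
			return hp->hpa + (gpa & (hp->size - 1));
		}
		return 0;	/* not in the table: exit to virtual mode */
	}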

Signed-off-by: Paul Mackerras pau...@samba.org
Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru

---

Changes:
2013/07/12:
* removed multiple #ifdef IOMMU_API as IOMMU_API is always enabled
for KVM_BOOK3S_64

2013/06/27:
* list of huge pages replaced with a hashtable for better performance
* spinlock removed from real mode; it now only protects insertion of new
huge page descriptors into the hashtable

2013/06/05:
* fixed compile error when CONFIG_IOMMU_API=n

2013/05/20:
* the real mode handler now searches for a huge page by gpa (used to be pte)
* the virtual mode handler prints warning if it is called twice for the same
huge page as the real mode handler is expected to fail just once - when a huge
page is not in the list yet.
* the huge page is refcounted twice - when added to the hugepage list and
when used in the virtual mode hcall handler (can be optimized but it will
make the patch less nice).

Conflicts:
	arch/powerpc/kernel/iommu.c

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
---
 arch/powerpc/include/asm/kvm_host.h |  25 
 arch/powerpc/kernel/iommu.c |   6 +-
 arch/powerpc/kvm/book3s_64_vio.c| 121 ++--
 arch/powerpc/kvm/book3s_64_vio_hv.c |  32 +-
 4 files changed, 175 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index 4eeaf7d..c57b25a 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -31,6 +31,7 @@
 #include <linux/list.h>
 #include <linux/atomic.h>
 #include <linux/tracepoint.h>
+#include <linux/hashtable.h>
 #include <asm/kvm_asm.h>
 #include <asm/processor.h>
 #include <asm/page.h>
@@ -183,9 +184,33 @@ struct kvmppc_spapr_tce_table {
u32 window_size;
struct iommu_group *grp;/* used for IOMMU groups */
struct vfio_group *vfio_grp;/* used for IOMMU groups */
+   DECLARE_HASHTABLE(hash_tab, ilog2(64)); /* used for IOMMU groups */
+   spinlock_t hugepages_write_lock;/* used for IOMMU groups */
struct page *pages[0];
 };
 
+/*
+ * The KVM guest can be backed with 16MB pages.
+ * In this case, we cannot do page counting from the real mode
+ * as the compound pages are used - they are linked in a list
+ * with pointers as virtual addresses which are inaccessible
+ * in real mode.
+ *
+ * The code below keeps a list of 16MB pages and uses the page struct
+ * in real mode if it is already locked in RAM and inserted into
+ * the list, or switches to virtual mode where it can be
+ * handled in the usual manner.
+ */
+#define KVMPPC_SPAPR_HUGEPAGE_HASH(gpa)	hash_32(gpa >> 24, 32)
+
+struct kvmppc_spapr_iommu_hugepage {
+   struct hlist_node hash_node;
+   unsigned long gpa;  /* Guest physical address */
+   unsigned long hpa;  /* Host physical address */
+   struct page *page;  /* page struct of the very first subpage */
+   unsigned long size; /* Huge page size (always 16MB at the moment) */
+};
+
 struct kvmppc_linear_info {
void*base_virt;
unsigned longbase_pfn;
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 8314c80..e4a8135 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -999,7 +999,8 @@ int iommu_free_tces(struct iommu_table *tbl, unsigned long entry,
if (!pg) {
ret = -EAGAIN;
} else if (PageCompound(pg)) {
-   ret = -EAGAIN;
+   /* Hugepages will be released at KVM exit */
+   ret = 0;
} else {
	if (oldtce & TCE_PCI_WRITE)
SetPageDirty(pg);
@@ -1010,6 +1011,9 @@ int iommu_free_tces(struct iommu_table *tbl, unsigned long entry,
	struct page *pg = pfn_to_page(oldtce >> PAGE_SHIFT);
if (!pg) {
ret = -EAGAIN;
+   } else if (PageCompound(pg)) {
+   /* Hugepages will be released at KVM exit */
+   ret = 0;
} else

[PATCH 09/10] KVM: PPC: Add support for IOMMU in-kernel handling

2013-07-31 Thread Alexey Kardashevskiy
This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
and H_STUFF_TCE requests targeted an IOMMU TCE table without passing
them to user space which saves time on switching to user space and back.

Both real and virtual modes are supported. The kernel tries to
handle a TCE request in real mode; if that fails, it passes the request
to virtual mode to complete the operation. If the virtual mode
handler also fails, the request is passed to user space.

The first user of this is VFIO on POWER. The external user API in VFIO
is required for this patch. The patch adds a new ioctl,
KVM_CREATE_SPAPR_TCE_IOMMU (advertised via the KVM_CAP_SPAPR_TCE_IOMMU
capability), to associate a virtual PCI bus number (LIOBN) with a VFIO
IOMMU group fd and enable in-kernel handling of map/unmap requests.

Tests show that this patch increases transmission speed from 220MB/s
to 750..1020MB/s on 10Gb network (Chelsio CXGB3 10Gb ethernet card).
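
For illustration, user space would enable the acceleration roughly as
follows (hypothetical helper; kvm_fd is /dev/kvm, vm_fd the VM fd,
group_fd an opened VFIO group; error handling trimmed):

	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	static int example_link_liobn(int kvm_fd, int vm_fd,
			__u64 liobn, int group_fd)
	{
		struct kvm_create_spapr_tce_iommu args = {
			.liobn = liobn,
			.fd = group_fd,
			.flags = 0,	/* no flags defined yet */
		};

		if (ioctl(kvm_fd, KVM_CHECK_EXTENSION,
				KVM_CAP_SPAPR_TCE_IOMMU) <= 0)
			return -1;	/* keep map/unmap in user space */
		return ioctl(vm_fd, KVM_CREATE_SPAPR_TCE_IOMMU, &args);
	}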

Signed-off-by: Paul Mackerras pau...@samba.org
Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru

---

Changes:
2013/07/11:
* removed multiple #ifdef IOMMU_API as IOMMU_API is always enabled
for KVM_BOOK3S_64
* kvmppc_gpa_to_hva_and_get also returns the host physical address. There is
not much use for it here, but the next patch adding hugepage support will use it more.

2013/07/06:
* added realmode arch_spin_lock to protect TCE table from races
in real and virtual modes
* POWERPC IOMMU API is changed to support real mode
* iommu_take_ownership and iommu_release_ownership are protected by
iommu_table's locks
* VFIO external user API use rewritten
* multiple small fixes

2013/06/27:
* tce_list page is referenced now in order to protect it from accidental
invalidation during H_PUT_TCE_INDIRECT execution
* added use of the external user VFIO API

2013/06/05:
* changed capability number
* changed ioctl number
* update the doc article number

2013/05/20:
* removed get_user() from real mode handlers
* kvm_vcpu_arch::tce_tmp usage extended. Now real mode handler puts there
translated TCEs, tries realmode_get_page() on those and if it fails, it
passes control over to the virtual mode handler which tries to finish
the request handling
* kvmppc_lookup_pte() now does realmode_get_page() protected by BUSY bit
on a page
* The only reason to pass the request to user mode now is when the user mode
did not register the TCE table in the kernel; in all other cases the virtual mode
handler is expected to do the job

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
---
 Documentation/virtual/kvm/api.txt   |  26 
 arch/powerpc/include/asm/kvm_host.h |   3 +
 arch/powerpc/include/asm/kvm_ppc.h  |   2 +
 arch/powerpc/include/uapi/asm/kvm.h |   7 +
 arch/powerpc/kvm/book3s_64_vio.c| 296 +++-
 arch/powerpc/kvm/book3s_64_vio_hv.c | 122 +++
 arch/powerpc/kvm/powerpc.c  |  12 ++
 7 files changed, 463 insertions(+), 5 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt 
b/Documentation/virtual/kvm/api.txt
index 1c8942a..6ae65bd 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -2408,6 +2408,32 @@ an implementation for these despite the in kernel 
acceleration.
 This capability is always enabled.
 
 
+4.87 KVM_CREATE_SPAPR_TCE_IOMMU
+
+Capability: KVM_CAP_SPAPR_TCE_IOMMU
+Architectures: powerpc
+Type: vm ioctl
+Parameters: struct kvm_create_spapr_tce_iommu (in)
+Returns: 0 on success, -1 on error
+
+struct kvm_create_spapr_tce_iommu {
+   __u64 liobn;
+   __u32 fd;
+   __u32 flags;
+};
+
+This creates a link between an IOMMU group and a hardware TCE (translation
+control entry) table. This link lets the host kernel know what IOMMU
+group (i.e. TCE table) to use for the LIOBN number passed with
+H_PUT_TCE, H_PUT_TCE_INDIRECT, H_STUFF_TCE hypercalls.
+
+User space passes a VFIO group fd. Using the external user VFIO API,
+KVM tries to get the IOMMU ID from the passed fd. If this succeeds,
+acceleration is enabled; if it fails, map/unmap requests are passed to user space.
+
+No flag is supported at the moment.
+
+
 5. The kvm_run structure
 
 
diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index b8fe3de..4eeaf7d 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -181,6 +181,8 @@ struct kvmppc_spapr_tce_table {
struct kvm *kvm;
u64 liobn;
u32 window_size;
+   struct iommu_group *grp;/* used for IOMMU groups */
+   struct vfio_group *vfio_grp;/* used for IOMMU groups */
struct page *pages[0];
 };
 
@@ -612,6 +614,7 @@ struct kvm_vcpu_arch {
u64 busy_preempt;
 
 	unsigned long *tce_tmp_hpas;	/* TCE cache for TCE_PUT_INDIRECT hcall */
+	unsigned long tce_tmp_num;	/* Number of handled TCEs in the cache */
enum {
TCERM_NONE,
TCERM_GETPAGE,
diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
b/arch/powerpc/include/asm/kvm_ppc.h
index 0ce4691..297cab5 100644

[PATCH 06/10] KVM: PPC: enable IOMMU_API for KVM_BOOK3S_64 permanently

2013-07-31 Thread Alexey Kardashevskiy
It does not make much sense to have KVM on book3s 64-bit and
not to have the IOMMU bits for PCI pass-through support, as they cost
little and allow VFIO to function on book3s KVM.

Having IOMMU_API always enabled makes it unnecessary to have a lot of
#ifdef IOMMU_API in arch/powerpc/kvm/book3s_64_vio*. With those
ifdefs, only user space emulated devices could be accelerated
(but not VFIO), which does not seem very useful.

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
---
 arch/powerpc/kvm/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
index c55c538..3b2b761 100644
--- a/arch/powerpc/kvm/Kconfig
+++ b/arch/powerpc/kvm/Kconfig
@@ -59,6 +59,7 @@ config KVM_BOOK3S_64
depends on PPC_BOOK3S_64
select KVM_BOOK3S_64_HANDLER
select KVM
+   select SPAPR_TCE_IOMMU
---help---
  Support running unmodified book3s_64 and book3s_32 guest kernels
  in virtual machines on book3s_64 host processors.
-- 
1.8.3.2

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 08/10] powerpc/iommu: rework to support realmode

2013-07-31 Thread Alexey Kardashevskiy
The TCE tables handling may differ for real and virtual modes so
additional ppc_md.tce_build_rm/ppc_md.tce_free_rm/ppc_md.tce_flush_rm
handlers were introduced earlier.

So this adds the following:
1. support for the new ppc_md calls;
2. ability of iommu_tce_build to process multiple entries per
call;
3. arch_spin_lock to protect TCE table from races in both real and virtual
modes;
4. proper TCE table protection from races with the existing IOMMU code
in iommu_take_ownership/iommu_release_ownership;
5. hwaddr variable renamed to hpa as it better describes what it
actually represents;
6. iommu_tce_direction is static now as it is not called from anywhere else.

This will be used by upcoming real mode support of VFIO on POWER.
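
As an illustration of the reworked calls (not part of this patch), a
caller mapping and, on failure, unmapping a batch would do roughly the
following, assuming entry is the first TCE index and hpas[] already
carries host physical addresses with the TCE_PCI_READ/TCE_PCI_WRITE
bits ORed in:

	ret = iommu_tce_build(tbl, entry, hpas, npages, rm);
	if (ret)
		iommu_free_tces(tbl, entry, npages, rm);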

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
---
 arch/powerpc/include/asm/iommu.h |   9 +-
 arch/powerpc/kernel/iommu.c  | 198 ++-
 2 files changed, 136 insertions(+), 71 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index c34656a..b01bde1 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -78,6 +78,7 @@ struct iommu_table {
unsigned long *it_map;   /* A simple allocation bitmap for now */
 #ifdef CONFIG_IOMMU_API
struct iommu_group *it_group;
+   arch_spinlock_t it_rm_lock;
 #endif
 };
 
@@ -152,9 +153,9 @@ extern int iommu_tce_clear_param_check(struct iommu_table 
*tbl,
 extern int iommu_tce_put_param_check(struct iommu_table *tbl,
unsigned long ioba, unsigned long tce);
 extern int iommu_tce_build(struct iommu_table *tbl, unsigned long entry,
-   unsigned long hwaddr, enum dma_data_direction direction);
-extern unsigned long iommu_clear_tce(struct iommu_table *tbl,
-   unsigned long entry);
+   unsigned long *hpas, unsigned long npages, bool rm);
+extern int iommu_free_tces(struct iommu_table *tbl, unsigned long entry,
+   unsigned long npages, bool rm);
 extern int iommu_clear_tces_and_put_pages(struct iommu_table *tbl,
unsigned long entry, unsigned long pages);
 extern int iommu_put_tce_user_mode(struct iommu_table *tbl,
@@ -164,7 +165,5 @@ extern void iommu_flush_tce(struct iommu_table *tbl);
 extern int iommu_take_ownership(struct iommu_table *tbl);
 extern void iommu_release_ownership(struct iommu_table *tbl);
 
-extern enum dma_data_direction iommu_tce_direction(unsigned long tce);
-
 #endif /* __KERNEL__ */
 #endif /* _ASM_IOMMU_H */
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index b20ff17..8314c80 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -903,7 +903,7 @@ void iommu_register_group(struct iommu_table *tbl,
kfree(name);
 }
 
-enum dma_data_direction iommu_tce_direction(unsigned long tce)
+static enum dma_data_direction iommu_tce_direction(unsigned long tce)
 {
	if ((tce & TCE_PCI_READ) && (tce & TCE_PCI_WRITE))
return DMA_BIDIRECTIONAL;
@@ -914,7 +914,6 @@ enum dma_data_direction iommu_tce_direction(unsigned long tce)
else
return DMA_NONE;
 }
-EXPORT_SYMBOL_GPL(iommu_tce_direction);
 
 void iommu_flush_tce(struct iommu_table *tbl)
 {
@@ -972,73 +971,117 @@ int iommu_tce_put_param_check(struct iommu_table *tbl,
 }
 EXPORT_SYMBOL_GPL(iommu_tce_put_param_check);
 
-unsigned long iommu_clear_tce(struct iommu_table *tbl, unsigned long entry)
-{
-   unsigned long oldtce;
-   struct iommu_pool *pool = get_pool(tbl, entry);
-
-   spin_lock(&(pool->lock));
-
-   oldtce = ppc_md.tce_get(tbl, entry);
-   if (oldtce & (TCE_PCI_WRITE | TCE_PCI_READ))
-   ppc_md.tce_free(tbl, entry, 1);
-   else
-   oldtce = 0;
-
-   spin_unlock(&(pool->lock));
-
-   return oldtce;
-}
-EXPORT_SYMBOL_GPL(iommu_clear_tce);
-
 int iommu_clear_tces_and_put_pages(struct iommu_table *tbl,
unsigned long entry, unsigned long pages)
 {
-   unsigned long oldtce;
-   struct page *page;
-
-   for ( ; pages; --pages, ++entry) {
-   oldtce = iommu_clear_tce(tbl, entry);
-   if (!oldtce)
-   continue;
-
-   page = pfn_to_page(oldtce >> PAGE_SHIFT);
-   WARN_ON(!page);
-   if (page) {
-   if (oldtce & TCE_PCI_WRITE)
-   SetPageDirty(page);
-   put_page(page);
-   }
-   }
-
-   return 0;
+   return iommu_free_tces(tbl, entry, pages, false);
 }
 EXPORT_SYMBOL_GPL(iommu_clear_tces_and_put_pages);
 
-/*
- * hwaddr is a kernel virtual address here (0xc... bazillion),
- * tce_build converts it to a physical address.
- */
+int iommu_free_tces(struct iommu_table *tbl, unsigned long entry,
+   unsigned long npages, bool rm)
+{
+   int i, ret = 0, to_free = 0;
+
+   if (rm && !ppc_md.tce_free_rm)
+   return

[PATCH 05/10] powerpc: add real mode support for dma operations on powernv

2013-07-31 Thread Alexey Kardashevskiy
The existing TCE machine calls (tce_build and tce_free) only support
virtual mode as they call __raw_writeq for TCE invalidation, which
fails in real mode.

This introduces tce_build_rm and tce_free_rm real mode versions
which do mostly the same but use Store Doubleword Caching Inhibited
Indexed instruction for TCE invalidation.

This new feature is going to be utilized by real mode support of VFIO.
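
A platform then hooks the real mode variants next to the existing
callbacks; a sketch (the pnv_* names below are assumed):

	ppc_md.tce_build_rm = pnv_tce_build_rm;
	ppc_md.tce_free_rm  = pnv_tce_free_rm;
	ppc_md.tce_flush_rm = pnv_tce_flush_rm;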

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
---
Changes:
2013/07/11:
* added comment why stdcix cannot be used in virtual mode

2013/08/07:
* tested on p7ioc and fixed a bug with realmode addresses

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
---
 arch/powerpc/include/asm/machdep.h| 12 
 arch/powerpc/platforms/powernv/pci-ioda.c | 47 +++
 arch/powerpc/platforms/powernv/pci.c  | 38 +
 arch/powerpc/platforms/powernv/pci.h  |  3 +-
 4 files changed, 81 insertions(+), 19 deletions(-)

diff --git a/arch/powerpc/include/asm/machdep.h 
b/arch/powerpc/include/asm/machdep.h
index 8b48090..07dd3b1 100644
--- a/arch/powerpc/include/asm/machdep.h
+++ b/arch/powerpc/include/asm/machdep.h
@@ -78,6 +78,18 @@ struct machdep_calls {
long index);
void(*tce_flush)(struct iommu_table *tbl);
 
+   /* _rm versions are for real mode use only */
+   int (*tce_build_rm)(struct iommu_table *tbl,
+long index,
+long npages,
+unsigned long uaddr,
+enum dma_data_direction direction,
+struct dma_attrs *attrs);
+   void(*tce_free_rm)(struct iommu_table *tbl,
+   long index,
+   long npages);
+   void(*tce_flush_rm)(struct iommu_table *tbl);
+
void __iomem *  (*ioremap)(phys_addr_t addr, unsigned long size,
   unsigned long flags, void *caller);
void(*iounmap)(volatile void __iomem *token);
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index d8140b1..5815f1d 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -70,6 +70,16 @@ define_pe_printk_level(pe_err, KERN_ERR);
 define_pe_printk_level(pe_warn, KERN_WARNING);
 define_pe_printk_level(pe_info, KERN_INFO);
 
+/*
+ * stdcix is only supposed to be used in hypervisor real mode as per
+ * the architecture spec
+ */
+static inline void __raw_rm_writeq(u64 val, volatile void __iomem *paddr)
+{
+   __asm__ __volatile__("stdcix %0,0,%1"
+   : : "r" (val), "r" (paddr) : "memory");
+}
+
 static int pnv_ioda_alloc_pe(struct pnv_phb *phb)
 {
unsigned long pe;
@@ -454,10 +464,13 @@ static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe, struct pci_bus *bus)
}
 }
 
-static void pnv_pci_ioda1_tce_invalidate(struct iommu_table *tbl,
-u64 *startp, u64 *endp)
+static void pnv_pci_ioda1_tce_invalidate(struct pnv_ioda_pe *pe,
+struct iommu_table *tbl,
+u64 *startp, u64 *endp, bool rm)
 {
-   u64 __iomem *invalidate = (u64 __iomem *)tbl->it_index;
+   u64 __iomem *invalidate = rm ?
+   (u64 __iomem *)pe->tce_inval_reg_phys :
+   (u64 __iomem *)tbl->it_index;
unsigned long start, end, inc;
 
start = __pa(startp);
@@ -484,7 +497,10 @@ static void pnv_pci_ioda1_tce_invalidate(struct iommu_table *tbl,
 
 mb(); /* Ensure above stores are visible */
 while (start <= end) {
-__raw_writeq(start, invalidate);
+   if (rm)
+   __raw_rm_writeq(start, invalidate);
+   else
+   __raw_writeq(start, invalidate);
 start += inc;
 }
 
@@ -496,10 +512,12 @@ static void pnv_pci_ioda1_tce_invalidate(struct iommu_table *tbl,
 
 static void pnv_pci_ioda2_tce_invalidate(struct pnv_ioda_pe *pe,
 struct iommu_table *tbl,
-u64 *startp, u64 *endp)
+u64 *startp, u64 *endp, bool rm)
 {
unsigned long start, end, inc;
-   u64 __iomem *invalidate = (u64 __iomem *)tbl->it_index;
+   u64 __iomem *invalidate = rm ?
+   (u64 __iomem *)pe->tce_inval_reg_phys :
+   (u64 __iomem *)tbl->it_index;
 
/* We'll invalidate DMA address in PE scope */
	start = 0x2ul << 60;
@@ -515,22 +533,25 @@ static void pnv_pci_ioda2_tce_invalidate(struct pnv_ioda_pe *pe,
mb();
 
	while (start <= end) {
-   __raw_writeq(start, invalidate);
+   if (rm

[PATCH 07/10] KVM: PPC: Add support for multiple-TCE hcalls

2013-07-31 Thread Alexey Kardashevskiy
This adds real mode handlers for the H_PUT_TCE_INDIRECT and
H_STUFF_TCE hypercalls for user space emulated devices such as IBMVIO
devices or emulated PCI.  These calls allow adding multiple entries
(up to 512) into the TCE table in one call which saves time on
transition to/from real mode.

This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs
(copied from user and verified) before writing the whole list into
the TCE table. This cache will be utilized more in the upcoming
VFIO/IOMMU support to continue TCE list processing in the virtual
mode in case the real mode handler fails for some reason.

This adds a function to convert a guest physical address to a host
virtual address in order to parse a TCE list from H_PUT_TCE_INDIRECT.
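
For reference, a stripped-down version of such a conversion (a
hypothetical helper based on the standard memslot API, without the
page pinning) looks like:

	static unsigned long example_gpa_to_hva(struct kvm *kvm,
			unsigned long gpa)
	{
		unsigned long hva = gfn_to_hva(kvm, gpa >> PAGE_SHIFT);

		if (kvm_is_error_hva(hva))
			return 0;
		/* keep the offset within the page */
		return hva | (gpa & (PAGE_SIZE - 1));
	}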

This also implements the KVM_CAP_PPC_MULTITCE capability. When present,
the hypercalls mentioned above may or may not be processed successfully
in the kernel based fast path. If they can not be handled by the kernel,
they will get passed on to user space. So user space still has to have
an implementation for these despite the in kernel acceleration.
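
User space can probe for the capability with the standard extension
check and, if it is set, expose "hcall-multi-tce" to the guest; e.g.:

	/* userspace sketch; kvm_fd is the /dev/kvm file descriptor */
	int have_multitce = ioctl(kvm_fd, KVM_CHECK_EXTENSION,
			KVM_CAP_SPAPR_MULTITCE) > 0;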

Signed-off-by: Paul Mackerras pau...@samba.org
Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru

---
Changelog:
2013/08/01 (v7):
* realmode_get_page/realmode_put_page use was replaced with
get_page_unless_zero/put_page_unless_one

2013/07/11:
* addressed many, many comments from maintainers

2013/07/06:
* fixed number of wrong get_page()/put_page() calls

2013/06/27:
* fixed clear of BUSY bit in kvmppc_lookup_pte()
* H_PUT_TCE_INDIRECT does realmode_get_page() now
* KVM_CAP_SPAPR_MULTITCE now depends on CONFIG_PPC_BOOK3S_64
* updated doc

2013/06/05:
* fixed mistype about IBMVIO in the commit message
* updated doc and moved it to another section
* changed capability number

2013/05/21:
* added kvm_vcpu_arch::tce_tmp
* removed cleanup if put_indirect failed; instead we do not even start
writing to the TCE table if we cannot get TCEs from the user or they are
invalid
* kvmppc_emulated_h_put_tce is split into kvmppc_emulated_put_tce
and kvmppc_emulated_validate_tce (for the previous item)
* fixed bug with fallthrough for H_IPI
* removed all get_user() from real mode handlers
* kvmppc_lookup_pte() added (instead of making lookup_linux_pte public)

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
---
 Documentation/virtual/kvm/api.txt   |  26 
 arch/powerpc/include/asm/kvm_host.h |   9 ++
 arch/powerpc/include/asm/kvm_ppc.h  |  16 +-
 arch/powerpc/kvm/book3s_64_vio.c| 132 +++-
 arch/powerpc/kvm/book3s_64_vio_hv.c | 267 
 arch/powerpc/kvm/book3s_hv.c|  41 -
 arch/powerpc/kvm/book3s_hv_rmhandlers.S |   8 +-
 arch/powerpc/kvm/book3s_pr_papr.c   |  35 +
 arch/powerpc/kvm/powerpc.c  |   3 +
 9 files changed, 503 insertions(+), 34 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt 
b/Documentation/virtual/kvm/api.txt
index ef925ea..1c8942a 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -2382,6 +2382,32 @@ calls by the guest for that service will be passed to 
userspace to be
 handled.
 
 
+4.86 KVM_CAP_PPC_MULTITCE
+
+Capability: KVM_CAP_PPC_MULTITCE
+Architectures: ppc
+Type: vm
+
+This capability means the kernel is capable of handling hypercalls
+H_PUT_TCE_INDIRECT and H_STUFF_TCE without passing those into the user
+space. This significantly accelerates DMA operations for PPC KVM guests.
+User space should expect that its handlers for these hypercalls
+are not going to be called if user space previously registered LIOBN
+in KVM (via KVM_CREATE_SPAPR_TCE or similar calls).
+
+In order to enable H_PUT_TCE_INDIRECT and H_STUFF_TCE use in the guest,
+user space might have to advertise it for the guest. For example,
+an IBM pSeries (sPAPR) guest starts using them if "hcall-multi-tce" is
+present in the "ibm,hypertas-functions" device-tree property.
+
+The hypercalls mentioned above may or may not be processed successfully
+in the kernel based fast path. If they can not be handled by the kernel,
+they will get passed on to user space. So user space still has to have
+an implementation for these despite the in kernel acceleration.
+
+This capability is always enabled.
+
+
 5. The kvm_run structure
 
 
diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index af326cd..b8fe3de 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -30,6 +30,7 @@
 #include <linux/kvm_para.h>
 #include <linux/list.h>
 #include <linux/atomic.h>
+#include <linux/tracepoint.h>
 #include <asm/kvm_asm.h>
 #include <asm/processor.h>
 #include <asm/page.h>
@@ -609,6 +610,14 @@ struct kvm_vcpu_arch {
spinlock_t tbacct_lock;
u64 busy_stolen;
u64 busy_preempt;
+
+	unsigned long *tce_tmp_hpas;	/* TCE cache for TCE_PUT_INDIRECT hcall */
+   enum {
+   TCERM_NONE

[PATCH 02/10] KVM: PPC: reserve a capability number for multitce support

2013-07-31 Thread Alexey Kardashevskiy
This is to reserve a capability number for upcoming support
of the H_PUT_TCE_INDIRECT and H_STUFF_TCE pseries hypercalls
which support multiple DMA map/unmap operations per call.

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
---
Changes:
2013/07/16:
* changed the number

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
---
 include/uapi/linux/kvm.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index acccd08..99c2533 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -667,6 +667,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_PPC_RTAS 91
 #define KVM_CAP_IRQ_XICS 92
 #define KVM_CAP_ARM_EL1_32BIT 93
+#define KVM_CAP_SPAPR_MULTITCE 94
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
-- 
1.8.3.2



[PATCH 03/10] KVM: PPC: reserve a capability and ioctl numbers for realmode VFIO

2013-07-31 Thread Alexey Kardashevskiy
This is to reserve a capability and ioctl number for upcoming support
of VFIO-IOMMU DMA operations in real mode.

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru

---
Changes:
2013/07/16:
* changed the number

2013/07/11:
* changed order in the file, added a comment about a gap in the ioctl numbers

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
---
 include/uapi/linux/kvm.h | 4 
 1 file changed, 4 insertions(+)

diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 99c2533..53c3f1f 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -668,6 +668,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_IRQ_XICS 92
 #define KVM_CAP_ARM_EL1_32BIT 93
 #define KVM_CAP_SPAPR_MULTITCE 94
+#define KVM_CAP_SPAPR_TCE_IOMMU 95
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -933,6 +934,9 @@ struct kvm_s390_ucas_mapping {
 #define KVM_ARM_SET_DEVICE_ADDR  _IOW(KVMIO,  0xab, struct kvm_arm_device_addr)
 /* Available with KVM_CAP_PPC_RTAS */
 #define KVM_PPC_RTAS_DEFINE_TOKEN _IOW(KVMIO,  0xac, struct kvm_rtas_token_args)
+/* 0xad and 0xae are already taken */
+/* Available with KVM_CAP_SPAPR_TCE_IOMMU */
+#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO,  0xaf, struct kvm_create_spapr_tce_iommu)
 
 /* ioctl for vm fd */
 #define KVM_CREATE_DEVICE_IOWR(KVMIO,  0xe0, struct kvm_create_device)
-- 
1.8.3.2



[PATCH 01/10] hashtable: add hash_for_each_possible_rcu_notrace()

2013-07-31 Thread Alexey Kardashevskiy
This adds hash_for_each_possible_rcu_notrace() which is basically
a notrace clone of hash_for_each_possible_rcu(), which cannot be
used in real mode due to its tracing/debugging capability.
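
Usage mirrors hash_for_each_possible_rcu(); a sketch with a made-up
struct (linux/hashtable.h assumed included):

	struct myobj {
		unsigned long key;
		struct hlist_node node;
	};
	static DEFINE_HASHTABLE(mytable, 6);	/* 2^6 buckets */

	static struct myobj *example_find(unsigned long key)
	{
		struct myobj *obj;

		hash_for_each_possible_rcu_notrace(mytable, obj, node, key)
			if (obj->key == key)
				return obj;
		return NULL;
	}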

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
---
 include/linux/hashtable.h | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/include/linux/hashtable.h b/include/linux/hashtable.h
index a9df51f..af8b169 100644
--- a/include/linux/hashtable.h
+++ b/include/linux/hashtable.h
@@ -174,6 +174,21 @@ static inline void hash_del_rcu(struct hlist_node *node)
member)
 
 /**
+ * hash_for_each_possible_rcu_notrace - iterate over all possible objects
+ * hashing to the same bucket in an rcu enabled hashtable
+ * @name: hashtable to iterate
+ * @obj: the type * to use as a loop cursor for each entry
+ * @member: the name of the hlist_node within the struct
+ * @key: the key of the objects to iterate over
+ *
+ * This is the same as hash_for_each_possible_rcu() except that it does
+ * not do any RCU debugging or tracing.
+ */
+#define hash_for_each_possible_rcu_notrace(name, obj, member, key) \
+   hlist_for_each_entry_rcu_notrace(obj, &name[hash_min(key, HASH_BITS(name))],\
+   member)
+
+/**
  * hash_for_each_possible_safe - iterate over all possible objects hashing to the
  * same bucket safe against removals
  * @name: hashtable to iterate
-- 
1.8.3.2
