Re: [PATCH V4 6/7] kvm tools: Add PPC64 PCI Host Bridge
On 01/02/12 14:40, David Gibson wrote:
> On Tue, Jan 31, 2012 at 05:34:41PM +1100, Matt Evans wrote:
>> This provides the PCI bridge, definitions for the address layout of the
>> windows, and wires in IRQs. Once all PCI devices are registered, they are
>> enumerated and DT nodes are generated for each.
>>
>> Signed-off-by: Matt Evans m...@ozlabs.org
>
> For the bits derived from my qemu code:
> Signed-off-by: David Gibson da...@gibson.dropbear.id.au

For the bits derived from my qemu code:
Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru

--
Alexey
--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 0/3] adding MSI/MSIX for PCI on POWER
The following patches add MSI/MSIX support for PCI on POWER. The first aim is
virtio-pci, so that is what was tested. It will also support VFIO when that
becomes publicly available.

Alexey Kardashevskiy (3):
  msi/msix: added functions to API to set up message address and data
  pseries: added allocator for a block of IRQs
  pseries pci: added MSI/MSIX support

 hw/msi.c       |   14 +++
 hw/msi.h       |    1 +
 hw/msix.c      |   10 ++
 hw/msix.h      |    3 +
 hw/spapr.c     |   26 +-
 hw/spapr.h     |    1 +
 hw/spapr_pci.c |  266 +++++++++++++++++++++++++++++++++++++++++++++++++--
 hw/spapr_pci.h |   13 +++-
 trace-events   |    9 ++
 9 files changed, 331 insertions(+), 12 deletions(-)

--
1.7.7.3
[PATCH 1/3] msi/msix: added functions to API to set up message address and data
Normally QEMU expects the guest to initialize MSI/MSIX vectors. However, on
POWER the guest uses the RTAS subsystem to configure MSI/MSIX and does not
write these vectors to the device's config space or MSIX BAR. On the other
hand, msi_notify()/msix_notify() write to these vectors to signal the guest
about an interrupt, so we have to write correct vectors to the devices in
order not to change every user of MSI/MSIX.

The first aim is to support MSIX for virtio-pci on POWER. There is another
patch for POWER coming which introduces a special memory region where
MSI/MSIX vectors point to.

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
---
 hw/msi.c  |   14 ++++++++++++++
 hw/msi.h  |    1 +
 hw/msix.c |   10 ++++++++++
 hw/msix.h |    3 +++
 4 files changed, 28 insertions(+), 0 deletions(-)

diff --git a/hw/msi.c b/hw/msi.c
index 5d6ceb6..124878a 100644
--- a/hw/msi.c
+++ b/hw/msi.c
@@ -358,3 +358,17 @@ unsigned int msi_nr_vectors_allocated(const PCIDevice *dev)
     uint16_t flags = pci_get_word(dev->config + msi_flags_off(dev));
     return msi_nr_vectors(flags);
 }
+
+void msi_set_address_data(PCIDevice *dev, uint64_t address, uint16_t data)
+{
+    uint16_t flags = pci_get_word(dev->config + msi_flags_off(dev));
+    bool msi64bit = flags & PCI_MSI_FLAGS_64BIT;
+
+    if (msi64bit) {
+        pci_set_quad(dev->config + msi_address_lo_off(dev), address);
+    } else {
+        pci_set_long(dev->config + msi_address_lo_off(dev), address);
+    }
+    pci_set_word(dev->config + msi_data_off(dev, msi64bit), data);
+}
+
diff --git a/hw/msi.h b/hw/msi.h
index 3040bb0..0acf434 100644
--- a/hw/msi.h
+++ b/hw/msi.h
@@ -34,6 +34,7 @@ void msi_reset(PCIDevice *dev);
 void msi_notify(PCIDevice *dev, unsigned int vector);
 void msi_write_config(PCIDevice *dev, uint32_t addr, uint32_t val, int len);
 unsigned int msi_nr_vectors_allocated(const PCIDevice *dev);
+void msi_set_address_data(PCIDevice *dev, uint64_t address, uint16_t data);
 
 static inline bool msi_present(const PCIDevice *dev)
 {
diff --git a/hw/msix.c b/hw/msix.c
index 3835eaa..c57c299 100644
--- a/hw/msix.c
+++ b/hw/msix.c
@@ -414,3 +414,13 @@ void msix_unuse_all_vectors(PCIDevice *dev)
         return;
     msix_free_irq_entries(dev);
 }
+
+void msix_set_address_data(PCIDevice *dev, int vector,
+                           uint64_t address, uint32_t data)
+{
+    uint8_t *table_entry = dev->msix_table_page + vector * PCI_MSIX_ENTRY_SIZE;
+
+    pci_set_quad(table_entry + PCI_MSIX_ENTRY_LOWER_ADDR, address);
+    pci_set_long(table_entry + PCI_MSIX_ENTRY_DATA, data);
+    table_entry[PCI_MSIX_ENTRY_VECTOR_CTRL] &= ~PCI_MSIX_ENTRY_CTRL_MASKBIT;
+}
+
diff --git a/hw/msix.h b/hw/msix.h
index 5aba22b..e6bb696 100644
--- a/hw/msix.h
+++ b/hw/msix.h
@@ -29,4 +29,7 @@ void msix_notify(PCIDevice *dev, unsigned vector);
 void msix_reset(PCIDevice *dev);
 
+void msix_set_address_data(PCIDevice *dev, int vector,
+                           uint64_t address, uint32_t data);
+
 #endif
--
1.7.7.3
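The interesting part of the patch above is the layout arithmetic: with a 64-bit capable function, the data word sits after an 8-byte address field, otherwise after a 4-byte one. The sketch below mimics that logic on a plain byte array standing in for PCI config space; the offsets, names, and little-endian helpers are made up for illustration and are not QEMU's real `pci_set_*`/`msi_*_off()` helpers.

```c
#include <assert.h>
#include <stdint.h>

/* hypothetical offsets inside a fake MSI capability block */
#define MSI_FLAGS_OFF    2
#define MSI_ADDR_LO_OFF  4
#define MSI_FLAGS_64BIT  0x0080   /* stand-in for PCI_MSI_FLAGS_64BIT */

static void set_le16(uint8_t *p, uint16_t v) { p[0] = v & 0xff; p[1] = v >> 8; }
static void set_le32(uint8_t *p, uint32_t v) { set_le16(p, v & 0xffff); set_le16(p + 2, v >> 16); }
static void set_le64(uint8_t *p, uint64_t v) { set_le32(p, (uint32_t)v); set_le32(p + 4, v >> 32); }
static uint16_t get_le16(const uint8_t *p)   { return (uint16_t)(p[0] | (p[1] << 8)); }

/* same shape as msi_set_address_data(): address first, then the data word,
 * whose offset depends on whether the address field is 32 or 64 bit wide */
static void msi_set_address_data_sketch(uint8_t *cap, uint64_t address, uint16_t data)
{
    int msi64bit = get_le16(cap + MSI_FLAGS_OFF) & MSI_FLAGS_64BIT;

    if (msi64bit) {
        set_le64(cap + MSI_ADDR_LO_OFF, address);
    } else {
        set_le32(cap + MSI_ADDR_LO_OFF, (uint32_t)address);
    }
    set_le16(cap + MSI_ADDR_LO_OFF + (msi64bit ? 8 : 4), data);
}
```

The same pattern is why the real code reads the flags word first: the data offset cannot be computed without knowing the 64-bit capability bit.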
[PATCH 2/3] pseries: added allocator for a block of IRQs
The patch adds a simple helper which allocates a consecutive sequence of IRQs,
calling spapr_allocate_irq() for each and checking that the allocated IRQs are
in fact consecutive.

The patch is required for the upcoming MSI/MSIX support on POWER.

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
---
 hw/spapr.c |   19 +++++++++++++++++++
 hw/spapr.h |    1 +
 2 files changed, 20 insertions(+), 0 deletions(-)

diff --git a/hw/spapr.c b/hw/spapr.c
index 2e0b4b8..ef6ffcb 100644
--- a/hw/spapr.c
+++ b/hw/spapr.c
@@ -113,6 +113,25 @@ qemu_irq spapr_allocate_irq(uint32_t hint, uint32_t *irq_num,
     return qirq;
 }
 
+/* Allocate a block of consecutive IRQs, return the number of the first */
+int spapr_allocate_irq_block(uint32_t num, enum xics_irq_type type)
+{
+    int i, ret;
+    uint32_t irq = -1;
+
+    for (i = 0; i < num; ++i) {
+        if (!spapr_allocate_irq(0, &irq, type)) {
+            return -1;
+        }
+        if (0 == i) {
+            ret = irq;
+        } else if (ret + i != irq) {
+            return -1;
+        }
+    }
+    return ret;
+}
+
 static int spapr_set_associativity(void *fdt, sPAPREnvironment *spapr)
 {
     int ret = 0, offset;
diff --git a/hw/spapr.h b/hw/spapr.h
index 502393a..408b470 100644
--- a/hw/spapr.h
+++ b/hw/spapr.h
@@ -289,6 +289,7 @@ target_ulong spapr_hypercall(CPUPPCState *env, target_ulong opcode,
 
 qemu_irq spapr_allocate_irq(uint32_t hint, uint32_t *irq_num,
                             enum xics_irq_type type);
+int spapr_allocate_irq_block(uint32_t num, enum xics_irq_type type);
 
 static inline qemu_irq spapr_allocate_msi(uint32_t hint, uint32_t *irq_num)
 {
--
1.7.7.3
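The helper's core idea is that the underlying allocator hands out one IRQ at a time, so the block allocator must verify after each call that the new IRQ is exactly `first + i`; any hole means failure. This can be sketched with a trivial bump allocator standing in for `spapr_allocate_irq()` (the stand-in, its starting IRQ number, and the function names are assumptions for illustration only):

```c
#include <assert.h>
#include <stdint.h>

static uint32_t next_irq = 16;           /* hypothetical first allocatable IRQ */

/* stand-in for spapr_allocate_irq(): returns nonzero on success */
static int fake_allocate_irq(uint32_t *irq_num)
{
    *irq_num = next_irq++;
    return 1;
}

/* allocate num consecutive IRQs; return the first, or -1 on failure */
static int allocate_irq_block(uint32_t num)
{
    uint32_t irq = 0;
    int i, first = -1;

    for (i = 0; i < (int)num; i++) {
        if (!fake_allocate_irq(&irq)) {
            return -1;                   /* allocator exhausted */
        }
        if (i == 0) {
            first = (int)irq;
        } else if (first + i != (int)irq) {
            return -1;                   /* hole in the sequence: give up */
        }
    }
    return first;
}
```

Note that, like the patch, this sketch does not release already-allocated IRQs on failure; a production allocator would need to roll back or use a range-aware allocator instead.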
[PATCH 3/3] pseries pci: added MSI/MSIX support
virtio-pci expects the guest to set up the MSI message address and data and to
do other initialization such as vector number negotiation. It also notifies
the guest by writing an MSI message to a previously set address.

This patch includes:

1. The RTAS call "ibm,change-msi", which sets up the number of MSI vectors
per device. Note that this call may configure and return fewer vectors than
requested.

2. The RTAS call "ibm,query-interrupt-source-number", which translates an MSI
vector to an interrupt controller (XICS) IRQ number.

3. A config_space_address-to-msi_table map to resolve IRQs from a config
address, as the MSI RTAS calls take a PCI config space address as a device
identifier.

4. An MSIX memory region to catch msi_notify()/msix_notify() writes from
virtio-pci and pass them to the guest via qemu_irq_pulse().

This patch depends on the "msi/msix: added functions to API to set up message
address and data" patch.

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
---
 hw/spapr.c     |    9 ++-
 hw/spapr_pci.c |  266 ++++++++++++++++++++++++++++++++++++++++++++++++++--
 hw/spapr_pci.h |   13 +++-
 trace-events   |    9 ++
 4 files changed, 284 insertions(+), 13 deletions(-)

diff --git a/hw/spapr.c b/hw/spapr.c
index ef6ffcb..35ad075 100644
--- a/hw/spapr.c
+++ b/hw/spapr.c
@@ -43,6 +43,7 @@
 #include "hw/spapr_vio.h"
 #include "hw/spapr_pci.h"
 #include "hw/xics.h"
+#include "hw/msi.h"
 
 #include "kvm.h"
 #include "kvm_ppc.h"
@@ -82,6 +83,7 @@
 #define SPAPR_PCI_MEM_WIN_ADDR  (0x10000000000ULL + 0xA0000000)
 #define SPAPR_PCI_MEM_WIN_SIZE  0x20000000
 #define SPAPR_PCI_IO_WIN_ADDR   (0x10000000000ULL + 0x80000000)
+#define SPAPR_PCI_MSI_WIN_ADDR  (0x10000000000ULL + 0x90000000)
 
 #define PHANDLE_XICP            0x00001111
 
@@ -116,7 +118,7 @@ qemu_irq spapr_allocate_irq(uint32_t hint, uint32_t *irq_num,
 /* Allocate a block of consecutive IRQs, return the number of the first */
 int spapr_allocate_irq_block(uint32_t num, enum xics_irq_type type)
 {
-    int i, ret;
+    int i, ret = 0;
     uint32_t irq = -1;
 
     for (i = 0; i < num; ++i) {
@@ -690,6 +692,8 @@ static void ppc_spapr_init(ram_addr_t ram_size,
     long pteg_shift = 17;
     char *filename;
 
+    msi_supported = true;
+
     spapr = g_malloc0(sizeof(*spapr));
     QLIST_INIT(&spapr->phbs);
@@ -804,7 +808,8 @@ static void ppc_spapr_init(ram_addr_t ram_size,
     spapr_create_phb(spapr, "pci", SPAPR_PCI_BUID,
                      SPAPR_PCI_MEM_WIN_ADDR,
                      SPAPR_PCI_MEM_WIN_SIZE,
-                     SPAPR_PCI_IO_WIN_ADDR);
+                     SPAPR_PCI_IO_WIN_ADDR,
+                     SPAPR_PCI_MSI_WIN_ADDR);
 
     for (i = 0; i < nb_nics; i++) {
         NICInfo *nd = &nd_table[i];
diff --git a/hw/spapr_pci.c b/hw/spapr_pci.c
index 93017cd..21fbc50 100644
--- a/hw/spapr_pci.c
+++ b/hw/spapr_pci.c
@@ -24,31 +24,46 @@
  */
 #include "hw.h"
 #include "pci.h"
+#include "msix.h"
+#include "msi.h"
 #include "pci_host.h"
 #include "hw/spapr.h"
 #include "hw/spapr_pci.h"
 #include "exec-memory.h"
 #include <libfdt.h>
+#include "trace.h"
 
 #include "hw/pci_internals.h"
 
-static PCIDevice *find_dev(sPAPREnvironment *spapr,
-                           uint64_t buid, uint32_t config_addr)
+static sPAPRPHBState *find_phb(sPAPREnvironment *spapr, uint64_t buid)
 {
-    DeviceState *qdev;
-    int devfn = (config_addr >> 8) & 0xFF;
     sPAPRPHBState *phb;
 
     QLIST_FOREACH(phb, &spapr->phbs, list) {
         if (phb->buid != buid) {
             continue;
         }
+        return phb;
+    }
 
-        QTAILQ_FOREACH(qdev, &phb->host_state.bus->qbus.children, sibling) {
-            PCIDevice *dev = (PCIDevice *)qdev;
-            if (dev->devfn == devfn) {
-                return dev;
-            }
+    return NULL;
+}
+
+static PCIDevice *find_dev(sPAPREnvironment *spapr, uint64_t buid,
+                           uint32_t config_addr)
+{
+    sPAPRPHBState *phb = find_phb(spapr, buid);
+    DeviceState *qdev;
+    int devfn = (config_addr >> 8) & 0xFF;
+
+    if (!phb) {
+        return NULL;
+    }
+
+    QTAILQ_FOREACH(qdev, &phb->host_state.bus->qbus.children, sibling) {
+        PCIDevice *dev = (PCIDevice *)qdev;
+        if (dev->devfn == devfn) {
+            return dev;
         }
     }
 
@@ -138,6 +153,220 @@ static void rtas_write_pci_config(sPAPREnvironment *spapr,
     rtas_st(rets, 0, 0);
 }
 
+/*
+ * Initializes req_num vectors for a device.
+ * The code assumes that MSI/MSIX is enabled in the config space
+ * as a result of msix_init() or msi_init().
+ */
+static int spapr_pci_config_msi(sPAPRPHBState *ph, int ndev,
+                                PCIDevice *pdev, bool msix, unsigned req_num)
+{
+    unsigned i;
+    int irq;
+    uint64_t msi_address;
+    uint32_t config_addr = pdev->devfn << 8;
+
+    /* Disabling - nothing to do */
+    if (0 == req_num) {
+        return 0;
+    }
+
+    /* Enabling! */
+    if (ph->msi_table[ndev].nvec && (req_num != ph->msi_table[ndev].nvec)) {
+        /* Unexpected behaviour */
+        fprintf(stderr, "Cannot reuse cached MSI config
[PATCH] pseries pci: removed redundant busdev
The PCIHostState struct already contains a SysBusDevice, so the one in
sPAPRPHBState has to go.

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
---
 hw/spapr_pci.c |    4 ++--
 hw/spapr_pci.h |    1 -
 2 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/hw/spapr_pci.c b/hw/spapr_pci.c
index 75943cf..1c0b605 100644
--- a/hw/spapr_pci.c
+++ b/hw/spapr_pci.c
@@ -215,7 +215,7 @@ static DMAContext *spapr_pci_dma_context_fn(PCIBus *bus, void *opaque,
 
 static int spapr_phb_init(SysBusDevice *s)
 {
-    sPAPRPHBState *phb = FROM_SYSBUS(sPAPRPHBState, s);
+    sPAPRPHBState *phb = DO_UPCAST(sPAPRPHBState, host_state.busdev, s);
     char *namebuf;
     int i;
     PCIBus *bus;
@@ -253,7 +253,7 @@ static int spapr_phb_init(SysBusDevice *s)
     memory_region_add_subregion(get_system_memory(), phb->io_win_addr,
                                 &phb->iowindow);
 
-    bus = pci_register_bus(&phb->busdev.qdev,
+    bus = pci_register_bus(&phb->host_state.busdev.qdev,
                            phb->busname ? phb->busname : phb->dtbusname,
                            pci_spapr_set_irq, pci_spapr_map_irq, phb,
                            &phb->memspace, &phb->iospace,
diff --git a/hw/spapr_pci.h b/hw/spapr_pci.h
index d9e46e2..a141764 100644
--- a/hw/spapr_pci.h
+++ b/hw/spapr_pci.h
@@ -28,7 +28,6 @@
 #include "hw/xics.h"
 
 typedef struct sPAPRPHBState {
-    SysBusDevice busdev;
    PCIHostState host_state;
 
     uint64_t buid;
--
1.7.7.3
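The `DO_UPCAST(sPAPRPHBState, host_state.busdev, s)` change works because the embedded member sits at a known offset inside its container, so the container pointer can be recovered from the member pointer. The sketch below shows the underlying `offsetof`-based idiom with made-up stand-in structs (the real QEMU macros also add type checks this sketch omits):

```c
#include <assert.h>
#include <stddef.h>

/* stand-ins for SysBusDevice / PCIHostState / sPAPRPHBState */
typedef struct SysBusDeviceLike { int id; } SysBusDeviceLike;
typedef struct HostStateLike   { SysBusDeviceLike busdev; int regs; } HostStateLike;
typedef struct PHBStateLike    { HostStateLike host_state; int buid; } PHBStateLike;

/* recover the containing struct from a pointer to an embedded member;
 * the member designator may be nested, e.g. host_state.busdev */
#define UPCAST(type, field, ptr) \
    ((type *)((char *)(ptr) - offsetof(type, field)))
```

Because `host_state` is the first member and `busdev` is the first member of `host_state`, all three pointers here are numerically equal, but the macro stays correct even if the members move later in the struct.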
[PATCH] pseries pci: spapr_populate_pci_devices renamed to spapr_populate_pci_dt
spapr_populate_pci_devices() populates the device tree only with bus
properties and has nothing to do with the devices on the bus, as PCI BAR
allocation is done by the system firmware (SLOF). The new name,
spapr_populate_pci_dt(), describes the functionality better.

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
---
 hw/spapr.c     |    2 +-
 hw/spapr_pci.c |    6 +++---
 hw/spapr_pci.h |    6 +++---
 3 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/hw/spapr.c b/hw/spapr.c
index 47b26ee..2e0b4b8 100644
--- a/hw/spapr.c
+++ b/hw/spapr.c
@@ -551,7 +551,7 @@ static void spapr_finalize_fdt(sPAPREnvironment *spapr,
     }
 
     QLIST_FOREACH(phb, &spapr->phbs, list) {
-        ret = spapr_populate_pci_devices(phb, PHANDLE_XICP, fdt);
+        ret = spapr_populate_pci_dt(phb, PHANDLE_XICP, fdt);
     }
 
     if (ret < 0) {
diff --git a/hw/spapr_pci.c b/hw/spapr_pci.c
index 1c0b605..269dbbf 100644
--- a/hw/spapr_pci.c
+++ b/hw/spapr_pci.c
@@ -345,9 +345,9 @@ void spapr_create_phb(sPAPREnvironment *spapr,
 #define b_fff(x)        b_x((x), 8, 3)  /* function number */
 #define b_rrrrrrrr(x)   b_x((x), 0, 8)  /* register number */
 
-int spapr_populate_pci_devices(sPAPRPHBState *phb,
-                               uint32_t xics_phandle,
-                               void *fdt)
+int spapr_populate_pci_dt(sPAPRPHBState *phb,
+                          uint32_t xics_phandle,
+                          void *fdt)
 {
     int bus_off, i, j;
     char nodename[256];
diff --git a/hw/spapr_pci.h b/hw/spapr_pci.h
index a141764..dd66f4b 100644
--- a/hw/spapr_pci.h
+++ b/hw/spapr_pci.h
@@ -55,8 +55,8 @@ void spapr_create_phb(sPAPREnvironment *spapr,
                       uint64_t mem_win_addr, uint64_t mem_win_size,
                       uint64_t io_win_addr);
 
-int spapr_populate_pci_devices(sPAPRPHBState *phb,
-                               uint32_t xics_phandle,
-                               void *fdt);
+int spapr_populate_pci_dt(sPAPRPHBState *phb,
+                          uint32_t xics_phandle,
+                          void *fdt);
 
 #endif /* __HW_SPAPR_PCI_H__ */
--
1.7.7.3
[PATCH] trace: added ability to comment out events in the list
It is convenient for debugging to be able to switch some events on and off
easily. The only possibility now is to remove the event name from the file
completely and type it again when we want it back.

The patch adds handling of the '#' symbol as a comment specifier.

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
---
 trace/control.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/trace/control.c b/trace/control.c
index 4c5527d..22d5863 100644
--- a/trace/control.c
+++ b/trace/control.c
@@ -27,6 +27,9 @@ void trace_backend_init_events(const char *fname)
         size_t len = strlen(line_buf);
         if (len > 1) {              /* skip empty lines */
             line_buf[len - 1] = '\0';
+            if ('#' == line_buf[0]) { /* skip commented lines */
+                continue;
+            }
             if (!trace_event_set_state(line_buf, true)) {
                 fprintf(stderr,
                         "error: trace event '%s' does not exist\n", line_buf);
--
1.7.7.3
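The parsing logic of the patch is small but easy to get subtly wrong: the newline must be stripped before the name is compared, and both empty and `#`-prefixed lines must be ignored. A standalone sketch of that per-line decision (the function name and return convention are made up; the real code loops over `fgets()` results inside `trace_backend_init_events()`):

```c
#include <string.h>

/* returns 1 if the line names a trace event, 0 if it should be ignored;
 * on success the trailing '\n' has been chopped off in place */
static int event_line_wanted(char *line_buf)
{
    size_t len = strlen(line_buf);

    if (len <= 1) {                      /* empty line: just "\n" */
        return 0;
    }
    line_buf[len - 1] = '\0';            /* strip the newline */
    if (line_buf[0] == '#') {            /* commented-out event */
        return 0;
    }
    return 1;
}
```

With this shape, commenting an event back in is a one-character edit to the events file instead of retyping the name.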
Re: [PATCH 0/3] adding MSI/MSIX for PCI on POWER
Forgot to CC: someone :)

On 14/06/12 14:29, Alexey Kardashevskiy wrote:
> The following patches add MSI/MSIX support for PCI on POWER. The first aim
> is virtio-pci, so that is what was tested. It will also support VFIO when
> that becomes publicly available.
>
> Alexey Kardashevskiy (3):
>   msi/msix: added functions to API to set up message address and data
>   pseries: added allocator for a block of IRQs
>   pseries pci: added MSI/MSIX support
>
>  hw/msi.c       |   14 +++
>  hw/msi.h       |    1 +
>  hw/msix.c      |   10 ++
>  hw/msix.h      |    3 +
>  hw/spapr.c     |   26 +-
>  hw/spapr.h     |    1 +
>  hw/spapr_pci.c |  266 +++++++++++++++++++++++++++++++++++++++++++++++++--
>  hw/spapr_pci.h |   13 +++-
>  trace-events   |    9 ++
>  9 files changed, 331 insertions(+), 12 deletions(-)

--
Alexey
Re: [Qemu-devel] [PATCH 1/3] msi/msix: added functions to API to set up message address and data
On 14/06/12 14:56, Alex Williamson wrote:
> On Thu, 2012-06-14 at 14:31 +1000, Alexey Kardashevskiy wrote:
>> Normally QEMU expects the guest to initialize MSI/MSIX vectors. However,
>> on POWER the guest uses the RTAS subsystem to configure MSI/MSIX and does
>> not write these vectors to the device's config space or MSIX BAR. On the
>> other hand, msi_notify()/msix_notify() write to these vectors to signal
>> the guest about an interrupt, so we have to write correct vectors to the
>> devices in order not to change every user of MSI/MSIX.
>>
>> The first aim is to support MSIX for virtio-pci on POWER. There is another
>> patch for POWER coming which introduces a special memory region where
>> MSI/MSIX vectors point to.
>>
>> +void msi_set_address_data(PCIDevice *dev, uint64_t address, uint16_t data)
>> +{
>> +    uint16_t flags = pci_get_word(dev->config + msi_flags_off(dev));
>> +    bool msi64bit = flags & PCI_MSI_FLAGS_64BIT;
>> +
>> +    if (msi64bit) {
>> +        pci_set_quad(dev->config + msi_address_lo_off(dev), address);
>> +    } else {
>> +        pci_set_long(dev->config + msi_address_lo_off(dev), address);
>> +    }
>> +    pci_set_word(dev->config + msi_data_off(dev, msi64bit), data);
>> +}
>
> Why not make it msi_set_message() and pass an MSIMessage? It'd be great if
> you tossed in a msi_get_message() as well; I think we need it to be able
> to do a kvm_irqchip_add_msi_route() with MSI. Thanks,
>
> Alex

I am missing the point. What is that MSIMessage? It is just an address and
data; making a struct from this is a bit too much :) I am totally unfamiliar
with kvm_irqchip_add_msi_route, so I cannot see the bigger picture, sorry.

--
Alexey
Re: [Qemu-devel] [PATCH 1/3] msi/msix: added functions to API to set up message address and data
On 14/06/12 15:38, Alex Williamson wrote:
> On Thu, 2012-06-14 at 15:17 +1000, Alexey Kardashevskiy wrote:
>> On 14/06/12 14:56, Alex Williamson wrote:
>>> Why not make it msi_set_message() and pass an MSIMessage? It'd be great
>>> if you tossed in a msi_get_message() as well; I think we need it to be
>>> able to do a kvm_irqchip_add_msi_route() with MSI. Thanks,
>>
>> I am missing the point. What is that MSIMessage? It is just an address
>> and data; making a struct from this is a bit too much :) I am totally
>> unfamiliar with kvm_irqchip_add_msi_route, so I cannot see the bigger
>> picture, sorry.
>
> MSIVectorUseNotifier passes an MSIMessage back to the device when a vector
> is unmasked. We can then add a route in KVM for that message with
> kvm_irqchip_add_msi_route. Finally, kvm_irqchip_add_irqfd allows us to
> connect that MSI route to an eventfd, such as from virtio or vfio. Then
> MSI eventfds can bypass qemu and be injected directly into KVM and on into
> the guest. So we seem to already have some standardization on passing
> address/data via an MSIMessage. You need a set interface, I need a get
> interface. msix already has a static msix_get_message(). So I'd suggest
> that an exported get/set_message for each seems like the right way to go.
> Thanks,

Ok. Slowly :) What QEMU tree are you talking about? git, branch? There is
neither MSIVectorUseNotifier nor MSIMessage in your or my trees.

--
Alexey
Re: [Qemu-devel] [PATCH] trace: added ability to comment out events in the list
On 14/06/12 23:18, Stefan Hajnoczi wrote:
> On Thu, Jun 14, 2012 at 02:41:40PM +1000, Alexey Kardashevskiy wrote:
>> It is convenient for debugging to be able to switch some events on and
>> off easily. The only possibility now is to remove the event name from
>> the file completely and type it again when we want it back.
>>
>> The patch adds handling of the '#' symbol as a comment specifier.
>>
>> Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
>> ---
>>  trace/control.c |    3 +++
>>  1 files changed, 3 insertions(+), 0 deletions(-)
>
> Thanks, applied to the tracing patches tree:
> https://github.com/stefanha/qemu/commits/tracing

Cannot find it there though :)

--
Alexey
[PATCH] msi/msix: added functions to API to set up message address and data
Ok, another try. Is it any better now? :)

Normally QEMU expects the guest to initialize MSI/MSIX vectors. However, on
POWER the guest uses the RTAS subsystem to configure MSI/MSIX and does not
write these vectors to the device's config space or MSIX BAR. On the other
hand, msi_notify()/msix_notify() write to these vectors to signal the guest
about an interrupt, so we have to write correct vectors to the devices in
order not to change every user of MSI/MSIX.

The first aim is to support MSIX for virtio-pci on POWER. There is another
patch for POWER coming which introduces a special memory region where
MSI/MSIX vectors point to.

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
---
 hw/msi.c  |   14 ++++++++++++++
 hw/msi.h  |    1 +
 hw/msix.c |    8 ++++++++
 hw/msix.h |    3 +++
 4 files changed, 26 insertions(+)

diff --git a/hw/msi.c b/hw/msi.c
index 5233204..c7b3e6a 100644
--- a/hw/msi.c
+++ b/hw/msi.c
@@ -363,3 +363,17 @@ unsigned int msi_nr_vectors_allocated(const PCIDevice *dev)
     uint16_t flags = pci_get_word(dev->config + msi_flags_off(dev));
     return msi_nr_vectors(flags);
 }
+
+void msi_set_address_data(PCIDevice *dev, uint64_t address, uint16_t data)
+{
+    uint16_t flags = pci_get_word(dev->config + msi_flags_off(dev));
+    bool msi64bit = flags & PCI_MSI_FLAGS_64BIT;
+
+    if (msi64bit) {
+        pci_set_quad(dev->config + msi_address_lo_off(dev), address);
+    } else {
+        pci_set_long(dev->config + msi_address_lo_off(dev), address);
+    }
+    pci_set_word(dev->config + msi_data_off(dev, msi64bit), data);
+}
+
diff --git a/hw/msi.h b/hw/msi.h
index 75747ab..353386e 100644
--- a/hw/msi.h
+++ b/hw/msi.h
@@ -39,6 +39,7 @@ void msi_reset(PCIDevice *dev);
 void msi_notify(PCIDevice *dev, unsigned int vector);
 void msi_write_config(PCIDevice *dev, uint32_t addr, uint32_t val, int len);
 unsigned int msi_nr_vectors_allocated(const PCIDevice *dev);
+void msi_set_address_data(PCIDevice *dev, uint64_t address, uint16_t data);
 
 static inline bool msi_present(const PCIDevice *dev)
 {
diff --git a/hw/msix.c b/hw/msix.c
index ded3c55..08e773d 100644
--- a/hw/msix.c
+++ b/hw/msix.c
@@ -526,3 +526,11 @@ void msix_unset_vector_notifiers(PCIDevice *dev)
     dev->msix_vector_use_notifier = NULL;
     dev->msix_vector_release_notifier = NULL;
 }
+void msix_set_address_data(PCIDevice *dev, int vector,
+                           uint64_t address, uint32_t data)
+{
+    uint8_t *table_entry = dev->msix_table_page + vector * PCI_MSIX_ENTRY_SIZE;
+    pci_set_quad(table_entry + PCI_MSIX_ENTRY_LOWER_ADDR, address);
+    pci_set_long(table_entry + PCI_MSIX_ENTRY_DATA, data);
+    table_entry[PCI_MSIX_ENTRY_VECTOR_CTRL] &= ~PCI_MSIX_ENTRY_CTRL_MASKBIT;
+}
diff --git a/hw/msix.h b/hw/msix.h
index 50aee82..901f101 100644
--- a/hw/msix.h
+++ b/hw/msix.h
@@ -35,4 +35,7 @@ int msix_set_vector_notifiers(PCIDevice *dev,
                               MSIVectorUseNotifier use_notifier,
                               MSIVectorReleaseNotifier release_notifier);
 void msix_unset_vector_notifiers(PCIDevice *dev);
+void msix_set_address_data(PCIDevice *dev, int vector,
+                           uint64_t address, uint32_t data);
+
 #endif
--
1.7.10

On 14/06/12 15:45, Jan Kiszka wrote:
> On 2012-06-14 07:17, Alexey Kardashevskiy wrote:
>> On 14/06/12 14:56, Alex Williamson wrote:
>>> Why not make it msi_set_message
[PATCH] msi/msix: added public API to set/get MSI message address and data
agrhhh. sha1 of the patch changed after rebasing :) Added (msi|msix)_(set|get)_message() function for whoever might want to use them. Currently msi_notify()/msix_notify() write to these vectors to signal the guest about an interrupt so the correct values have to written there by the guest or QEMU. For example, POWER guest never initializes MSI/MSIX vectors, instead it uses RTAS hypercalls. So in order to support MSIX for virtio-pci on POWER we have to initialize MSI/MSIX message from QEMU. As only set* function are required by now, the get functions were added or made public for a symmetry. Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru --- hw/msi.c | 29 + hw/msi.h |2 ++ hw/msix.c | 11 ++- hw/msix.h |3 +++ 4 files changed, 44 insertions(+), 1 deletion(-) diff --git a/hw/msi.c b/hw/msi.c index 5233204..9ad84a4 100644 --- a/hw/msi.c +++ b/hw/msi.c @@ -105,6 +105,35 @@ static inline uint8_t msi_pending_off(const PCIDevice* dev, bool msi64bit) return dev-msi_cap + (msi64bit ? PCI_MSI_PENDING_64 : PCI_MSI_PENDING_32); } +MSIMessage msi_get_message(PCIDevice *dev) +{ +uint16_t flags = pci_get_word(dev-config + msi_flags_off(dev)); +bool msi64bit = flags PCI_MSI_FLAGS_64BIT; +MSIMessage msg; + +if (msi64bit) { +msg.address = pci_get_quad(dev-config + msi_address_lo_off(dev)); +} else { +msg.address = pci_get_long(dev-config + msi_address_lo_off(dev)); +} +msg.data = pci_get_word(dev-config + msi_data_off(dev, msi64bit)); + +return msg; +} + +void msi_set_message(PCIDevice *dev, MSIMessage msg) +{ +uint16_t flags = pci_get_word(dev-config + msi_flags_off(dev)); +bool msi64bit = flags PCI_MSI_FLAGS_64BIT; + +if (msi64bit) { +pci_set_quad(dev-config + msi_address_lo_off(dev), msg.address); +} else { +pci_set_long(dev-config + msi_address_lo_off(dev), msg.address); +} +pci_set_word(dev-config + msi_data_off(dev, msi64bit), msg.data); +} + bool msi_enabled(const PCIDevice *dev) { return msi_present(dev) diff --git a/hw/msi.h b/hw/msi.h index 75747ab..4b0f4f8 100644 --- 
a/hw/msi.h +++ b/hw/msi.h @@ -31,6 +31,8 @@ struct MSIMessage { extern bool msi_supported; +MSIMessage msi_get_message(PCIDevice *dev); +void msi_set_message(PCIDevice *dev, MSIMessage msg); bool msi_enabled(const PCIDevice *dev); int msi_init(struct PCIDevice *dev, uint8_t offset, unsigned int nr_vectors, bool msi64bit, bool msi_per_vector_mask); diff --git a/hw/msix.c b/hw/msix.c index ded3c55..9e8d8bb 100644 --- a/hw/msix.c +++ b/hw/msix.c @@ -35,7 +35,7 @@ #define MSIX_PAGE_PENDING (MSIX_PAGE_SIZE / 2) #define MSIX_MAX_ENTRIES 32 -static MSIMessage msix_get_message(PCIDevice *dev, unsigned vector) +MSIMessage msix_get_message(PCIDevice *dev, unsigned vector) { uint8_t *table_entry = dev-msix_table_page + vector * PCI_MSIX_ENTRY_SIZE; MSIMessage msg; @@ -45,6 +45,15 @@ static MSIMessage msix_get_message(PCIDevice *dev, unsigned vector) return msg; } +void msix_set_message(PCIDevice *dev, int vector, struct MSIMessage msg) +{ +uint8_t *table_entry = dev-msix_table_page + vector * PCI_MSIX_ENTRY_SIZE; + +pci_set_quad(table_entry + PCI_MSIX_ENTRY_LOWER_ADDR, msg.address); +pci_set_long(table_entry + PCI_MSIX_ENTRY_DATA, msg.data); +table_entry[PCI_MSIX_ENTRY_VECTOR_CTRL] = ~PCI_MSIX_ENTRY_CTRL_MASKBIT; +} + /* Add MSI-X capability to the config space for the device. */ /* Given a bar and its size, add MSI-X table on top of it * and fill MSI-X capability in the config space. diff --git a/hw/msix.h b/hw/msix.h index 50aee82..3374cf8 100644 --- a/hw/msix.h +++ b/hw/msix.h @@ -4,6 +4,9 @@ #include qemu-common.h #include pci.h +MSIMessage msix_get_message(PCIDevice *dev, unsigned vector); +void msix_set_message(PCIDevice *dev, int vector, MSIMessage msg); + int msix_init(PCIDevice *pdev, unsigned short nentries, MemoryRegion *bar, unsigned bar_nr, unsigned bar_size); -- 1.7.10 On 21/06/12 16:53, Jan Kiszka wrote: On 2012-06-21 08:46, Alexey Kardashevskiy wrote: Ok, another try. Is it any better now? :) No - posted the old version accidentally? 
Jan

Normally QEMU expects the guest to initialize MSI/MSIX vectors. However on
POWER the guest uses the RTAS subsystem to configure MSI/MSIX and does not
write these vectors to the device's config space or MSIX BAR.

On the other hand, msi_notify()/msix_notify() write to these vectors to
signal the guest about an interrupt, so we have to write correct vectors to
the devices in order not to change every user of MSI/MSIX.

The first aim is to support MSIX for virtio-pci on POWER. There is another
patch for POWER coming which introduces a special memory region where
MSI/MSIX vectors point to.

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
---
 hw/msi.c  | 14 ++
 hw/msi.h  |  1 +
 hw/msix.c |  8
Re: [PATCH] msi/msix: added public API to set/get MSI message address, and data
On 21/06/12 20:38, Jan Kiszka wrote:
> On 2012-06-21 12:28, Alexey Kardashevskiy wrote:
>> On 21/06/12 17:39, Jan Kiszka wrote:
>>> On 2012-06-21 09:18, Alexey Kardashevskiy wrote:
>>>> agrhhh. sha1 of the patch changed after rebasing :)
>>>>
>>>> [...]
>>>>
>>>> +MSIMessage msi_get_message(PCIDevice *dev)
>>>
>>> MSIMessage msi_get_message(PCIDevice *dev, unsigned vector)
>>
>> Who/how/why is going to calculate the vector here?
>>
>>>> +{
>>>> +    uint16_t flags = pci_get_word(dev->config + msi_flags_off(dev));
>>>> +    bool msi64bit = flags & PCI_MSI_FLAGS_64BIT;
>>>> +    MSIMessage msg;
>>>> +
>>>> +    if (msi64bit) {
>>>> +        msg.address = pci_get_quad(dev->config + msi_address_lo_off(dev));
>>>> +    } else {
>>>> +        msg.address = pci_get_long(dev->config + msi_address_lo_off(dev));
>>>> +    }
>>>> +    msg.data = pci_get_word(dev->config + msi_data_off(dev, msi64bit));
>>>
>>> And I have this here in addition:
>>>
>>>     unsigned int nr_vectors = msi_nr_vectors(flags);
>>>     ...
>>>     if (nr_vectors > 1) {
>>>         msg.data &= ~(nr_vectors - 1);
>>>         msg.data |= vector;
>>>     }
>>>
>>> See the PCI spec and existing code.
>>
>> What for? I really do not get why someone might want to read something
>> other than the real value. Which PCI code should I look at?
>
> I'm not sure what your use case for reading the message is. For KVM device
> assignment it is preparing an alternative message delivery path for MSI
> vectors. And for this we will need vector notifier support for MSI as
> well. You can check the MSI-X code for corresponding use cases of
> msix_get_message. And when we already have msi_get_message, another
> logical use case is msi_notify. See msix.c again.

Aaaa. I have no case for reading the message. All I need is writing. And I
want it public as I want to use it from hw/spapr_pci.c. You suggested
adding reading, so I added get to be _symmetric_ with set (get returns what
set wrote). You want a different thing, which I can do, but it is not
msi_get_message(), it is something like msi_prepare_message(MSIMessage msg)
or msi_set_vector(uint16_t data), or simply the internal kitchen of
msi_notify(). I can still do what you suggested, it just does not seem
right.

-- 
Alexey
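Jan's suggested addition encodes the vector number in the low bits of the message data, which is how multi-message MSI selects among N (power-of-two) vectors: the low log2(N) bits of the data value are modifiable per vector. A standalone sketch of that arithmetic (the function name is ours, not QEMU's):

```c
#include <assert.h>
#include <stdint.h>

/* Model of the per-vector MSI data adjustment Jan describes: with
 * nr_vectors > 1 enabled (always a power of two per the PCI spec),
 * clear the low log2(nr_vectors) bits of the base data value and
 * substitute the vector number. */
static uint16_t msi_data_for_vector(uint16_t base_data,
                                    unsigned nr_vectors, unsigned vector)
{
    if (nr_vectors > 1) {
        base_data &= ~(nr_vectors - 1);  /* clear the per-vector bits */
        base_data |= vector;             /* select this vector */
    }
    return base_data;
}
```

With a single vector the data value is used unchanged, which is why the plain set/get pair in the patch is sufficient for the POWER use case.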
[PATCH] msi/msix: added API to set MSI message address and data
Added (msi|msix)_set_message() functions.

Currently msi_notify()/msix_notify() write to these vectors to signal the
guest about an interrupt, so the correct values have to be written there by
the guest or QEMU. For example, a POWER guest never initializes MSI/MSIX
vectors; instead it uses RTAS hypercalls. So in order to support MSIX for
virtio-pci on POWER we have to initialize the MSI/MSIX message from QEMU.

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
---
 hw/msi.c  | 13 +++++++++++++
 hw/msi.h  |  1 +
 hw/msix.c |  9 +++++++++
 hw/msix.h |  2 ++
 4 files changed, 25 insertions(+)

diff --git a/hw/msi.c b/hw/msi.c
index 5233204..cc6102f 100644
--- a/hw/msi.c
+++ b/hw/msi.c
@@ -105,6 +105,19 @@ static inline uint8_t msi_pending_off(const PCIDevice* dev, bool msi64bit)
     return dev->msi_cap + (msi64bit ? PCI_MSI_PENDING_64 : PCI_MSI_PENDING_32);
 }
 
+void msi_set_message(PCIDevice *dev, MSIMessage msg)
+{
+    uint16_t flags = pci_get_word(dev->config + msi_flags_off(dev));
+    bool msi64bit = flags & PCI_MSI_FLAGS_64BIT;
+
+    if (msi64bit) {
+        pci_set_quad(dev->config + msi_address_lo_off(dev), msg.address);
+    } else {
+        pci_set_long(dev->config + msi_address_lo_off(dev), msg.address);
+    }
+    pci_set_word(dev->config + msi_data_off(dev, msi64bit), msg.data);
+}
+
 bool msi_enabled(const PCIDevice *dev)
 {
     return msi_present(dev) &&

diff --git a/hw/msi.h b/hw/msi.h
index 75747ab..6ec1f99 100644
--- a/hw/msi.h
+++ b/hw/msi.h
@@ -31,6 +31,7 @@ struct MSIMessage {
 
 extern bool msi_supported;
 
+void msi_set_message(PCIDevice *dev, MSIMessage msg);
 bool msi_enabled(const PCIDevice *dev);
 int msi_init(struct PCIDevice *dev, uint8_t offset,
              unsigned int nr_vectors, bool msi64bit, bool msi_per_vector_mask);

diff --git a/hw/msix.c b/hw/msix.c
index ded3c55..5f7d6d3 100644
--- a/hw/msix.c
+++ b/hw/msix.c
@@ -45,6 +45,15 @@ static MSIMessage msix_get_message(PCIDevice *dev, unsigned vector)
     return msg;
 }
 
+void msix_set_message(PCIDevice *dev, int vector, struct MSIMessage msg)
+{
+    uint8_t *table_entry = dev->msix_table_page + vector * PCI_MSIX_ENTRY_SIZE;
+
+    pci_set_quad(table_entry + PCI_MSIX_ENTRY_LOWER_ADDR, msg.address);
+    pci_set_long(table_entry + PCI_MSIX_ENTRY_DATA, msg.data);
+    table_entry[PCI_MSIX_ENTRY_VECTOR_CTRL] &= ~PCI_MSIX_ENTRY_CTRL_MASKBIT;
+}
+
 /* Add MSI-X capability to the config space for the device. */
 /* Given a bar and its size, add MSI-X table on top of it
  * and fill MSI-X capability in the config space.

diff --git a/hw/msix.h b/hw/msix.h
index 50aee82..26a437e 100644
--- a/hw/msix.h
+++ b/hw/msix.h
@@ -4,6 +4,8 @@
 #include "qemu-common.h"
 #include "pci.h"
 
+void msix_set_message(PCIDevice *dev, int vector, MSIMessage msg);
+
 int msix_init(PCIDevice *pdev, unsigned short nentries,
               MemoryRegion *bar, unsigned bar_nr, unsigned bar_size);
-- 
1.7.10

ps. A double '-' plus the git version is an end-of-patch scissors, as I
read somewhere, cannot recall where exactly :)

On 21/06/12 20:56, Jan Kiszka wrote:
> On 2012-06-21 12:50, Alexey Kardashevskiy wrote:
>> On 21/06/12 20:38, Jan Kiszka wrote:
>>> On 2012-06-21 12:28, Alexey Kardashevskiy wrote:
>>>> On 21/06/12 17:39, Jan Kiszka wrote:
>>>>> On 2012-06-21 09:18, Alexey Kardashevskiy wrote:
>>>>>> [...]
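For reference, the config-space layout that msi_set_message() writes can be modelled standalone. The offsets follow the PCI spec (address-low dword at cap+4; with the 64-bit flag set, address-high at cap+8 and data at cap+12, otherwise data at cap+8); the helper names below are hypothetical models, not QEMU API:

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_MSI_FLAGS_64BIT 0x0080  /* bit 7 of Message Control */

/* store v little-endian, as pci_set_long()/pci_set_quad() do */
static void put_le(uint8_t *p, uint64_t v, int nbytes)
{
    for (int i = 0; i < nbytes; i++) {
        p[i] = (uint8_t)(v >> (8 * i));
    }
}

/* mirrors msi_set_message(): a 64-bit capability gets an 8-byte
 * address at cap+4 and data at cap+12; a 32-bit one gets a 4-byte
 * address at cap+4 and data at cap+8 */
static void model_msi_set_message(uint8_t *cap, uint16_t flags,
                                  uint64_t address, uint16_t data)
{
    int msi64bit = flags & MODEL_MSI_FLAGS_64BIT;

    put_le(cap + 4, address, msi64bit ? 8 : 4);
    put_le(cap + (msi64bit ? 12 : 8), data, 2);
}
```

This is why the real code keys both `pci_set_quad`/`pci_set_long` and the data offset off the same `msi64bit` flag: the data register moves when the high address dword is present.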
Re: [PATCH] msi/msix: added API to set MSI message address and data
On 21/06/12 21:49, Jan Kiszka wrote:
> On 2012-06-21 13:39, Alexey Kardashevskiy wrote:
>> Added (msi|msix)_set_message() functions.
>>
>> [...]
>
> Interface looks good as far as I can tell (can't assess the POWER need
> for clearing the mask bit on msix_set_message).

I do not know exactly how x86 works (who/how allocates addresses for
MSI/MSIX). On POWER, at the moment, I did the following thing in QEMU:

- registered a memory_region_init_io region at some big address which the
  guest won't use, it is just for QEMU;
- put the address from the previous step into the MSIX BAR via
  msix_set_message() when MSI is being configured;
- then the sequence looks like:
  - vfio_msi_interrupt() calls msix_notify();
  - msix_notify() checks if it is masked via msix_is_masked() - and here
    PCI_MSIX_ENTRY_CTRL_MASKBIT must be unset;
  - stl_le_phys() - here I get a notification in my
    MemoryRegionOps::write() and do qemu_irq_pulse().

Two reasons to do that:
1) I did not have to change either msix or vfio - cool for submitting
   patches;
2) neither the POWER guest nor QEMU changes the MSI or MSIX PCI config (it
   is done by a different mechanism called RTAS), so I have to do this
   myself to support 1) and I do not have to care about someone breaking
   my settings.

>> ps. A double '-' plus the git version is an end-of-patch scissors, as I
>> read somewhere, cannot recall where exactly
>
> Check man git-am.

Ahhh. Confused end-of-message with end-of-patch. I'll repost it.

-- 
Alexey
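The flow described above - msix_notify() storing the message data at an address that lands in a QEMU-owned MMIO window, whose write handler pulses the corresponding guest IRQ - can be sketched as a plain-C model. The window base, size and all names here are illustrative stand-ins, not the actual sPAPR or QEMU code:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative MSI window: one 32-bit cell per guest IRQ, placed at a
 * "big address which the guest won't use". */
#define MSI_WINDOW_BASE 0x40000000000ULL
#define MSI_WINDOW_IRQS 64

static int pulse_count[MSI_WINDOW_IRQS];

/* stand-in for qemu_irq_pulse() */
static void irq_pulse(unsigned irq)
{
    pulse_count[irq]++;
}

/* stand-in for the MemoryRegionOps write callback: the offset into the
 * window selects the IRQ; the stored data value itself is ignored */
static void msi_window_write(uint64_t addr, uint32_t value)
{
    unsigned irq = (unsigned)((addr - MSI_WINDOW_BASE) / sizeof(uint32_t));

    (void)value;
    if (irq < MSI_WINDOW_IRQS) {
        irq_pulse(irq);
    }
}
```

In this scheme msix_set_message() would have stored `MSI_WINDOW_BASE + 4 * irq` as the vector's address, so the generic `stl_le_phys()` done by msix_notify() is what ends up raising the right interrupt, with no changes to msix or vfio code.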
[PATCH] msi/msix: added API to set MSI message address and data
Added (msi|msix)_set_message() functions for whoever might want to use
them.

Currently msi_notify()/msix_notify() write to these vectors to signal the
guest about an interrupt, so the correct values have to be written there by
the guest or QEMU. For example, a POWER guest never initializes MSI/MSIX
vectors; instead it uses RTAS hypercalls. So in order to support MSIX for
virtio-pci on POWER we have to initialize the MSI/MSIX message from QEMU.

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
---
 hw/msi.c  | 13 +++++++++++++
 hw/msi.h  |  1 +
 hw/msix.c |  9 +++++++++
 hw/msix.h |  2 ++
 4 files changed, 25 insertions(+)

diff --git a/hw/msi.c b/hw/msi.c
index 5233204..cc6102f 100644
--- a/hw/msi.c
+++ b/hw/msi.c
@@ -105,6 +105,19 @@ static inline uint8_t msi_pending_off(const PCIDevice* dev, bool msi64bit)
     return dev->msi_cap + (msi64bit ? PCI_MSI_PENDING_64 : PCI_MSI_PENDING_32);
 }
 
+void msi_set_message(PCIDevice *dev, MSIMessage msg)
+{
+    uint16_t flags = pci_get_word(dev->config + msi_flags_off(dev));
+    bool msi64bit = flags & PCI_MSI_FLAGS_64BIT;
+
+    if (msi64bit) {
+        pci_set_quad(dev->config + msi_address_lo_off(dev), msg.address);
+    } else {
+        pci_set_long(dev->config + msi_address_lo_off(dev), msg.address);
+    }
+    pci_set_word(dev->config + msi_data_off(dev, msi64bit), msg.data);
+}
+
 bool msi_enabled(const PCIDevice *dev)
 {
     return msi_present(dev) &&

diff --git a/hw/msi.h b/hw/msi.h
index 75747ab..6ec1f99 100644
--- a/hw/msi.h
+++ b/hw/msi.h
@@ -31,6 +31,7 @@ struct MSIMessage {
 
 extern bool msi_supported;
 
+void msi_set_message(PCIDevice *dev, MSIMessage msg);
 bool msi_enabled(const PCIDevice *dev);
 int msi_init(struct PCIDevice *dev, uint8_t offset,
              unsigned int nr_vectors, bool msi64bit, bool msi_per_vector_mask);

diff --git a/hw/msix.c b/hw/msix.c
index ded3c55..5f7d6d3 100644
--- a/hw/msix.c
+++ b/hw/msix.c
@@ -45,6 +45,15 @@ static MSIMessage msix_get_message(PCIDevice *dev, unsigned vector)
     return msg;
 }
 
+void msix_set_message(PCIDevice *dev, int vector, struct MSIMessage msg)
+{
+    uint8_t *table_entry = dev->msix_table_page + vector * PCI_MSIX_ENTRY_SIZE;
+
+    pci_set_quad(table_entry + PCI_MSIX_ENTRY_LOWER_ADDR, msg.address);
+    pci_set_long(table_entry + PCI_MSIX_ENTRY_DATA, msg.data);
+    table_entry[PCI_MSIX_ENTRY_VECTOR_CTRL] &= ~PCI_MSIX_ENTRY_CTRL_MASKBIT;
+}
+
 /* Add MSI-X capability to the config space for the device. */
 /* Given a bar and its size, add MSI-X table on top of it
  * and fill MSI-X capability in the config space.

diff --git a/hw/msix.h b/hw/msix.h
index 50aee82..26a437e 100644
--- a/hw/msix.h
+++ b/hw/msix.h
@@ -4,6 +4,8 @@
 #include "qemu-common.h"
 #include "pci.h"
 
+void msix_set_message(PCIDevice *dev, int vector, MSIMessage msg);
+
 int msix_init(PCIDevice *pdev, unsigned short nentries,
               MemoryRegion *bar, unsigned bar_nr, unsigned bar_size);
-- 
1.7.10
Re: [PATCH] msi/msix: added API to set MSI message address and data
Ping?

On 22/06/12 11:15, Alexey Kardashevskiy wrote:
> Added (msi|msix)_set_message() functions for whoever might want to use
> them.
>
> Currently msi_notify()/msix_notify() write to these vectors to signal the
> guest about an interrupt, so the correct values have to be written there
> by the guest or QEMU. For example, a POWER guest never initializes
> MSI/MSIX vectors; instead it uses RTAS hypercalls. So in order to support
> MSIX for virtio-pci on POWER we have to initialize the MSI/MSIX message
> from QEMU.
>
> Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
> ---
>  hw/msi.c  | 13 +++++++++++++
>  hw/msi.h  |  1 +
>  hw/msix.c |  9 +++++++++
>  hw/msix.h |  2 ++
>  4 files changed, 25 insertions(+)
>
> [...]

-- 
Alexey
Re: [PATCH] msi/msix: added API to set MSI message address and data
On 18/07/12 22:43, Michael S. Tsirkin wrote:
> On Thu, Jun 21, 2012 at 09:39:10PM +1000, Alexey Kardashevskiy wrote:
>> Added (msi|msix)_set_message() functions.
>>
>> [...]
>>
>> Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
>
> So guests do enable MSI through config space, but do not fill in vectors?

Yes. msix_capability_init() calls arch_setup_msi_irqs() which does
everything it needs to do (i.e. calls the hypervisor) before
msix_capability_init() writes PCI_MSIX_FLAGS_ENABLE to the PCI_MSIX_FLAGS
register.

These vectors are the PCI bus addresses, and the way they are set is
specific to a PCI host controller; I do not see why the current scheme is
a bug.

> Very strange. Are you sure it's not just a guest bug? How does it work
> for other PCI devices?

Did not get the question. It works the same for every PCI device under a
POWER guest.

> Can't we just fix guest drivers to program the vectors properly? Also pls
> address the comment below.

Comment below. Thanks!

>> +void msi_set_message(PCIDevice *dev, MSIMessage msg)
>> +{
>> +    uint16_t flags = pci_get_word(dev->config + msi_flags_off(dev));
>> +    bool msi64bit = flags & PCI_MSI_FLAGS_64BIT;
>> +
>> +    if (msi64bit) {
>> +        pci_set_quad(dev->config + msi_address_lo_off(dev), msg.address);
>> +    } else {
>> +        pci_set_long(dev->config + msi_address_lo_off(dev), msg.address);
>> +    }
>> +    pci_set_word(dev->config + msi_data_off(dev, msi64bit), msg.data);
>> +}
>> +
>
> Please add documentation. Something like
>
> /*
>  * Special API for POWER to configure the vectors through
>  * a side channel. Should never be used by devices.
>  */

It is useful for any para-virtualized environment, I believe, is it not?
For s390 as well. Of course, only if it supports PCI, which I am not sure
it does though :)

>> [...]
Re: [PATCH] msi/msix: added API to set MSI message address and data
On 19/07/12 01:23, Michael S. Tsirkin wrote:
> On Wed, Jul 18, 2012 at 11:17:12PM +1000, Alexey Kardashevskiy wrote:
>> On 18/07/12 22:43, Michael S. Tsirkin wrote:
>>> So guests do enable MSI through config space, but do not fill in
>>> vectors?
>>
>> Yes. msix_capability_init() calls arch_setup_msi_irqs() which does
>> everything it needs to do (i.e. calls the hypervisor) before
>> msix_capability_init() writes PCI_MSIX_FLAGS_ENABLE to the
>> PCI_MSIX_FLAGS register.
>>
>> These vectors are the PCI bus addresses, and the way they are set is
>> specific to a PCI host controller; I do not see why the current scheme
>> is a bug.
>
> It won't work with any real PCI device, will it? Real PCI devices expect
> vectors to be written into their memory.

Yes. And the hypervisor does this. On POWER (at least book3s, i.e. server
powerpc), the whole config space kitchen is hidden behind RTAS (a kind of
BIOS). For the guest, this RTAS is implemented in the hypervisor; for the
host, in the system firmware. So powerpc Linux does not have to have PHB
drivers. Kinda cool.

A usual powerpc server runs without the host Linux at all; it runs a
hypervisor called pHyp. And every guest knows that it is a guest; there is
no full machine emulation, it is para-virtualization. In POWER KVM, we
replace that pHyp with the host Linux, and now QEMU plays the hypervisor
role. Some day we will move the hypervisor to the host kernel completely
(?) but for now it is in QEMU.

>>> Very strange. Are you sure it's not just a guest bug? How does it work
>>> for other PCI devices?
>>
>> Did not get the question. It works the same for every PCI device under a
>> POWER guest.
>
> I mean for real PCI devices.
>
>>> Can't we just fix guest drivers to program the vectors properly? Also
>>> pls address the comment below.
>>
>> Comment below. Thanks!
>>
>>> Please add documentation. [...]
>>
>> It is useful for any para-virtualized environment, I believe, is it
>> not? For s390 as well. Of course, only if it supports PCI, which I am
>> not sure it does though :)
>
> I expect the normal guest to program the address into the MSI register
> using config accesses, same way that it enables MSI/MSIX. Why POWER does
> it differently I did not yet figure out, but I hope this weirdness is not
> so widespread. In para-virt I would expect the guest not to touch config
> space at all.

At least it should use one interface rather than two, but this is how it
is.

>> [...]
[PATCH] powerpc-kvm: fixing page alignment for TCE
From: Paul Mackerras pau...@samba.org

TODO: ask Paul to make a proper message.

This is the fix for a host kernel compiled with a page size other than 4K
(the TCE page size). In the case of a 64K page size, the host used to lose
address bits in hpte_rpn(). The patch fixes it.

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
---
 arch/powerpc/kvm/book3s_64_mmu_hv.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 80a5775..a41f11b 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -503,7 +503,7 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu,
 	struct kvm *kvm = vcpu->kvm;
 	unsigned long *hptep, hpte[3], r;
 	unsigned long mmu_seq, psize, pte_size;
-	unsigned long gfn, hva, pfn;
+	unsigned long gpa, gfn, hva, pfn;
 	struct kvm_memory_slot *memslot;
 	unsigned long *rmap;
 	struct revmap_entry *rev;
@@ -541,15 +541,14 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu,
 
 	/* Translate the logical address and get the page */
 	psize = hpte_page_size(hpte[0], r);
-	gfn = hpte_rpn(r, psize);
+	gpa = (r & HPTE_R_RPN & ~(psize - 1)) | (ea & (psize - 1));
+	gfn = gpa >> PAGE_SHIFT;
 	memslot = gfn_to_memslot(kvm, gfn);
 
 	/* No memslot means it's an emulated MMIO region */
-	if (!memslot || (memslot->flags & KVM_MEMSLOT_INVALID)) {
-		unsigned long gpa = (gfn << PAGE_SHIFT) | (ea & (psize - 1));
+	if (!memslot || (memslot->flags & KVM_MEMSLOT_INVALID))
 		return kvmppc_hv_emulate_mmio(run, vcpu, gpa, ea,
 					      dsisr & DSISR_ISSTORE);
-	}
 
 	if (!kvm->arch.using_mmu_notifiers)
 		return -EFAULT;	/* should never get here */
-- 
1.7.10.4
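The new gpa computation can be checked in isolation. Below is a sketch with an illustrative RPN mask (the real HPTE_R_RPN is defined in the powerpc headers, so the constant here is an assumption for the model): the HPTE second doubleword's RPN bits are masked down to the HPTE page size, and the low bits of the effective address are merged back in, so nothing is lost when the host PAGE_SHIFT is 16 (64K) rather than 12:

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_HPTE_R_RPN  0x3ffffffffffff000ULL  /* illustrative mask */
#define PAGE_SHIFT_64K    16

/* mirrors the fixed computation:
 *   gpa = (r & HPTE_R_RPN & ~(psize - 1)) | (ea & (psize - 1)) */
static uint64_t model_gpa(uint64_t r, uint64_t ea, uint64_t psize)
{
    return (r & MODEL_HPTE_R_RPN & ~(psize - 1)) | (ea & (psize - 1));
}
```

The gfn then falls out as `gpa >> PAGE_SHIFT`, and the same gpa value is reused unchanged on the emulated-MMIO path, which is why the patch hoists it out of the `!memslot` branch.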
[PATCH] iommu: adding missing kvm_iommu_map_pages/kvm_iommu_unmap_pages
The IOMMU API implements group creation/deletion, device binding and IOMMU
map/unmap operations.

The POWERPC implementation uses most of the API except the map/unmap
operations, which are implemented on POWERPC using hypercalls.

However, in order to link a kernel with CONFIG_IOMMU_API enabled, empty
kvm_iommu_map_pages/kvm_iommu_unmap_pages have to be defined, which is what
this patch does.

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
Cc: David Gibson da...@gibson.dropbear.id.au
---
 arch/powerpc/kernel/iommu.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 31c4fdc..7c309fe 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -36,6 +36,7 @@
 #include <linux/hash.h>
 #include <linux/fault-inject.h>
 #include <linux/pci.h>
+#include <linux/kvm_host.h>
 #include <asm/io.h>
 #include <asm/prom.h>
 #include <asm/iommu.h>
@@ -860,3 +861,19 @@ void iommu_free_coherent(struct iommu_table *tbl, size_t size,
 		free_pages((unsigned long)vaddr, get_order(size));
 	}
 }
+
+#ifdef CONFIG_IOMMU_API
+/*
+ * SPAPR TCE API
+ */
+
+/* POWERPC does not use IOMMU API for mapping/unmapping */
+int kvm_iommu_map_pages(struct kvm *kvm, struct kvm_memory_slot *slot)
+{
+	return 0;
+}
+
+void kvm_iommu_unmap_pages(struct kvm *kvm, struct kvm_memory_slot *slot)
+{
+}
+
+#endif /* CONFIG_IOMMU_API */
-- 
1.7.10.4
Re: [PATCH 2/4] powerpc kvm: added multiple TCEs requests support
On 15/02/13 14:24, Paul Mackerras wrote: On Mon, Feb 11, 2013 at 11:12:41PM +1100, a...@ozlabs.ru wrote: +static long emulated_h_put_tce(struct kvmppc_spapr_tce_table *stt, + unsigned long ioba, unsigned long tce) +{ + unsigned long idx = ioba >> SPAPR_TCE_SHIFT; + struct page *page; + u64 *tbl; + + /* udbg_printf("H_PUT_TCE: liobn 0x%lx => stt=%p window_size=0x%x\n", */ + /* liobn, stt, stt->window_size); */ + if (ioba >= stt->window_size) { + pr_err("%s failed on ioba=%lx\n", __func__, ioba); Doesn't this give the guest a way to spam the host logs? And in fact printk in real mode is potentially problematic. I would just leave out this statement. + return H_PARAMETER; + } + + page = stt->pages[idx / TCES_PER_PAGE]; + tbl = (u64 *)page_address(page); I would like to see an explanation of why we are confident that page_address() will work correctly in real mode, across all the combinations of config options that we can have for a ppc64 book3s kernel. It was there before this patch, I just moved it, so I would think it has been explained before :) There is no combination on PPC that enables WANT_PAGE_VIRTUAL. CONFIG_HIGHMEM is supported on PPC32 only, so HASHED_PAGE_VIRTUAL is not enabled on PPC64 either. So this definition is the one used on PPC64: #define page_address(page) lowmem_page_address(page) where lowmem_page_address() is an arithmetic operation on a page struct address: static __always_inline void *lowmem_page_address(const struct page *page) { return __va(PFN_PHYS(page_to_pfn(page))); } PPC32 will use page_address() from mm/highmem.c; I need some lessons about the memory layout in 32bit, but for now I cannot see how it can possibly fail here. 
+ + /* FIXME: Need to validate the TCE itself */ + /* udbg_printf("tce @ %p\n", &tbl[idx % TCES_PER_PAGE]); */ + tbl[idx % TCES_PER_PAGE] = tce; + + return H_SUCCESS; +} + +/* + * Real mode handlers + */ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn, unsigned long ioba, unsigned long tce) { - struct kvm *kvm = vcpu->kvm; struct kvmppc_spapr_tce_table *stt; - /* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */ - /* liobn, ioba, tce); */ + stt = find_tce_table(vcpu, liobn); + /* Didn't find the liobn, put it to userspace */ + if (!stt) + return H_TOO_HARD; + + /* Emulated IO */ + return emulated_h_put_tce(stt, ioba, tce); +} + +long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu, + unsigned long liobn, unsigned long ioba, + unsigned long tce_list, unsigned long npages) +{ + struct kvmppc_spapr_tce_table *stt; + long i, ret = 0; + unsigned long *tces; + + stt = find_tce_table(vcpu, liobn); + /* Didn't find the liobn, put it to userspace */ + if (!stt) + return H_TOO_HARD; - list_for_each_entry(stt, &kvm->arch.spapr_tce_tables, list) { - if (stt->liobn == liobn) { - unsigned long idx = ioba >> SPAPR_TCE_SHIFT; - struct page *page; - u64 *tbl; + tces = (void *) get_real_address(vcpu, tce_list, false, NULL, NULL); + if (!tces) + return H_TOO_HARD; - /* udbg_printf("H_PUT_TCE: liobn 0x%lx => stt=%p window_size=0x%x\n", */ - /* liobn, stt, stt->window_size); */ - if (ioba >= stt->window_size) - return H_PARAMETER; + /* Emulated IO */ + for (i = 0; (i < npages) && !ret; ++i, ioba += IOMMU_PAGE_SIZE) + ret = emulated_h_put_tce(stt, ioba, tces[i]); So, tces is a pointer to somewhere inside a real page. Did we check somewhere that tces[npages-1] is in the same page as tces[0]? If so, I missed it. If we didn't, then we probably should check and do something about it. 
- page = stt->pages[idx / TCES_PER_PAGE]; - tbl = (u64 *)page_address(page); + return ret; +} - /* FIXME: Need to validate the TCE itself */ - /* udbg_printf("tce @ %p\n", &tbl[idx % TCES_PER_PAGE]); */ - tbl[idx % TCES_PER_PAGE] = tce; - return H_SUCCESS; - } - } +long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu, + unsigned long liobn, unsigned long ioba, + unsigned long tce_value, unsigned long npages) +{ + struct kvmppc_spapr_tce_table *stt; + long i, ret = 0; + + stt = find_tce_table(vcpu, liobn); + /* Didn't find the liobn, put it to userspace */ + if (!stt) + return H_TOO_HARD; - /* Didn't find the liobn, punt it to userspace */ - return H_TOO_HARD; + /* Emulated
[PATCH 2/6] KVM: PPC: Add support for multiple-TCE hcalls
This adds real mode handlers for the H_PUT_TCE_INDIRECT and H_STUFF_TCE hypercalls for QEMU emulated devices such as virtio devices or emulated PCI. These calls allow adding multiple entries (up to 512) into the TCE table in one call which saves time on transition to/from real mode. This adds a guest physical to host real address converter and calls the existing H_PUT_TCE handler. The converting function is going to be fully utilized by upcoming VFIO supporting patches. This also implements the KVM_CAP_PPC_MULTITCE capability, so in order to support the functionality of this patch, QEMU needs to query for this capability and set the hcall-multi-tce hypertas property only if the capability is present, otherwise there will be serious performance degradation. Cc: David Gibson da...@gibson.dropbear.id.au Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru Signed-off-by: Paul Mackerras pau...@samba.org --- Documentation/virtual/kvm/api.txt | 15 ++ arch/powerpc/include/asm/kvm_ppc.h | 15 +- arch/powerpc/kvm/book3s_64_vio.c| 114 +++ arch/powerpc/kvm/book3s_64_vio_hv.c | 231 +++ arch/powerpc/kvm/book3s_hv.c| 23 +++ arch/powerpc/kvm/book3s_hv_rmhandlers.S |6 + arch/powerpc/kvm/book3s_pr_papr.c | 37 - arch/powerpc/kvm/powerpc.c |3 + include/uapi/linux/kvm.h|1 + 9 files changed, 413 insertions(+), 32 deletions(-) diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt index a4df553..f621cd6 100644 --- a/Documentation/virtual/kvm/api.txt +++ b/Documentation/virtual/kvm/api.txt @@ -2463,3 +2463,18 @@ For mmu types KVM_MMU_FSL_BOOKE_NOHV and KVM_MMU_FSL_BOOKE_HV: where num_sets is the tlb_sizes[] value divided by the tlb_ways[] value. - The tsize field of mas1 shall be set to 4K on TLB0, even though the hardware ignores this value for TLB0. 
+ + +6.4 KVM_CAP_PPC_MULTITCE + +Architectures: ppc +Parameters: none +Returns: 0 on success; -1 on error + +This capability enables the guest to put/remove multiple TCE entries +per hypercall which significantly accelerates DMA operations for PPC KVM +guests. + +When this capability is enabled, H_PUT_TCE_INDIRECT and H_STUFF_TCE are +expected to occur rather than H_PUT_TCE which supports only one TCE entry +per call. diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h index 99da298..d501246 100644 --- a/arch/powerpc/include/asm/kvm_ppc.h +++ b/arch/powerpc/include/asm/kvm_ppc.h @@ -139,8 +139,19 @@ extern void kvmppc_xics_free(struct kvm *kvm); extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm, struct kvm_create_spapr_tce *args); -extern long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn, -unsigned long ioba, unsigned long tce); +extern struct kvmppc_spapr_tce_table *kvmppc_find_tce_table( + struct kvm_vcpu *vcpu, unsigned long liobn); +extern long kvmppc_emulated_h_put_tce(struct kvmppc_spapr_tce_table *stt, + unsigned long ioba, unsigned long tce); +extern long kvmppc_virtmode_h_put_tce(struct kvm_vcpu *vcpu, + unsigned long liobn, unsigned long ioba, + unsigned long tce); +extern long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu, + unsigned long liobn, unsigned long ioba, + unsigned long tce_list, unsigned long npages); +extern long kvmppc_virtmode_h_stuff_tce(struct kvm_vcpu *vcpu, + unsigned long liobn, unsigned long ioba, + unsigned long tce_value, unsigned long npages); extern long kvm_vm_ioctl_allocate_rma(struct kvm *kvm, struct kvm_allocate_rma *rma); extern struct kvmppc_linear_info *kvm_alloc_rma(void); diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c index 72ffc89..643ac1e 100644 --- a/arch/powerpc/kvm/book3s_64_vio.c +++ b/arch/powerpc/kvm/book3s_64_vio.c @@ -14,6 +14,7 @@ * * Copyright 2010 Paul Mackerras, IBM Corp. 
pau...@au1.ibm.com * Copyright 2011 David Gibson, IBM Corporation d...@au1.ibm.com + * Copyright 2013 Alexey Kardashevskiy, IBM Corporation a...@au1.ibm.com */ #include <linux/types.h> @@ -36,9 +37,14 @@ #include <asm/ppc-opcode.h> #include <asm/kvm_host.h> #include <asm/udbg.h> +#include <asm/iommu.h> #define TCES_PER_PAGE (PAGE_SIZE / sizeof(u64)) +#define ERROR_ADDR (~(unsigned long)0x0) +/* + * TCE tables handlers. + */ static long kvmppc_stt_npages(unsigned long window_size) { return ALIGN((window_size >> SPAPR_TCE_SHIFT) @@ -148,3 +154,111 @@ fail: } return ret; } + +/* + * Virtual mode handling of IOMMU map/unmap. + */ +/* Converts guest physical address into host virtual */ +static unsigned long get_virt_address(struct kvm_vcpu *vcpu, + unsigned long gpa) +{ + unsigned long hva, gfn = gpa >> PAGE_SHIFT
[PATCH 3/6] powerpc: Prepare to support kernel handling of IOMMU map/unmap
The current VFIO-on-POWER implementation supports only user mode driven mapping, i.e. QEMU is sending requests to map/unmap pages. However this approach is really slow, so we want to move that to KVM. Since H_PUT_TCE can be extremely performance sensitive (especially with network adapters where each packet needs to be mapped/unmapped) we chose to implement that as a fast hypercall directly in real mode (processor still in the guest context but MMU off). To be able to do that, we need to provide some facilities to access the struct page count within that real mode environment as things like the sparsemem vmemmap mappings aren't accessible. This adds an API to increment/decrement page counter as get_user_pages API used for user mode mapping does not work in the real mode. CONFIG_SPARSEMEM_VMEMMAP and CONFIG_FLATMEM are supported. Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru Reviewed-by: Paul Mackerras pau...@samba.org Cc: David Gibson da...@gibson.dropbear.id.au Signed-off-by: Paul Mackerras pau...@samba.org --- arch/powerpc/include/asm/pgtable-ppc64.h |4 ++ arch/powerpc/mm/init_64.c| 77 +- 2 files changed, 80 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h b/arch/powerpc/include/asm/pgtable-ppc64.h index 0182c20..4c56ede 100644 --- a/arch/powerpc/include/asm/pgtable-ppc64.h +++ b/arch/powerpc/include/asm/pgtable-ppc64.h @@ -377,6 +377,10 @@ static inline pte_t *find_linux_pte_or_hugepte(pgd_t *pgdir, unsigned long ea, } #endif /* !CONFIG_HUGETLB_PAGE */ +struct page *realmode_pfn_to_page(unsigned long pfn); +int realmode_get_page(struct page *page); +int realmode_put_page(struct page *page); + #endif /* __ASSEMBLY__ */ #endif /* _ASM_POWERPC_PGTABLE_PPC64_H_ */ diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c index 95a4529..838b8ae 100644 --- a/arch/powerpc/mm/init_64.c +++ b/arch/powerpc/mm/init_64.c @@ -297,5 +297,80 @@ int __meminit vmemmap_populate(struct page *start_page, return 0; } -#endif /* 
CONFIG_SPARSEMEM_VMEMMAP */ +/* + * We do not have access to the sparsemem vmemmap, so we fall back to + * walking the list of sparsemem blocks which we already maintain for + * the sake of crashdump. In the long run, we might want to maintain + * a tree if performance of that linear walk becomes a problem. + * + * Any of the realmode_ functions can fail due to: + * 1) As real sparsemem blocks do not lie in RAM continuously (they + * are in virtual address space which is not available in the real mode), + * the requested page struct can be split between blocks so get_page/put_page + * may fail. + * 2) When huge pages are used, the get_page/put_page API will fail + * in real mode as the linked addresses in the page struct are virtual + * too. + * When 1) or 2) takes place, the API returns an error code to cause + * an exit to kernel virtual mode where the operation will be completed. + */ +struct page *realmode_pfn_to_page(unsigned long pfn) +{ + struct vmemmap_backing *vmem_back; + struct page *page; + unsigned long page_size = 1 << mmu_psize_defs[mmu_vmemmap_psize].shift; + unsigned long pg_va = (unsigned long) pfn_to_page(pfn); + + for (vmem_back = vmemmap_list; vmem_back; vmem_back = vmem_back->list) { + if (pg_va < vmem_back->virt_addr) + continue; + + /* Check that page struct is not split between real pages */ + if ((pg_va + sizeof(struct page)) > + (vmem_back->virt_addr + page_size)) + return NULL; + + page = (struct page *) (vmem_back->phys + pg_va - + vmem_back->virt_addr); + return page; + } + + return NULL; +} +EXPORT_SYMBOL_GPL(realmode_pfn_to_page); + +#elif defined(CONFIG_FLATMEM) + +struct page *realmode_pfn_to_page(unsigned long pfn) +{ + struct page *page = pfn_to_page(pfn); + return page; +} +EXPORT_SYMBOL_GPL(realmode_pfn_to_page); + +#endif /* CONFIG_SPARSEMEM_VMEMMAP/CONFIG_FLATMEM */ + +#if defined(CONFIG_SPARSEMEM_VMEMMAP) || defined(CONFIG_FLATMEM) +int realmode_get_page(struct page *page) +{ + if (PageTail(page)) + return -EAGAIN; + + get_page(page); + + 
return 0; +} +EXPORT_SYMBOL_GPL(realmode_get_page); + +int realmode_put_page(struct page *page) +{ + if (PageCompound(page)) + return -EAGAIN; + + put_page(page); + + return 0; +} +EXPORT_SYMBOL_GPL(realmode_put_page); +#endif -- 1.7.10.4
[PATCH 6/6] KVM: PPC: Add hugepage support for IOMMU in-kernel handling
This adds special support for huge pages (16MB). The reference counting cannot be easily done for such pages in real mode (when the MMU is off) so we added a list of huge pages. It is populated in virtual mode and get_page is called just once per huge page. Real mode handlers check if the requested page is huge and in the list; if so, no reference counting is done, otherwise an exit to virtual mode happens. The list is released at KVM exit. At the moment the fastest card available for tests uses up to 9 huge pages so walking through this list is not very expensive. However this can change and we may want to optimize this. Cc: David Gibson da...@gibson.dropbear.id.au Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru Signed-off-by: Paul Mackerras pau...@samba.org --- arch/powerpc/include/asm/kvm_host.h |2 + arch/powerpc/include/asm/kvm_ppc.h | 24 +++ arch/powerpc/kvm/book3s_64_vio.c| 79 ++- arch/powerpc/kvm/book3s_64_vio_hv.c | 47 - 4 files changed, 149 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h index 2b70cbc..b6a047e 100644 --- a/arch/powerpc/include/asm/kvm_host.h +++ b/arch/powerpc/include/asm/kvm_host.h @@ -180,6 +180,8 @@ struct kvmppc_spapr_tce_table { u32 window_size; bool virtmode_only; struct iommu_group *grp;/* used for IOMMU groups */ + struct list_head hugepages; /* used for IOMMU groups */ + spinlock_t hugepages_lock; /* used for IOMMU groups */ struct page *pages[0]; }; diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h index bdfa140..3c95464 100644 --- a/arch/powerpc/include/asm/kvm_ppc.h +++ b/arch/powerpc/include/asm/kvm_ppc.h @@ -154,6 +154,30 @@ extern long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu, extern long kvmppc_virtmode_h_stuff_tce(struct kvm_vcpu *vcpu, unsigned long liobn, unsigned long ioba, unsigned long tce_value, unsigned long npages); + +/* + * The KVM guest can be backed with 16MB pages (qemu switch + * -mem-path 
/var/lib/hugetlbfs/global/pagesize-16MB/). + * In this case, we cannot do page counting from the real mode + * as the compound pages are used - they are linked in a list + * with pointers as virtual addresses which are inaccessible + * in real mode. + * + * The code below keeps a 16MB pages list and uses page struct + * in real mode if it is already locked in RAM and inserted into + * the list or switches to the virtual mode where it can be + * handled in a usual manner. + */ +struct iommu_kvmppc_hugepage { + struct list_head list; + pte_t pte; /* Huge page PTE */ + unsigned long pa; /* Base phys address used as a real TCE */ + struct page *page; /* page struct of the very first subpage */ + unsigned long size; /* Huge page size (always 16MB at the moment) */ +}; +extern struct iommu_kvmppc_hugepage *kvmppc_iommu_hugepage_find( + struct kvmppc_spapr_tce_table *tt, pte_t pte); + extern long kvm_vm_ioctl_allocate_rma(struct kvm *kvm, struct kvm_allocate_rma *rma); extern struct kvmppc_linear_info *kvm_alloc_rma(void); diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c index 98cf949..274458d 100644 --- a/arch/powerpc/kvm/book3s_64_vio.c +++ b/arch/powerpc/kvm/book3s_64_vio.c @@ -54,6 +54,59 @@ static bool kvmppc_tce_virt_only = false; module_param_named(virt_only, kvmppc_tce_virt_only, bool, S_IRUGO | S_IWUSR); MODULE_PARM_DESC(virt_only, Disable realmode handling of IOMMU map/unmap); +#ifdef CONFIG_IOMMU_API +/* + * Adds a new huge page descriptor to the list. 
+ */ +static struct iommu_kvmppc_hugepage *kvmppc_iommu_hugepage_add( + struct kvmppc_spapr_tce_table *tt, + pte_t pte, unsigned long va, unsigned long pg_size) +{ + int ret; + struct iommu_kvmppc_hugepage *hp; + struct page *p; + + va = va & ~(pg_size - 1); + ret = get_user_pages_fast(va, 1, true/*write*/, &p); + if ((ret != 1) || !p) + return NULL; + + hp = kzalloc(sizeof(*hp), GFP_KERNEL); + if (!hp) + return NULL; + + hp->page = p; + hp->pte = pte; + hp->pa = __pa((unsigned long) page_address(hp->page)); + hp->size = pg_size; + + spin_lock(&tt->hugepages_lock); + list_add(&hp->list, &tt->hugepages); + spin_unlock(&tt->hugepages_lock); + + return hp; +} + +static void kvmppc_iommu_hugepages_init(struct kvmppc_spapr_tce_table *tt) +{ + INIT_LIST_HEAD(&tt->hugepages); + spin_lock_init(&tt->hugepages_lock); +} + +static void kvmppc_iommu_hugepages_cleanup(struct kvmppc_spapr_tce_table *tt) +{ + struct iommu_kvmppc_hugepage *hp, *tmp; + + spin_lock(&tt->hugepages_lock); + list_for_each_entry_safe(hp, tmp, &tt->hugepages, list
[PATCH 0/6] KVM: PPC: IOMMU in-kernel handling
This series is supposed to accelerate IOMMU operations in real and virtual mode in the host kernel for the KVM guest. The first user is VFIO; however, this series does not contain any VFIO related code as the connection between VFIO and the new handlers is to be made in QEMU via an ioctl to the KVM fd. Although the series compiles, it does not make sense without the VFIO patches which are posted separately. The "iommu: Add a function to find an iommu group by id" patch has already gone to linux-next (from the iommu tree) but it is not in upstream yet so I am including it here for reference. Alexey Kardashevskiy (6): KVM: PPC: Make lookup_linux_pte public KVM: PPC: Add support for multiple-TCE hcalls powerpc: Prepare to support kernel handling of IOMMU map/unmap iommu: Add a function to find an iommu group by id KVM: PPC: Add support for IOMMU in-kernel handling KVM: PPC: Add hugepage support for IOMMU in-kernel handling Documentation/virtual/kvm/api.txt| 43 +++ arch/powerpc/include/asm/kvm_host.h |4 + arch/powerpc/include/asm/kvm_ppc.h | 44 ++- arch/powerpc/include/asm/pgtable-ppc64.h |4 + arch/powerpc/include/uapi/asm/kvm.h |7 + arch/powerpc/kvm/book3s_64_vio.c | 433 +++- arch/powerpc/kvm/book3s_64_vio_hv.c | 464 -- arch/powerpc/kvm/book3s_hv.c | 23 ++ arch/powerpc/kvm/book3s_hv_rm_mmu.c |5 +- arch/powerpc/kvm/book3s_hv_rmhandlers.S |6 + arch/powerpc/kvm/book3s_pr_papr.c| 37 ++- arch/powerpc/kvm/powerpc.c | 15 + arch/powerpc/mm/init_64.c| 77 - drivers/iommu/iommu.c| 29 ++ include/linux/iommu.h|1 + include/uapi/linux/kvm.h |3 + 16 files changed, 1159 insertions(+), 36 deletions(-) -- 1.7.10.4
[PATCH 5/6] KVM: PPC: Add support for IOMMU in-kernel handling
This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT and H_STUFF_TCE requests without passing them to QEMU, which should save time on switching to QEMU and back. Both real and virtual modes are supported - whenever the kernel fails to handle a TCE request, it passes it to the virtual mode. If the virtual mode handlers fail too, the request is passed to user mode, for example, to QEMU. This adds a new KVM_CAP_SPAPR_TCE_IOMMU ioctl to associate a virtual PCI bus ID (LIOBN) with an IOMMU group, which enables in-kernel handling of IOMMU map/unmap. This adds a special case for huge pages (16MB). The reference counting cannot be easily done for such pages in real mode (when the MMU is off) so we added a list of huge pages. It is populated in virtual mode and get_page is called just once per huge page. Real mode handlers check if the requested page is huge and in the list; if so, no reference counting is done, otherwise an exit to virtual mode happens. The list is released at KVM exit. At the moment the fastest card available for tests uses up to 9 huge pages so walking through this list is not very expensive. However this can change and we may want to optimize this. This also adds the virt_only parameter to the KVM module for debug and performance check purposes. Tests show that this patch increases transmission speed from 220MB/s to 750..1020MB/s on a 10Gb network (Chelsio CXGB3 10Gb ethernet card). 
Cc: David Gibson da...@gibson.dropbear.id.au Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru Signed-off-by: Paul Mackerras pau...@samba.org --- Documentation/virtual/kvm/api.txt | 28 arch/powerpc/include/asm/kvm_host.h |2 + arch/powerpc/include/asm/kvm_ppc.h |2 + arch/powerpc/include/uapi/asm/kvm.h |7 + arch/powerpc/kvm/book3s_64_vio.c| 242 ++- arch/powerpc/kvm/book3s_64_vio_hv.c | 192 +++ arch/powerpc/kvm/powerpc.c | 12 ++ include/uapi/linux/kvm.h|2 + 8 files changed, 485 insertions(+), 2 deletions(-) diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt index f621cd6..2039767 100644 --- a/Documentation/virtual/kvm/api.txt +++ b/Documentation/virtual/kvm/api.txt @@ -2127,6 +2127,34 @@ written, then `n_invalid' invalid entries, invalidating any previously valid entries found. +4.79 KVM_CREATE_SPAPR_TCE_IOMMU + +Capability: KVM_CAP_SPAPR_TCE_IOMMU +Architectures: powerpc +Type: vm ioctl +Parameters: struct kvm_create_spapr_tce_iommu (in) +Returns: 0 on success, -1 on error + +This creates a link between IOMMU group and a hardware TCE (translation +control entry) table. This link lets the host kernel know what IOMMU +group (i.e. TCE table) to use for the LIOBN number passed with +H_PUT_TCE, H_PUT_TCE_INDIRECT, H_STUFF_TCE hypercalls. + +/* for KVM_CAP_SPAPR_TCE_IOMMU */ +struct kvm_create_spapr_tce_iommu { + __u64 liobn; + __u32 iommu_id; + __u32 flags; +}; + +No flag is supported at the moment. + +When the guest issues TCE call on a liobn for which a TCE table has been +registered, the kernel will handle it in real mode, updating the hardware +TCE table. TCE table calls for other liobns will cause a vm exit and must +be handled by userspace. + + 5. 
The kvm_run structure diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h index 36ceb0d..2b70cbc 100644 --- a/arch/powerpc/include/asm/kvm_host.h +++ b/arch/powerpc/include/asm/kvm_host.h @@ -178,6 +178,8 @@ struct kvmppc_spapr_tce_table { struct kvm *kvm; u64 liobn; u32 window_size; + bool virtmode_only; + struct iommu_group *grp;/* used for IOMMU groups */ struct page *pages[0]; }; diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h index d501246..bdfa140 100644 --- a/arch/powerpc/include/asm/kvm_ppc.h +++ b/arch/powerpc/include/asm/kvm_ppc.h @@ -139,6 +139,8 @@ extern void kvmppc_xics_free(struct kvm *kvm); extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm, struct kvm_create_spapr_tce *args); +extern long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm, + struct kvm_create_spapr_tce_iommu *args); extern struct kvmppc_spapr_tce_table *kvmppc_find_tce_table( struct kvm_vcpu *vcpu, unsigned long liobn); extern long kvmppc_emulated_h_put_tce(struct kvmppc_spapr_tce_table *stt, diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h index 681b314..b67d44b 100644 --- a/arch/powerpc/include/uapi/asm/kvm.h +++ b/arch/powerpc/include/uapi/asm/kvm.h @@ -291,6 +291,13 @@ struct kvm_create_spapr_tce { __u32 window_size; }; +/* for KVM_CAP_SPAPR_TCE_IOMMU */ +struct kvm_create_spapr_tce_iommu { + __u64 liobn; + __u32 iommu_id; + __u32 flags; +}; + /* for KVM_ALLOCATE_RMA
[PATCH 4/6] iommu: Add a function to find an iommu group by id
As IOMMU groups are exposed to the user space by their numbers, the user space can use them in various kernel APIs so the kernel might need an API to find a group by its ID. As an example, QEMU VFIO on PPC64 platform needs it to associate a logical bus number (LIOBN) with a specific IOMMU group in order to support in-kernel handling of DMA map/unmap requests. This adds the iommu_group_get_by_id(id) function which performs this search. Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru Signed-off-by: Paul Mackerras pau...@samba.org --- drivers/iommu/iommu.c | 29 + include/linux/iommu.h |1 + 2 files changed, 30 insertions(+) diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c index ddbdaca..5514dfa 100644 --- a/drivers/iommu/iommu.c +++ b/drivers/iommu/iommu.c @@ -204,6 +204,35 @@ again: } EXPORT_SYMBOL_GPL(iommu_group_alloc); +struct iommu_group *iommu_group_get_by_id(int id) +{ + struct kobject *group_kobj; + struct iommu_group *group; + const char *name; + + if (!iommu_group_kset) + return NULL; + + name = kasprintf(GFP_KERNEL, "%d", id); + if (!name) + return NULL; + + group_kobj = kset_find_obj(iommu_group_kset, name); + kfree(name); + + if (!group_kobj) + return NULL; + + group = container_of(group_kobj, struct iommu_group, kobj); + BUG_ON(group->id != id); + + kobject_get(group->devices_kobj); + kobject_put(&group->kobj); + + return group; +} +EXPORT_SYMBOL_GPL(iommu_group_get_by_id); + /** * iommu_group_get_iommudata - retrieve iommu_data registered for a group * @group: the group diff --git a/include/linux/iommu.h b/include/linux/iommu.h index f3b99e1..00e5d7d 100644 --- a/include/linux/iommu.h +++ b/include/linux/iommu.h @@ -113,6 +113,7 @@ struct iommu_ops { extern int bus_set_iommu(struct bus_type *bus, struct iommu_ops *ops); extern bool iommu_present(struct bus_type *bus); extern struct iommu_domain *iommu_domain_alloc(struct bus_type *bus); +extern struct iommu_group *iommu_group_get_by_id(int id); extern void iommu_domain_free(struct iommu_domain 
*domain); extern int iommu_attach_device(struct iommu_domain *domain, struct device *dev); -- 1.7.10.4
[PATCH 1/6] KVM: PPC: Make lookup_linux_pte public
The lookup_linux_pte() function returns a linux PTE which is needed in the process of converting KVM guest physical address into host real address in real mode. This conversion will be used by upcoming support of H_PUT_TCE_INDIRECT, as the TCE list address comes from the guest and is a guest physical address. This makes lookup_linux_pte() public so that code can call it. Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru Cc: David Gibson da...@gibson.dropbear.id.au Signed-off-by: Paul Mackerras pau...@samba.org --- arch/powerpc/include/asm/kvm_ppc.h |3 +++ arch/powerpc/kvm/book3s_hv_rm_mmu.c |5 +++-- 2 files changed, 6 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h index 41426c9..99da298 100644 --- a/arch/powerpc/include/asm/kvm_ppc.h +++ b/arch/powerpc/include/asm/kvm_ppc.h @@ -379,4 +379,7 @@ static inline ulong kvmppc_get_ea_indexed(struct kvm_vcpu *vcpu, int ra, int rb) return ea; } +pte_t lookup_linux_pte(pgd_t *pgdir, unsigned long hva, + int writing, unsigned long *pte_sizep); + #endif /* __POWERPC_KVM_PPC_H__ */ diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c index 6dcbb49..18fc382 100644 --- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c +++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c @@ -134,8 +134,8 @@ static void remove_revmap_chain(struct kvm *kvm, long pte_index, unlock_rmap(rmap); } -static pte_t lookup_linux_pte(pgd_t *pgdir, unsigned long hva, - int writing, unsigned long *pte_sizep) +pte_t lookup_linux_pte(pgd_t *pgdir, unsigned long hva, + int writing, unsigned long *pte_sizep) { pte_t *ptep; unsigned long ps = *pte_sizep; @@ -154,6 +154,7 @@ static pte_t lookup_linux_pte(pgd_t *pgdir, unsigned long hva, return __pte(0); return kvmppc_read_update_linux_pte(ptep, writing); } +EXPORT_SYMBOL_GPL(lookup_linux_pte); static inline void unlock_hpte(unsigned long *hpte, unsigned long hpte_v) { -- 1.7.10.4
Re: [PATCH 2/5] KVM: PPC: iommu: Add missing kvm_iommu_map_pages/kvm_iommu_unmap_pages
On 05/07/2013 07:07 AM, Alex Williamson wrote: On Mon, 2013-05-06 at 17:21 +1000, a...@ozlabs.ru wrote: From: Alexey Kardashevskiy a...@ozlabs.ru The IOMMU API implements group creation/deletion, device binding and IOMMU map/unmap operations. The PowerPC implementation uses most of the API except map/unmap operations, which are implemented on POWER using hypercalls. However, in order to link a kernel with CONFIG_IOMMU_API enabled, the empty kvm_iommu_map_pages/kvm_iommu_unmap_pages have to be defined, so this defines them. Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru Cc: David Gibson da...@gibson.dropbear.id.au Signed-off-by: Paul Mackerras pau...@samba.org --- arch/powerpc/include/asm/kvm_host.h | 14 ++ 1 file changed, 14 insertions(+) diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h index b6a047e..c025d91 100644 --- a/arch/powerpc/include/asm/kvm_host.h +++ b/arch/powerpc/include/asm/kvm_host.h @@ -603,4 +603,18 @@ struct kvm_vcpu_arch { #define __KVM_HAVE_ARCH_WQP +#ifdef CONFIG_IOMMU_API +/* POWERPC does not use IOMMU API for mapping/unmapping */ +static inline int kvm_iommu_map_pages(struct kvm *kvm, +struct kvm_memory_slot *slot) +{ +return 0; +} + +static inline void kvm_iommu_unmap_pages(struct kvm *kvm, +struct kvm_memory_slot *slot) +{ +} +#endif /* CONFIG_IOMMU_API */ + #endif /* __POWERPC_KVM_HOST_H__ */ This is no longer needed; Gleb applied my patch for 3.10 that makes all of KVM device assignment dependent on a build config option, and the top level kvm_host.h now includes this when that is not set. Thanks, Cannot find it. Could you please point me to where it is on github or git.kernel.org? Thanks. -- Alexey
Re: [PATCH 2/5] KVM: PPC: iommu: Add missing kvm_iommu_map_pages/kvm_iommu_unmap_pages
On 05/07/2013 11:42 AM, Alex Williamson wrote: On Tue, 2013-05-07 at 10:49 +1000, Alexey Kardashevskiy wrote: On 05/07/2013 07:07 AM, Alex Williamson wrote: On Mon, 2013-05-06 at 17:21 +1000, a...@ozlabs.ru wrote: From: Alexey Kardashevskiy a...@ozlabs.ru The IOMMU API implements group creation/deletion, device binding and IOMMU map/unmap operations. The PowerPC implementation uses most of the API except map/unmap operations, which are implemented on POWER using hypercalls. However, in order to link a kernel with CONFIG_IOMMU_API enabled, the empty kvm_iommu_map_pages/kvm_iommu_unmap_pages have to be defined, so this defines them. Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru Cc: David Gibson da...@gibson.dropbear.id.au Signed-off-by: Paul Mackerras pau...@samba.org --- arch/powerpc/include/asm/kvm_host.h | 14 ++ 1 file changed, 14 insertions(+) diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h index b6a047e..c025d91 100644 --- a/arch/powerpc/include/asm/kvm_host.h +++ b/arch/powerpc/include/asm/kvm_host.h @@ -603,4 +603,18 @@ struct kvm_vcpu_arch { #define __KVM_HAVE_ARCH_WQP +#ifdef CONFIG_IOMMU_API +/* POWERPC does not use IOMMU API for mapping/unmapping */ +static inline int kvm_iommu_map_pages(struct kvm *kvm, + struct kvm_memory_slot *slot) +{ + return 0; +} + +static inline void kvm_iommu_unmap_pages(struct kvm *kvm, + struct kvm_memory_slot *slot) +{ +} +#endif /* CONFIG_IOMMU_API */ + #endif /* __POWERPC_KVM_HOST_H__ */ This is no longer needed; Gleb applied my patch for 3.10 that makes all of KVM device assignment dependent on a build config option, and the top level kvm_host.h now includes this when that is not set. Thanks, Cannot find it. Could you please point me to where it is on github or git.kernel.org? Thanks. http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=2a5bab1004729f3302c776e53ee7c895b98bb1ce Yes, I confirm, this patch is not needed any more. Thanks! 
-- Alexey -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 5/6] KVM: PPC: Add support for IOMMU in-kernel handling
On 05/07/2013 04:02 PM, David Gibson wrote: On Tue, May 07, 2013 at 03:51:31PM +1000, Alexey Kardashevskiy wrote: On 05/07/2013 03:29 PM, David Gibson wrote: On Mon, May 06, 2013 at 05:25:56PM +1000, Alexey Kardashevskiy wrote: This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT and H_STUFF_TCE requests without passing them to QEMU, which should save time on switching to QEMU and back. Both real and virtual modes are supported - whenever the kernel fails to handle a TCE request, it passes it to the virtual mode. If the virtual mode handlers fail, then the request is passed to the user mode, for example, to QEMU. This adds a new KVM_CAP_SPAPR_TCE_IOMMU ioctl to associate a virtual PCI bus ID (LIOBN) with an IOMMU group, which enables in-kernel handling of IOMMU map/unmap. This adds a special case for huge pages (16MB). The reference counting cannot be easily done for such pages in real mode (when the MMU is off) so we added a list of huge pages. It is populated in virtual mode and get_page is called just once per huge page. Real mode handlers check if the requested page is huge and in the list; if so, no reference counting is done, otherwise an exit to virtual mode happens. The list is released at KVM exit. At the moment the fastest card available for tests uses up to 9 huge pages so walking through this list is not very expensive. However this can change and we may want to optimize this. This also adds the virt_only parameter to the KVM module for debug and performance check purposes. Tests show that this patch increases transmission speed from 220MB/s to 750..1020MB/s on a 10Gb network (Chelsio CXGB3 10Gb ethernet card). 
Cc: David Gibson da...@gibson.dropbear.id.au Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru Signed-off-by: Paul Mackerras pau...@samba.org --- Documentation/virtual/kvm/api.txt | 28 arch/powerpc/include/asm/kvm_host.h |2 + arch/powerpc/include/asm/kvm_ppc.h |2 + arch/powerpc/include/uapi/asm/kvm.h |7 + arch/powerpc/kvm/book3s_64_vio.c| 242 ++- arch/powerpc/kvm/book3s_64_vio_hv.c | 192 +++ arch/powerpc/kvm/powerpc.c | 12 ++ include/uapi/linux/kvm.h|2 + 8 files changed, 485 insertions(+), 2 deletions(-) diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt index f621cd6..2039767 100644 --- a/Documentation/virtual/kvm/api.txt +++ b/Documentation/virtual/kvm/api.txt @@ -2127,6 +2127,34 @@ written, then `n_invalid' invalid entries, invalidating any previously valid entries found. +4.79 KVM_CREATE_SPAPR_TCE_IOMMU + +Capability: KVM_CAP_SPAPR_TCE_IOMMU +Architectures: powerpc +Type: vm ioctl +Parameters: struct kvm_create_spapr_tce_iommu (in) +Returns: 0 on success, -1 on error + +This creates a link between an IOMMU group and a hardware TCE (translation +control entry) table. This link lets the host kernel know what IOMMU +group (i.e. TCE table) to use for the LIOBN number passed with +H_PUT_TCE, H_PUT_TCE_INDIRECT, H_STUFF_TCE hypercalls. + +/* for KVM_CAP_SPAPR_TCE_IOMMU */ +struct kvm_create_spapr_tce_iommu { + __u64 liobn; + __u32 iommu_id; Wouldn't it be more in keeping... Pardon? Sorry, I was going to suggest a change, but then realised it wasn't actually any better than what you have now. + __u32 flags; +}; + +No flags are supported at the moment. + +When the guest issues a TCE call on a liobn for which a TCE table has been +registered, the kernel will handle it in real mode, updating the hardware +TCE table. TCE table calls for other liobns will cause a vm exit and must +be handled by userspace. + + 5. 
The kvm_run structure diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h index 36ceb0d..2b70cbc 100644 --- a/arch/powerpc/include/asm/kvm_host.h +++ b/arch/powerpc/include/asm/kvm_host.h @@ -178,6 +178,8 @@ struct kvmppc_spapr_tce_table { struct kvm *kvm; u64 liobn; u32 window_size; + bool virtmode_only; I see this is now initialized from the global parameter, but I think it would be better to just check the global (debug) parameter directly, rather than duplicating it here. The global parameter is in kvm.ko and the struct above is in the real mode part which cannot go to the module. Ah, ok. I'm half inclined to just drop the virtmode_only thing entirely. + struct iommu_group *grp;/* used for IOMMU groups */ struct page *pages[0]; }; diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h index d501246..bdfa140 100644 --- a/arch/powerpc/include/asm/kvm_ppc.h +++ b/arch/powerpc/include/asm/kvm_ppc.h @@ -139,6 +139,8 @@ extern void kvmppc_xics_free(struct kvm *kvm); extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm, struct
Re: [PATCH 2/6] KVM: PPC: Add support for multiple-TCE hcalls
On 05/10/2013 04:51 PM, David Gibson wrote: On Mon, May 06, 2013 at 05:25:53PM +1000, Alexey Kardashevskiy wrote: This adds real mode handlers for the H_PUT_TCE_INDIRECT and H_STUFF_TCE hypercalls for QEMU emulated devices such as virtio devices or emulated PCI. These calls allow adding multiple entries (up to 512) into the TCE table in one call which saves time on transition to/from real mode. This adds a guest physical to host real address converter and calls the existing H_PUT_TCE handler. The converting function is going to be fully utilized by upcoming VFIO supporting patches. This also implements the KVM_CAP_PPC_MULTITCE capability, so in order to support the functionality of this patch, QEMU needs to query for this capability and set the hcall-multi-tce hypertas property only if the capability is present, otherwise there will be serious performance degradation. Hrm. Clearly I didn't read this carefully enough before. There are some problems here. ? [snip] diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c index 72ffc89..643ac1e 100644 --- a/arch/powerpc/kvm/book3s_64_vio.c +++ b/arch/powerpc/kvm/book3s_64_vio.c @@ -14,6 +14,7 @@ * * Copyright 2010 Paul Mackerras, IBM Corp. pau...@au1.ibm.com * Copyright 2011 David Gibson, IBM Corporation d...@au1.ibm.com + * Copyright 2013 Alexey Kardashevskiy, IBM Corporation a...@au1.ibm.com */ #include <linux/types.h> @@ -36,9 +37,14 @@ #include <asm/ppc-opcode.h> #include <asm/kvm_host.h> #include <asm/udbg.h> +#include <asm/iommu.h> #define TCES_PER_PAGE (PAGE_SIZE / sizeof(u64)) +#define ERROR_ADDR (~(unsigned long)0x0) +/* + * TCE tables handlers. + */ static long kvmppc_stt_npages(unsigned long window_size) { return ALIGN((window_size >> SPAPR_TCE_SHIFT) @@ -148,3 +154,111 @@ fail: } return ret; } + +/* + * Virtual mode handling of IOMMU map/unmap. 
+ */ +/* Converts guest physical address into host virtual */ +static unsigned long get_virt_address(struct kvm_vcpu *vcpu, +unsigned long gpa) This should probably return a void * rather than an unsigned long. Well, actually a void __user *. +{ +unsigned long hva, gfn = gpa >> PAGE_SHIFT; +struct kvm_memory_slot *memslot; + +memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn); +if (!memslot) +return ERROR_ADDR; + +/* + * Convert gfn to hva preserving flags and an offset + * within a system page + */ +hva = __gfn_to_hva_memslot(memslot, gfn) + (gpa & ~PAGE_MASK); +return hva; +} + +long kvmppc_virtmode_h_put_tce(struct kvm_vcpu *vcpu, +unsigned long liobn, unsigned long ioba, +unsigned long tce) +{ +struct kvmppc_spapr_tce_table *tt; + +tt = kvmppc_find_tce_table(vcpu, liobn); +/* Didn't find the liobn, put it to userspace */ +if (!tt) +return H_TOO_HARD; + +/* Emulated IO */ +return kvmppc_emulated_h_put_tce(tt, ioba, tce); +} + +long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu, +unsigned long liobn, unsigned long ioba, +unsigned long tce_list, unsigned long npages) +{ +struct kvmppc_spapr_tce_table *tt; +long i; +unsigned long tces; + +/* The whole table addressed by tce_list resides in 4K page */ +if (npages > 512) +return H_PARAMETER; So, that doesn't actually verify what the comment says it does - only that the list is 4K in total. You need to check the alignment of tce_list as well. The spec says to return H_PARAMETER if > 512. I.e. it takes just 1 page and I do not need to bother if pages may not lie contiguously in RAM (matters for real mode). /* * As the spec says the maximum possible number of TCEs is 512, * the whole TCE page is no more than 4K. Therefore we do not have to * worry if pages do not lie contiguously in RAM */ Any better?... 
+ +tt = kvmppc_find_tce_table(vcpu, liobn); +/* Didn't find the liobn, put it to userspace */ +if (!tt) +return H_TOO_HARD; + +tces = get_virt_address(vcpu, tce_list); +if (tces == ERROR_ADDR) +return H_TOO_HARD; + +/* Emulated IO */ This comment doesn't seem to have any bearing on the test which follows it. +if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size) +return H_PARAMETER; + +for (i = 0; i < npages; ++i) { +unsigned long tce; +unsigned long ptce = tces + i * sizeof(unsigned long); + +if (get_user(tce, (unsigned long __user *)ptce)) +break; + +if (kvmppc_emulated_h_put_tce(tt, +ioba + (i << IOMMU_PAGE_SHIFT), tce)) +break; +} +if (i == npages) +return H_SUCCESS; + +/* Failed, do cleanup */ +do { +--i
[PATCH 1/4] KVM: PPC: Add support for multiple-TCE hcalls
This adds real mode handlers for the H_PUT_TCE_INDIRECT and H_STUFF_TCE hypercalls for QEMU emulated devices such as virtio devices or emulated PCI. These calls allow adding multiple entries (up to 512) into the TCE table in one call which saves time on transition to/from real mode. This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs (copied from user and verified) before writing the whole list into the TCE table. This cache will be utilized more in the upcoming VFIO/IOMMU support to continue TCE list processing in virtual mode in case the real mode handler fails for some reason. This adds a guest physical to host real address converter and calls the existing H_PUT_TCE handler. The converting function is going to be fully utilized by upcoming VFIO supporting patches. This also implements the KVM_CAP_PPC_MULTITCE capability, so in order to support the functionality of this patch, QEMU needs to query for this capability and set the hcall-multi-tce hypertas property only if the capability is present, otherwise there will be serious performance degradation. 
Cc: David Gibson da...@gibson.dropbear.id.au Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru Signed-off-by: Paul Mackerras pau...@samba.org --- Changelog: * added kvm_vcpu_arch::tce_tmp * removed cleanup if put_indirect failed, instead we do not even start writing to TCE table if we cannot get TCEs from the user and they are invalid * kvmppc_emulated_h_put_tce is split into kvmppc_emulated_put_tce and kvmppc_emulated_validate_tce (for the previous item) * fixed bug with fallthrough for H_IPI * removed all get_user() from real mode handlers * kvmppc_lookup_pte() added (instead of making lookup_linux_pte public) --- Documentation/virtual/kvm/api.txt | 14 ++ arch/powerpc/include/asm/kvm_host.h |2 + arch/powerpc/include/asm/kvm_ppc.h | 16 +- arch/powerpc/kvm/book3s_64_vio.c| 118 ++ arch/powerpc/kvm/book3s_64_vio_hv.c | 266 +++ arch/powerpc/kvm/book3s_hv.c| 39 + arch/powerpc/kvm/book3s_hv_rmhandlers.S |6 + arch/powerpc/kvm/book3s_pr_papr.c | 37 - arch/powerpc/kvm/powerpc.c |3 + include/uapi/linux/kvm.h|1 + 10 files changed, 470 insertions(+), 32 deletions(-) diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt index 5f91eda..3c7c7ea 100644 --- a/Documentation/virtual/kvm/api.txt +++ b/Documentation/virtual/kvm/api.txt @@ -2780,3 +2780,17 @@ Parameters: args[0] is the XICS device fd args[1] is the XICS CPU number (server ID) for this vcpu This capability connects the vcpu to an in-kernel XICS device. + +6.8 KVM_CAP_PPC_MULTITCE + +Architectures: ppc +Parameters: none +Returns: 0 on success; -1 on error + +This capability enables the guest to put/remove multiple TCE entries +per hypercall which significantly accelerates DMA operations for PPC KVM +guests. + +When this capability is enabled, H_PUT_TCE_INDIRECT and H_STUFF_TCE are +expected to occur rather than H_PUT_TCE which supports only one TCE entry +per call. 
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h index af326cd..85d8f26 100644 --- a/arch/powerpc/include/asm/kvm_host.h +++ b/arch/powerpc/include/asm/kvm_host.h @@ -609,6 +609,8 @@ struct kvm_vcpu_arch { spinlock_t tbacct_lock; u64 busy_stolen; u64 busy_preempt; + + unsigned long *tce_tmp;/* TCE cache for H_PUT_TCE_INDIRECT hcall */ #endif }; diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h index a5287fe..e852921b 100644 --- a/arch/powerpc/include/asm/kvm_ppc.h +++ b/arch/powerpc/include/asm/kvm_ppc.h @@ -133,8 +133,20 @@ extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu); extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm, struct kvm_create_spapr_tce *args); -extern long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn, -unsigned long ioba, unsigned long tce); +extern struct kvmppc_spapr_tce_table *kvmppc_find_tce_table( + struct kvm_vcpu *vcpu, unsigned long liobn); +extern long kvmppc_emulated_validate_tce(unsigned long tce); +extern void kvmppc_emulated_put_tce(struct kvmppc_spapr_tce_table *tt, + unsigned long ioba, unsigned long tce); +extern long kvmppc_virtmode_h_put_tce(struct kvm_vcpu *vcpu, + unsigned long liobn, unsigned long ioba, + unsigned long tce); +extern long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu, + unsigned long liobn, unsigned long ioba, + unsigned long tce_list, unsigned long npages); +extern long kvmppc_virtmode_h_stuff_tce(struct kvm_vcpu *vcpu, + unsigned long liobn, unsigned long ioba, + unsigned long tce_value, unsigned long npages); extern long kvm_vm_ioctl_allocate_rma
[PATCH 2/4] powerpc: Prepare to support kernel handling of IOMMU map/unmap
The current VFIO-on-POWER implementation supports only user mode driven mapping, i.e. QEMU is sending requests to map/unmap pages. However this approach is really slow, so we want to move that to KVM. Since H_PUT_TCE can be extremely performance sensitive (especially with network adapters where each packet needs to be mapped/unmapped) we chose to implement that as a fast hypercall directly in real mode (processor still in the guest context but MMU off). To be able to do that, we need to provide some facilities to access the struct page count within that real mode environment as things like the sparsemem vmemmap mappings aren't accessible. This adds an API to increment/decrement the page counter, as the get_user_pages API used for user mode mapping does not work in real mode. CONFIG_SPARSEMEM_VMEMMAP and CONFIG_FLATMEM are supported. Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru Reviewed-by: Paul Mackerras pau...@samba.org Cc: David Gibson da...@gibson.dropbear.id.au Signed-off-by: Paul Mackerras pau...@samba.org --- Changes: 2013-05-20: * PageTail() is replaced by PageCompound() in order to have the same checks for whether the page is huge in realmode_get_page() and realmode_put_page() --- arch/powerpc/include/asm/pgtable-ppc64.h |4 ++ arch/powerpc/mm/init_64.c| 77 +- 2 files changed, 80 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h b/arch/powerpc/include/asm/pgtable-ppc64.h index e3d55f6f..7b46e5f 100644 --- a/arch/powerpc/include/asm/pgtable-ppc64.h +++ b/arch/powerpc/include/asm/pgtable-ppc64.h @@ -376,6 +376,10 @@ static inline pte_t *find_linux_pte_or_hugepte(pgd_t *pgdir, unsigned long ea, } #endif /* !CONFIG_HUGETLB_PAGE */ +struct page *realmode_pfn_to_page(unsigned long pfn); +int realmode_get_page(struct page *page); +int realmode_put_page(struct page *page); + #endif /* __ASSEMBLY__ */ #endif /* _ASM_POWERPC_PGTABLE_PPC64_H_ */ diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c index c2787bf..ba6cf9b 
100644 --- a/arch/powerpc/mm/init_64.c +++ b/arch/powerpc/mm/init_64.c @@ -296,5 +296,80 @@ void vmemmap_free(unsigned long start, unsigned long end) { } -#endif /* CONFIG_SPARSEMEM_VMEMMAP */ +/* + * We do not have access to the sparsemem vmemmap, so we fall back to + * walking the list of sparsemem blocks which we already maintain for + * the sake of crashdump. In the long run, we might want to maintain + * a tree if performance of that linear walk becomes a problem. + * + * Any of the realmode_ functions can fail due to: + * 1) As real sparsemem blocks do not lie in RAM contiguously (they + * are in virtual address space which is not available in real mode), + * the requested page struct can be split between blocks so get_page/put_page + * may fail. + * 2) When huge pages are used, the get_page/put_page API will fail + * in real mode as the linked addresses in the page struct are virtual + * too. + * When 1) or 2) takes place, the API returns an error code to cause + * an exit to kernel virtual mode where the operation will be completed. 
+ */ +struct page *realmode_pfn_to_page(unsigned long pfn) +{ + struct vmemmap_backing *vmem_back; + struct page *page; + unsigned long page_size = 1 << mmu_psize_defs[mmu_vmemmap_psize].shift; + unsigned long pg_va = (unsigned long) pfn_to_page(pfn); + + for (vmem_back = vmemmap_list; vmem_back; vmem_back = vmem_back->list) { + if (pg_va < vmem_back->virt_addr) + continue; + /* Check that page struct is not split between real pages */ + if ((pg_va + sizeof(struct page)) > + (vmem_back->virt_addr + page_size)) + return NULL; + + page = (struct page *) (vmem_back->phys + pg_va - + vmem_back->virt_addr); + return page; + } + + return NULL; +} +EXPORT_SYMBOL_GPL(realmode_pfn_to_page); + +#elif defined(CONFIG_FLATMEM) + +struct page *realmode_pfn_to_page(unsigned long pfn) +{ + struct page *page = pfn_to_page(pfn); + return page; +} +EXPORT_SYMBOL_GPL(realmode_pfn_to_page); + +#endif /* CONFIG_SPARSEMEM_VMEMMAP/CONFIG_FLATMEM */ + +#if defined(CONFIG_SPARSEMEM_VMEMMAP) || defined(CONFIG_FLATMEM) +int realmode_get_page(struct page *page) +{ + if (PageCompound(page)) + return -EAGAIN; + + get_page(page); + + return 0; +} +EXPORT_SYMBOL_GPL(realmode_get_page); + +int realmode_put_page(struct page *page) +{ + if (PageCompound(page)) + return -EAGAIN; + + if (!atomic_add_unless(&page->_count, -1, 1)) + return -EAGAIN; + + return 0; +} +EXPORT_SYMBOL_GPL(realmode_put_page); +#endif -- 1.7.10.4 
[PATCH 0/4 v2] KVM: PPC: IOMMU in-kernel handling
This accelerates IOMMU operations in real and virtual mode in the host kernel for the KVM guest. The first patch with multitce support is useful for emulated devices as is. The other patches are designed for VFIO although this series does not contain any VFIO related code as the connection between VFIO and the new handlers is to be made in QEMU via ioctl to the KVM fd. The series was made and tested against v3.10-rc1. Alexey Kardashevskiy (4): KVM: PPC: Add support for multiple-TCE hcalls powerpc: Prepare to support kernel handling of IOMMU map/unmap KVM: PPC: Add support for IOMMU in-kernel handling KVM: PPC: Add hugepage support for IOMMU in-kernel handling Documentation/virtual/kvm/api.txt| 42 +++ arch/powerpc/include/asm/kvm_host.h |7 + arch/powerpc/include/asm/kvm_ppc.h | 40 ++- arch/powerpc/include/asm/pgtable-ppc64.h |4 + arch/powerpc/include/uapi/asm/kvm.h |7 + arch/powerpc/kvm/book3s_64_vio.c | 398 - arch/powerpc/kvm/book3s_64_vio_hv.c | 471 -- arch/powerpc/kvm/book3s_hv.c | 39 +++ arch/powerpc/kvm/book3s_hv_rmhandlers.S |6 + arch/powerpc/kvm/book3s_pr_papr.c| 37 ++- arch/powerpc/kvm/powerpc.c | 15 + arch/powerpc/mm/init_64.c| 77 - include/uapi/linux/kvm.h |5 + 13 files changed, 1120 insertions(+), 28 deletions(-) -- 1.7.10.4 
[PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT and H_STUFF_TCE requests without passing them to QEMU, which should save time on switching to QEMU and back. Both real and virtual modes are supported - whenever the kernel fails to handle a TCE request, it passes it to the virtual mode. If the virtual mode handlers fail, then the request is passed to the user mode, for example, to QEMU. This adds a new KVM_CAP_SPAPR_TCE_IOMMU ioctl to associate a virtual PCI bus ID (LIOBN) with an IOMMU group, which enables in-kernel handling of IOMMU map/unmap. Tests show that this patch increases transmission speed from 220MB/s to 750..1020MB/s on a 10Gb network (Chelsio CXGB3 10Gb ethernet card). Cc: David Gibson da...@gibson.dropbear.id.au Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru Signed-off-by: Paul Mackerras pau...@samba.org --- Changes: 2013-05-20: * removed get_user() from real mode handlers * kvm_vcpu_arch::tce_tmp usage extended. Now the real mode handler puts translated TCEs there, tries realmode_get_page() on those, and if it fails, passes control to the virtual mode handler which tries to finish the request handling * kvmppc_lookup_pte() now does realmode_get_page() protected by BUSY bit on a page * The only reason to pass the request to user mode now is when the user mode did not register a TCE table in the kernel; in all other cases the virtual mode handler is expected to do the job --- Documentation/virtual/kvm/api.txt | 28 + arch/powerpc/include/asm/kvm_host.h |3 + arch/powerpc/include/asm/kvm_ppc.h |2 + arch/powerpc/include/uapi/asm/kvm.h |7 ++ arch/powerpc/kvm/book3s_64_vio.c| 198 ++- arch/powerpc/kvm/book3s_64_vio_hv.c | 193 +- arch/powerpc/kvm/powerpc.c | 12 +++ include/uapi/linux/kvm.h|4 + 8 files changed, 441 insertions(+), 6 deletions(-) diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt index 3c7c7ea..3c8e9fe 100644 --- a/Documentation/virtual/kvm/api.txt +++ b/Documentation/virtual/kvm/api.txt @@ 
-2362,6 +2362,34 @@ calls by the guest for that service will be passed to userspace to be handled. +4.79 KVM_CREATE_SPAPR_TCE_IOMMU + +Capability: KVM_CAP_SPAPR_TCE_IOMMU +Architectures: powerpc +Type: vm ioctl +Parameters: struct kvm_create_spapr_tce_iommu (in) +Returns: 0 on success, -1 on error + +This creates a link between an IOMMU group and a hardware TCE (translation +control entry) table. This link lets the host kernel know what IOMMU +group (i.e. TCE table) to use for the LIOBN number passed with +H_PUT_TCE, H_PUT_TCE_INDIRECT, H_STUFF_TCE hypercalls. + +/* for KVM_CAP_SPAPR_TCE_IOMMU */ +struct kvm_create_spapr_tce_iommu { + __u64 liobn; + __u32 iommu_id; + __u32 flags; +}; + +No flags are supported at the moment. + +When the guest issues a TCE call on a liobn for which a TCE table has been +registered, the kernel will handle it in real mode, updating the hardware +TCE table. TCE table calls for other liobns will cause a vm exit and must +be handled by userspace. + + 5. The kvm_run structure diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h index 85d8f26..ac0e2fe 100644 --- a/arch/powerpc/include/asm/kvm_host.h +++ b/arch/powerpc/include/asm/kvm_host.h @@ -180,6 +180,7 @@ struct kvmppc_spapr_tce_table { struct kvm *kvm; u64 liobn; u32 window_size; + struct iommu_group *grp;/* used for IOMMU groups */ struct page *pages[0]; }; @@ -611,6 +612,8 @@ struct kvm_vcpu_arch { u64 busy_preempt; unsigned long *tce_tmp;/* TCE cache for H_PUT_TCE_INDIRECT hcall */ + unsigned long tce_tmp_num; /* Number of handled TCEs in the cache */ + unsigned long tce_reason; /* The reason for switching to virtual mode */ #endif }; diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h index e852921b..934e01d 100644 --- a/arch/powerpc/include/asm/kvm_ppc.h +++ b/arch/powerpc/include/asm/kvm_ppc.h @@ -133,6 +133,8 @@ extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu); extern long kvm_vm_ioctl_create_spapr_tce(struct kvm 
*kvm, struct kvm_create_spapr_tce *args); +extern long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm, + struct kvm_create_spapr_tce_iommu *args); extern struct kvmppc_spapr_tce_table *kvmppc_find_tce_table( struct kvm_vcpu *vcpu, unsigned long liobn); extern long kvmppc_emulated_validate_tce(unsigned long tce); diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h index 0fb1a6e..cf82af4 100644 --- a/arch/powerpc/include/uapi/asm/kvm.h +++ b/arch/powerpc/include/uapi/asm/kvm.h @@ -319,6 +319,13 @@ struct kvm_create_spapr_tce { __u32 window_size; }; +/* for KVM_CAP_SPAPR_TCE_IOMMU
[PATCH 4/4] KVM: PPC: Add hugepage support for IOMMU in-kernel handling
This adds special support for huge pages (16MB). The reference counting cannot be easily done for such pages in real mode (when the MMU is off) so we added a list of huge pages. It is populated in virtual mode and get_page is called just once per huge page. Real mode handlers check if the requested page is huge and in the list; if so, no reference counting is done, otherwise an exit to virtual mode happens. The list is released at KVM exit. At the moment the fastest card available for tests uses up to 9 huge pages so walking through this list is not very expensive. However this can change and we may want to optimize this. Cc: David Gibson da...@gibson.dropbear.id.au Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru Signed-off-by: Paul Mackerras pau...@samba.org --- Changes: * the real mode handler now searches for a huge page by gpa (used to be pte) * the virtual mode handler prints a warning if it is called twice for the same huge page as the real mode handler is expected to fail just once - when a huge page is not in the list yet. * the huge page is refcounted twice - when added to the hugepage list and when used in the virtual mode hcall handler (can be optimized but it will make the patch less nice). 
--- arch/powerpc/include/asm/kvm_host.h |2 + arch/powerpc/include/asm/kvm_ppc.h | 22 + arch/powerpc/kvm/book3s_64_vio.c| 88 +-- arch/powerpc/kvm/book3s_64_vio_hv.c | 40 ++-- 4 files changed, 146 insertions(+), 6 deletions(-) diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h index ac0e2fe..4fc0865 100644 --- a/arch/powerpc/include/asm/kvm_host.h +++ b/arch/powerpc/include/asm/kvm_host.h @@ -181,6 +181,8 @@ struct kvmppc_spapr_tce_table { u64 liobn; u32 window_size; struct iommu_group *grp;/* used for IOMMU groups */ + struct list_head hugepages; /* used for IOMMU groups */ + spinlock_t hugepages_lock; /* used for IOMMU groups */ struct page *pages[0]; }; diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h index 934e01d..9054df0 100644 --- a/arch/powerpc/include/asm/kvm_ppc.h +++ b/arch/powerpc/include/asm/kvm_ppc.h @@ -149,6 +149,28 @@ extern long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu, extern long kvmppc_virtmode_h_stuff_tce(struct kvm_vcpu *vcpu, unsigned long liobn, unsigned long ioba, unsigned long tce_value, unsigned long npages); + +/* + * The KVM guest can be backed with 16MB pages (qemu switch + * -mem-path /var/lib/hugetlbfs/global/pagesize-16MB/). + * In this case, we cannot do page counting from the real mode + * as the compound pages are used - they are linked in a list + * with pointers as virtual addresses which are inaccessible + * in real mode. + * + * The code below keeps a 16MB pages list and uses page struct + * in real mode if it is already locked in RAM and inserted into + * the list or switches to the virtual mode where it can be + * handled in a usual manner. 
+ */ +struct kvmppc_iommu_hugepage { + struct list_head list; + pte_t pte; /* Huge page PTE */ + unsigned long gpa; /* Guest physical address */ + struct page *page; /* page struct of the very first subpage */ + unsigned long size; /* Huge page size (always 16MB at the moment) */ +}; + extern long kvm_vm_ioctl_allocate_rma(struct kvm *kvm, struct kvm_allocate_rma *rma); extern struct kvmppc_linear_info *kvm_alloc_rma(void); diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c index ffb4698..c34d63a 100644 --- a/arch/powerpc/kvm/book3s_64_vio.c +++ b/arch/powerpc/kvm/book3s_64_vio.c @@ -45,6 +45,71 @@ #define TCES_PER_PAGE (PAGE_SIZE / sizeof(u64)) #define ERROR_ADDR ((void *)~(unsigned long)0x0) +#ifdef CONFIG_IOMMU_API +/* Adds a new huge page descriptor to the list */ +static long kvmppc_iommu_hugepage_try_add( + struct kvmppc_spapr_tce_table *tt, + pte_t pte, unsigned long hva, unsigned long gpa, + unsigned long pg_size) +{ + long ret = 0; + struct kvmppc_iommu_hugepage *hp; + struct page *p; + + spin_lock(&tt->hugepages_lock); + list_for_each_entry(hp, &tt->hugepages, list) { + if (hp->pte == pte) + goto unlock_exit; + } + + hva = hva & ~(pg_size - 1); + ret = get_user_pages_fast(hva, 1, true/*write*/, &p); + if ((ret != 1) || !p) { + ret = -EFAULT; + goto unlock_exit; + } + ret = 0; + + hp = kzalloc(sizeof(*hp), GFP_KERNEL); + if (!hp) { + ret = -ENOMEM; + goto unlock_exit; + } + + hp->page = p; + hp->pte = pte; + hp->gpa = gpa & ~(pg_size - 1); + hp->size = pg_size; + + list_add(&hp->list, &tt->hugepages); + +unlock_exit
Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
On 05/25/2013 12:45 PM, David Gibson wrote: On Wed, May 22, 2013 at 04:06:57PM -0500, Scott Wood wrote: On 05/20/2013 10:06:46 PM, Alexey Kardashevskiy wrote: diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c index 8465c2a..da6bf61 100644 --- a/arch/powerpc/kvm/powerpc.c +++ b/arch/powerpc/kvm/powerpc.c @@ -396,6 +396,7 @@ int kvm_dev_ioctl_check_extension(long ext) break; #endif case KVM_CAP_SPAPR_MULTITCE: + case KVM_CAP_SPAPR_TCE_IOMMU: r = 1; break; default: Don't advertise SPAPR capabilities if it's not book3s -- and probably there's some additional limitation that would be appropriate. So, in the case of MULTITCE, that's not quite right. PR KVM can emulate a PAPR system on a BookE machine, and there's no reason not to allow TCE acceleration as well. We can't make it dependent on PAPR mode being selected, because that's enabled per-vcpu, whereas these capabilities are queried on the VM before the vcpus are created. CAP_SPAPR_TCE_IOMMU should be dependent on the presence of suitable host side hardware (i.e. a PAPR style IOMMU), though. The capability says that the ioctl is supported. If there is no IOMMU group registered, then it will fail with a reasonable error and nobody gets hurt. What is the problem? 
@@ -1025,6 +1026,17 @@ long kvm_arch_vm_ioctl(struct file *filp, r = kvm_vm_ioctl_create_spapr_tce(kvm, &create_tce); goto out; } + case KVM_CREATE_SPAPR_TCE_IOMMU: { + struct kvm_create_spapr_tce_iommu create_tce_iommu; + struct kvm *kvm = filp->private_data; + + r = -EFAULT; + if (copy_from_user(&create_tce_iommu, argp, + sizeof(create_tce_iommu))) + goto out; + r = kvm_vm_ioctl_create_spapr_tce_iommu(kvm, &create_tce_iommu); + goto out; + } #endif /* CONFIG_PPC_BOOK3S_64 */ #ifdef CONFIG_KVM_BOOK3S_64_HV diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index 5a2afda..450c82a 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -667,6 +667,7 @@ struct kvm_ppc_smmu_info { #define KVM_CAP_PPC_RTAS 91 #define KVM_CAP_IRQ_XICS 92 #define KVM_CAP_SPAPR_MULTITCE (0x11 + 89) +#define KVM_CAP_SPAPR_TCE_IOMMU (0x11 + 90) Hmm... Ah, yeah, that needs to be fixed. Those were interim numbers so that we didn't have to keep changing our internal trees as new upstream ioctls got added to the list. We need to get a proper number for the merge, though. @@ -939,6 +940,9 @@ struct kvm_s390_ucas_mapping { #define KVM_GET_DEVICE_ATTR _IOW(KVMIO, 0xe2, struct kvm_device_attr) #define KVM_HAS_DEVICE_ATTR _IOW(KVMIO, 0xe3, struct kvm_device_attr) +/* ioctl for SPAPR TCE IOMMU */ +#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO, 0xe4, struct kvm_create_spapr_tce_iommu) Shouldn't this go under the vm ioctl section? The KVM_CREATE_SPAPR_TCE ioctl (the version for emulated devices) is in this section so I decided to keep them together. Wrong? -- Alexey 
Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
On 05/27/2013 08:23 PM, Paolo Bonzini wrote: On 25/05/2013 04:45, David Gibson wrote: + case KVM_CREATE_SPAPR_TCE_IOMMU: { + struct kvm_create_spapr_tce_iommu create_tce_iommu; + struct kvm *kvm = filp->private_data; + + r = -EFAULT; + if (copy_from_user(&create_tce_iommu, argp, + sizeof(create_tce_iommu))) + goto out; + r = kvm_vm_ioctl_create_spapr_tce_iommu(kvm, &create_tce_iommu); + goto out; + } Would it make sense to make this the only interface for creating TCEs? That is, pass both a window_size and an IOMMU group id (or e.g. -1 for no hardware IOMMU usage), and have a single ioctl for both cases? There's some duplicated code between kvm_vm_ioctl_create_spapr_tce and kvm_vm_ioctl_create_spapr_tce_iommu. Just a few bits. Is there really much sense in making one function from those two? I tried, it looked a bit messy. KVM_CREATE_SPAPR_TCE could stay for backwards-compatibility, or you could just use a new capability and drop the old ioctl. The old capability+ioctl have already existed for quite a while and a few QEMU versions supporting them were released, so we do not want to just drop them. So then what is the benefit of having a new interface with support of both types? I'm not sure whether you're already considering the ABI to be stable for kvmppc. Is any bit of KVM using it? I cannot see it in Documentation/ABI. -- Alexey
Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
On 05/29/2013 03:45 AM, Scott Wood wrote: On 05/26/2013 09:44:24 PM, Alexey Kardashevskiy wrote: On 05/25/2013 12:45 PM, David Gibson wrote: On Wed, May 22, 2013 at 04:06:57PM -0500, Scott Wood wrote: On 05/20/2013 10:06:46 PM, Alexey Kardashevskiy wrote: diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c index 8465c2a..da6bf61 100644 --- a/arch/powerpc/kvm/powerpc.c +++ b/arch/powerpc/kvm/powerpc.c @@ -396,6 +396,7 @@ int kvm_dev_ioctl_check_extension(long ext) break; #endif case KVM_CAP_SPAPR_MULTITCE: +case KVM_CAP_SPAPR_TCE_IOMMU: r = 1; break; default: Don't advertise SPAPR capabilities if it's not book3s -- and probably there's some additional limitation that would be appropriate. So, in the case of MULTITCE, that's not quite right. PR KVM can emulate a PAPR system on a BookE machine, and there's no reason not to allow TCE acceleration as well. We can't make it dependent on PAPR mode being selected, because that's enabled per-vcpu, whereas these capabilities are queried on the VM before the vcpus are created. CAP_SPAPR_TCE_IOMMU should be dependent on the presence of suitable host side hardware (i.e. a PAPR style IOMMU), though. The capability says that the ioctl is supported. If there is no IOMMU group registered, then it will fail with a reasonable error and nobody gets hurt. What is the problem? You could say that about a lot of the capabilities that just advertise the existence of new ioctls. :-) Sometimes it's nice to know in advance whether it's supported, before actually requesting that something happen. Yes, it would be nice. There is just no quick way to know if this real system supports IOMMU groups.
I could add another helper to generic IOMMU code which would return the number of registered IOMMU groups but it is a bit too much :) @@ -939,6 +940,9 @@ struct kvm_s390_ucas_mapping { #define KVM_GET_DEVICE_ATTR _IOW(KVMIO, 0xe2, struct kvm_device_attr) #define KVM_HAS_DEVICE_ATTR _IOW(KVMIO, 0xe3, struct kvm_device_attr) +/* ioctl for SPAPR TCE IOMMU */ +#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO, 0xe4, struct kvm_create_spapr_tce_iommu) Shouldn't this go under the vm ioctl section? The KVM_CREATE_SPAPR_TCE_IOMMU ioctl (the version for emulated devices) is in this section so I decided to keep them together. Wrong? You decided to keep KVM_CREATE_SPAPR_TCE_IOMMU together with KVM_CREATE_SPAPR_TCE_IOMMU? Yes. -- Alexey
Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
On 05/29/2013 09:35 AM, Scott Wood wrote: On 05/28/2013 06:30:40 PM, Alexey Kardashevskiy wrote: @@ -939,6 +940,9 @@ struct kvm_s390_ucas_mapping { #define KVM_GET_DEVICE_ATTR _IOW(KVMIO, 0xe2, struct kvm_device_attr) #define KVM_HAS_DEVICE_ATTR _IOW(KVMIO, 0xe3, struct kvm_device_attr) +/* ioctl for SPAPR TCE IOMMU */ +#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO, 0xe4, struct kvm_create_spapr_tce_iommu) Shouldn't this go under the vm ioctl section? The KVM_CREATE_SPAPR_TCE_IOMMU ioctl (the version for emulated devices) is in this section so I decided to keep them together. Wrong? You decided to keep KVM_CREATE_SPAPR_TCE_IOMMU together with KVM_CREATE_SPAPR_TCE_IOMMU? Yes. Sigh. That's the same thing repeated. There's only one IOCTL. Nothing is being kept together. Sorry, I meant this ioctl - KVM_CREATE_SPAPR_TCE. -- Alexey
Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
On 05/29/2013 02:32 AM, Scott Wood wrote: On 05/24/2013 09:45:24 PM, David Gibson wrote: On Wed, May 22, 2013 at 04:06:57PM -0500, Scott Wood wrote: On 05/20/2013 10:06:46 PM, Alexey Kardashevskiy wrote: diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c index 8465c2a..da6bf61 100644 --- a/arch/powerpc/kvm/powerpc.c +++ b/arch/powerpc/kvm/powerpc.c @@ -396,6 +396,7 @@ int kvm_dev_ioctl_check_extension(long ext) break; #endif case KVM_CAP_SPAPR_MULTITCE: +case KVM_CAP_SPAPR_TCE_IOMMU: r = 1; break; default: Don't advertise SPAPR capabilities if it's not book3s -- and probably there's some additional limitation that would be appropriate. So, in the case of MULTITCE, that's not quite right. PR KVM can emulate a PAPR system on a BookE machine, and there's no reason not to allow TCE acceleration as well. That might (or might not; consider things like Altivec versus SPE opcode conflict, whether unimplemented SPRs trap, behavior of unprivileged SPRs/instructions, etc) be true in theory, but it's not currently a supported configuration. BookE KVM does not support emulating a different CPU than the host. In the unlikely case that ever changes to the point of allowing PAPR guests on a BookE host, then we can revisit how to properly determine whether the capability is supported, but for now the capability will never be valid in the CONFIG_BOOKE case (though I'd rather see it depend on an appropriate book3s symbol than depend on !BOOKE). Or we could just leave it as is, and let it indicate whether the host kernel supports the feature in general, with the user needing to understand when it's applicable... I'm a bit confused by the documentation, however -- the MULTITCE capability was documented in the capabilities that can be enabled section, but I don't see where it can be enabled. True, it cannot be enabled (though it could be enabled a long time ago); it is either supported or not. I'll fix the documentation. Thanks!
-- Alexey
Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
On 05/30/2013 06:05 AM, Scott Wood wrote: On 05/28/2013 07:12:32 PM, Alexey Kardashevskiy wrote: On 05/29/2013 09:35 AM, Scott Wood wrote: On 05/28/2013 06:30:40 PM, Alexey Kardashevskiy wrote: @@ -939,6 +940,9 @@ struct kvm_s390_ucas_mapping { #define KVM_GET_DEVICE_ATTR _IOW(KVMIO, 0xe2, struct kvm_device_attr) #define KVM_HAS_DEVICE_ATTR _IOW(KVMIO, 0xe3, struct kvm_device_attr) +/* ioctl for SPAPR TCE IOMMU */ +#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO, 0xe4, struct kvm_create_spapr_tce_iommu) Shouldn't this go under the vm ioctl section? The KVM_CREATE_SPAPR_TCE_IOMMU ioctl (the version for emulated devices) is in this section so I decided to keep them together. Wrong? You decided to keep KVM_CREATE_SPAPR_TCE_IOMMU together with KVM_CREATE_SPAPR_TCE_IOMMU? Yes. Sigh. That's the same thing repeated. There's only one IOCTL. Nothing is being kept together. Sorry, I meant this ioctl - KVM_CREATE_SPAPR_TCE. But you didn't put it in the same section as KVM_CREATE_SPAPR_TCE. 0xe0 begins a different section. It is not really obvious that there are sections as no comment defines those :) But yes, makes sense to move it up a bit and change the code to 0xad. -- Alexey
Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
On 05/30/2013 09:14 AM, Scott Wood wrote: On 05/29/2013 06:10:33 PM, Alexey Kardashevskiy wrote: On 05/30/2013 06:05 AM, Scott Wood wrote: On 05/28/2013 07:12:32 PM, Alexey Kardashevskiy wrote: On 05/29/2013 09:35 AM, Scott Wood wrote: On 05/28/2013 06:30:40 PM, Alexey Kardashevskiy wrote: @@ -939,6 +940,9 @@ struct kvm_s390_ucas_mapping { #define KVM_GET_DEVICE_ATTR _IOW(KVMIO, 0xe2, struct kvm_device_attr) #define KVM_HAS_DEVICE_ATTR _IOW(KVMIO, 0xe3, struct kvm_device_attr) +/* ioctl for SPAPR TCE IOMMU */ +#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO, 0xe4, struct kvm_create_spapr_tce_iommu) Shouldn't this go under the vm ioctl section? The KVM_CREATE_SPAPR_TCE_IOMMU ioctl (the version for emulated devices) is in this section so I decided to keep them together. Wrong? You decided to keep KVM_CREATE_SPAPR_TCE_IOMMU together with KVM_CREATE_SPAPR_TCE_IOMMU? Yes. Sigh. That's the same thing repeated. There's only one IOCTL. Nothing is being kept together. Sorry, I meant this ioctl - KVM_CREATE_SPAPR_TCE. But you didn't put it in the same section as KVM_CREATE_SPAPR_TCE. 0xe0 begins a different section. It is not really obvious that there are sections as no comment defines those :) There is a comment /* ioctls for fds returned by KVM_CREATE_DEVICE */ Putting KVM_CREATE_DEVICE in there was mainly to avoid dealing with the ioctl number conflict mess in the vm-ioctl section, but at least that one is related to the device control API. :-) But yes, makes sense to move it up a bit and change the code to 0xad. 0xad is KVM_KVMCLOCK_CTRL That's it. I am _completely_ confused now. No system whatsoever :( What rule should I use in order to choose the number for my new ioctl? :) -- Alexey
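The numbering confusion in the thread above comes from how ioctl command words are built: the 0xe0..0xe4 codes are only the `nr` byte of a 32-bit command that also encodes the transfer direction, the argument size and KVM's magic byte (KVMIO, 0xAE), and only that `nr` byte has to be unique per magic. A small userspace sketch makes this visible; the struct layout is copied from the patch, everything else is the standard Linux `_IOW` machinery from `<linux/ioctl.h>`:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <linux/ioctl.h>

#define KVMIO 0xAE  /* KVM's ioctl magic from include/uapi/linux/kvm.h */

/* Argument struct as proposed in the patch (stdint stand-ins for __u64/__u32) */
struct kvm_create_spapr_tce_iommu {
	uint64_t liobn;
	uint32_t iommu_id;
	uint32_t flags;
};

#define KVM_CREATE_SPAPR_TCE_IOMMU \
	_IOW(KVMIO, 0xe4, struct kvm_create_spapr_tce_iommu)

/* Decode a command word into the fields _IOW packed into it */
static void decode(unsigned int cmd)
{
	printf("dir=%u type=0x%x nr=0x%x size=%u\n",
	       _IOC_DIR(cmd), _IOC_TYPE(cmd), _IOC_NR(cmd), _IOC_SIZE(cmd));
}
```

The sections in kvm.h (vm ioctls, vcpu ioctls, device-fd ioctls) are purely a source-code convention layered on this flat `nr` namespace, which is why picking the next free code, rather than the code of the nearest section, is what actually matters.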
[PATCH 2/4] powerpc: Prepare to support kernel handling of IOMMU map/unmap
The current VFIO-on-POWER implementation supports only user mode driven mapping, i.e. QEMU is sending requests to map/unmap pages. However this approach is really slow, so we want to move that to KVM. Since H_PUT_TCE can be extremely performance sensitive (especially with network adapters where each packet needs to be mapped/unmapped) we chose to implement that as a fast hypercall directly in real mode (processor still in the guest context but MMU off). To be able to do that, we need to provide some facilities to access the struct page count within that real mode environment as things like the sparsemem vmemmap mappings aren't accessible. This adds an API to increment/decrement the page counter, as the get_user_pages API used for user mode mapping does not work in real mode. CONFIG_SPARSEMEM_VMEMMAP and CONFIG_FLATMEM are supported. Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru Reviewed-by: Paul Mackerras pau...@samba.org Cc: David Gibson da...@gibson.dropbear.id.au Signed-off-by: Paul Mackerras pau...@samba.org --- Changes: 2013-05-20: * PageTail() is replaced by PageCompound() in order to have the same checks for whether the page is huge in realmode_get_page() and realmode_put_page() --- arch/powerpc/include/asm/pgtable-ppc64.h |4 ++ arch/powerpc/mm/init_64.c| 77 +- 2 files changed, 80 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h b/arch/powerpc/include/asm/pgtable-ppc64.h index e3d55f6f..7b46e5f 100644 --- a/arch/powerpc/include/asm/pgtable-ppc64.h +++ b/arch/powerpc/include/asm/pgtable-ppc64.h @@ -376,6 +376,10 @@ static inline pte_t *find_linux_pte_or_hugepte(pgd_t *pgdir, unsigned long ea, } #endif /* !CONFIG_HUGETLB_PAGE */ +struct page *realmode_pfn_to_page(unsigned long pfn); +int realmode_get_page(struct page *page); +int realmode_put_page(struct page *page); + #endif /* __ASSEMBLY__ */ #endif /* _ASM_POWERPC_PGTABLE_PPC64_H_ */ diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c index a90b9c4..ce3d8d4
100644 --- a/arch/powerpc/mm/init_64.c +++ b/arch/powerpc/mm/init_64.c @@ -297,5 +297,80 @@ void vmemmap_free(unsigned long start, unsigned long end) { } -#endif /* CONFIG_SPARSEMEM_VMEMMAP */ +/* + * We do not have access to the sparsemem vmemmap, so we fall back to + * walking the list of sparsemem blocks which we already maintain for + * the sake of crashdump. In the long run, we might want to maintain + * a tree if performance of that linear walk becomes a problem. + * + * Any of the realmode_ functions can fail due to: + * 1) As real sparsemem blocks do not lie in RAM contiguously (they + * are in virtual address space which is not available in real mode), + * the requested page struct can be split between blocks so get_page/put_page + * may fail. + * 2) When huge pages are used, the get_page/put_page API will fail + * in real mode as the linked addresses in the page struct are virtual + * too. + * When 1) or 2) takes place, the API returns an error code to cause + * an exit to kernel virtual mode where the operation will be completed.
+ */ +struct page *realmode_pfn_to_page(unsigned long pfn) +{ + struct vmemmap_backing *vmem_back; + struct page *page; + unsigned long page_size = 1 << mmu_psize_defs[mmu_vmemmap_psize].shift; + unsigned long pg_va = (unsigned long) pfn_to_page(pfn); + + for (vmem_back = vmemmap_list; vmem_back; vmem_back = vmem_back->list) { + if (pg_va < vmem_back->virt_addr) + continue; + /* Check that page struct is not split between real pages */ + if ((pg_va + sizeof(struct page)) > (vmem_back->virt_addr + page_size)) + return NULL; + + page = (struct page *) (vmem_back->phys + pg_va - + vmem_back->virt_addr); + return page; + } + + return NULL; +} +EXPORT_SYMBOL_GPL(realmode_pfn_to_page); + +#elif defined(CONFIG_FLATMEM) + +struct page *realmode_pfn_to_page(unsigned long pfn) +{ + struct page *page = pfn_to_page(pfn); + return page; +} +EXPORT_SYMBOL_GPL(realmode_pfn_to_page); + +#endif /* CONFIG_SPARSEMEM_VMEMMAP/CONFIG_FLATMEM */ + +#if defined(CONFIG_SPARSEMEM_VMEMMAP) || defined(CONFIG_FLATMEM) +int realmode_get_page(struct page *page) +{ + if (PageCompound(page)) + return -EAGAIN; + + get_page(page); + + return 0; +} +EXPORT_SYMBOL_GPL(realmode_get_page); + +int realmode_put_page(struct page *page) +{ + if (PageCompound(page)) + return -EAGAIN; + + if (!atomic_add_unless(&page->_count, -1, 1)) + return -EAGAIN; + + return 0; +} +EXPORT_SYMBOL_GPL(realmode_put_page); +#endif -- 1.7.10.4
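The containment check in realmode_pfn_to_page() above can be illustrated with a userspace model (block layout, sizes and names are invented for the sketch; the real code walks vmemmap_list and gets the block size from mmu_psize_defs): the walk only succeeds when the whole page struct fits inside one mapped block, so its real address can be computed as phys + (va - virt_addr).

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of a vmemmap backing block: a chunk of the virtual
 * page-struct array together with the real address backing it. */
struct vmemmap_block {
	unsigned long virt_addr;	/* block start in virtual space */
	unsigned long phys;		/* backing real (physical) address */
};

#define BLOCK_SIZE        0x10000UL	/* assumed backing page size */
#define PAGE_STRUCT_SIZE  64UL		/* stand-in for sizeof(struct page) */

/* Return the real address of the page struct at pg_va, or 0 when the
 * struct is not fully contained in any block, which is exactly the
 * case where the real mode handler must bail out to virtual mode. */
static unsigned long block_va_to_real(const struct vmemmap_block *blocks,
				      int nblocks, unsigned long pg_va)
{
	for (int i = 0; i < nblocks; i++) {
		if (pg_va < blocks[i].virt_addr)
			continue;
		/* page struct must not be split between real pages */
		if (pg_va + PAGE_STRUCT_SIZE >
		    blocks[i].virt_addr + BLOCK_SIZE)
			continue;
		return blocks[i].phys + (pg_va - blocks[i].virt_addr);
	}
	return 0;
}
```

A page struct that straddles a block boundary returns 0 here, mirroring the NULL return in the kernel code that forces an exit to virtual mode.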
[PATCH 1/4] KVM: PPC: Add support for multiple-TCE hcalls
This adds real mode handlers for the H_PUT_TCE_INDIRECT and H_STUFF_TCE hypercalls for QEMU emulated devices such as IBMVIO devices or emulated PCI. These calls allow adding multiple entries (up to 512) into the TCE table in one call which saves time on transition to/from real mode. This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs (copied from user and verified) before writing the whole list into the TCE table. This cache will be utilized more in the upcoming VFIO/IOMMU support to continue TCE list processing in the virtual mode in case the real mode handler fails for some reason. This adds a guest physical to host real address converter and calls the existing H_PUT_TCE handler. The converting function is going to be fully utilized by upcoming VFIO supporting patches. This also implements the KVM_CAP_PPC_MULTITCE capability, so in order to support the functionality of this patch, QEMU needs to query for this capability and set the hcall-multi-tce hypertas property only if the capability is present, otherwise there will be serious performance degradation.
Cc: David Gibson da...@gibson.dropbear.id.au Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru Signed-off-by: Paul Mackerras pau...@samba.org --- Changelog: 2013/06/05: * fixed typo about IBMVIO in the commit message * updated doc and moved it to another section * changed capability number 2013/05/21: * added kvm_vcpu_arch::tce_tmp * removed cleanup if put_indirect failed, instead we do not even start writing to TCE table if we cannot get TCEs from the user and they are invalid * kvmppc_emulated_h_put_tce is split to kvmppc_emulated_put_tce and kvmppc_emulated_validate_tce (for the previous item) * fixed bug with fallthrough for H_IPI * removed all get_user() from real mode handlers * kvmppc_lookup_pte() added (instead of making lookup_linux_pte public) --- Documentation/virtual/kvm/api.txt | 17 ++ arch/powerpc/include/asm/kvm_host.h |2 + arch/powerpc/include/asm/kvm_ppc.h | 16 +- arch/powerpc/kvm/book3s_64_vio.c| 118 ++ arch/powerpc/kvm/book3s_64_vio_hv.c | 266 +++ arch/powerpc/kvm/book3s_hv.c| 39 + arch/powerpc/kvm/book3s_hv_rmhandlers.S |6 + arch/powerpc/kvm/book3s_pr_papr.c | 37 - arch/powerpc/kvm/powerpc.c |3 + include/uapi/linux/kvm.h|1 + 10 files changed, 473 insertions(+), 32 deletions(-) diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt index 5f91eda..6c082ff 100644 --- a/Documentation/virtual/kvm/api.txt +++ b/Documentation/virtual/kvm/api.txt @@ -2362,6 +2362,23 @@ calls by the guest for that service will be passed to userspace to be handled. +4.83 KVM_CAP_PPC_MULTITCE + +Capability: KVM_CAP_PPC_MULTITCE +Architectures: ppc +Type: vm + +This capability tells the guest that multiple TCE entry add/remove hypercalls +handling is supported by the kernel. This significantly accelerates DMA +operations for PPC KVM guests. + +Unlike other capabilities in this section, this one does not have an ioctl.
+Instead, when the capability is present, the H_PUT_TCE_INDIRECT and +H_STUFF_TCE hypercalls are to be handled in the host kernel and not passed to +user space. Otherwise it might be better for the guest to continue using the H_PUT_TCE +hypercall (if KVM_CAP_SPAPR_TCE or KVM_CAP_SPAPR_TCE_IOMMU are present). + + 5. The kvm_run structure diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h index af326cd..85d8f26 100644 --- a/arch/powerpc/include/asm/kvm_host.h +++ b/arch/powerpc/include/asm/kvm_host.h @@ -609,6 +609,8 @@ struct kvm_vcpu_arch { spinlock_t tbacct_lock; u64 busy_stolen; u64 busy_preempt; + + unsigned long *tce_tmp;/* TCE cache for H_PUT_TCE_INDIRECT hcall */ #endif }; diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h index a5287fe..e852921b 100644 --- a/arch/powerpc/include/asm/kvm_ppc.h +++ b/arch/powerpc/include/asm/kvm_ppc.h @@ -133,8 +133,20 @@ extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu); extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm, struct kvm_create_spapr_tce *args); -extern long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn, -unsigned long ioba, unsigned long tce); +extern struct kvmppc_spapr_tce_table *kvmppc_find_tce_table( + struct kvm_vcpu *vcpu, unsigned long liobn); +extern long kvmppc_emulated_validate_tce(unsigned long tce); +extern void kvmppc_emulated_put_tce(struct kvmppc_spapr_tce_table *tt, + unsigned long ioba, unsigned long tce); +extern long kvmppc_virtmode_h_put_tce(struct kvm_vcpu *vcpu, + unsigned long liobn, unsigned long ioba, + unsigned long tce); +extern long kvmppc_virtmode_h_put_tce_indirect(struct
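The tce_tmp design described in this patch can be illustrated with a small userspace model (the reserved-bit mask and flat table layout are invented for the sketch): entries are validated into a temporary cache first, and the table is written only after the whole list has passed, so a bad entry never leaves the table half-updated.

```c
#include <assert.h>

#define MAX_TCES 512		/* H_PUT_TCE_INDIRECT handles up to 512 entries */
#define TCE_RSVD 0xFF8UL	/* hypothetical reserved bits that must be zero */

/* Two-phase update: validate all TCEs into a temporary cache (the role
 * played by kvm_vcpu_arch::tce_tmp), then commit the whole list. */
static long put_tce_indirect(unsigned long *table, unsigned long idx,
			     const unsigned long *tces, unsigned long n)
{
	unsigned long tmp[MAX_TCES];
	unsigned long i;

	if (n > MAX_TCES)
		return -1;
	for (i = 0; i < n; i++) {
		if (tces[i] & TCE_RSVD)
			return -1;	/* reject before anything is written */
		tmp[i] = tces[i];
	}
	for (i = 0; i < n; i++)		/* commit step */
		table[idx + i] = tmp[i];
	return 0;
}
```

This matches the changelog note above: if the TCEs cannot be fetched from the user or are invalid, the handler does not even start writing to the TCE table, so no cleanup pass is needed on failure.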
[PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT and H_STUFF_TCE requests without passing them to QEMU, which should save time on switching to QEMU and back. Both real and virtual modes are supported - whenever the kernel fails to handle a TCE request, it passes it to the virtual mode. If the virtual mode handlers fail, then the request is passed to the user mode, for example, to QEMU. This adds a new KVM_CREATE_SPAPR_TCE_IOMMU ioctl (advertised by the KVM_CAP_SPAPR_TCE_IOMMU capability) to associate a virtual PCI bus ID (LIOBN) with an IOMMU group, which enables in-kernel handling of IOMMU map/unmap. Tests show that this patch increases transmission speed from 220MB/s to 750..1020MB/s on 10Gb network (Chelsio CXGB3 10Gb ethernet card). Cc: David Gibson da...@gibson.dropbear.id.au Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru Signed-off-by: Paul Mackerras pau...@samba.org --- Changes: 2013/06/05: * changed capability number * changed ioctl number * updated the doc article number 2013/05/20: * removed get_user() from real mode handlers * kvm_vcpu_arch::tce_tmp usage extended.
Now the real mode handler puts translated TCEs there, tries realmode_get_page() on those and if that fails, passes control over to the virtual mode handler which tries to finish the request handling * kvmppc_lookup_pte() now does realmode_get_page() protected by BUSY bit on a page * The only reason to pass the request to user mode now is when the user mode did not register a TCE table in the kernel, in all other cases the virtual mode handler is expected to do the job --- Documentation/virtual/kvm/api.txt | 28 + arch/powerpc/include/asm/kvm_host.h |3 + arch/powerpc/include/asm/kvm_ppc.h |2 + arch/powerpc/include/uapi/asm/kvm.h |7 ++ arch/powerpc/kvm/book3s_64_vio.c| 198 ++- arch/powerpc/kvm/book3s_64_vio_hv.c | 193 +- arch/powerpc/kvm/powerpc.c | 12 +++ include/uapi/linux/kvm.h|2 + 8 files changed, 439 insertions(+), 6 deletions(-) diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt index 6c082ff..e962e3b 100644 --- a/Documentation/virtual/kvm/api.txt +++ b/Documentation/virtual/kvm/api.txt @@ -2379,6 +2379,34 @@ the guest. Otherwise it might be better for the guest to continue using the H_PUT_TCE hypercall (if KVM_CAP_SPAPR_TCE or KVM_CAP_SPAPR_TCE_IOMMU are present). +4.84 KVM_CREATE_SPAPR_TCE_IOMMU + +Capability: KVM_CAP_SPAPR_TCE_IOMMU +Architectures: powerpc +Type: vm ioctl +Parameters: struct kvm_create_spapr_tce_iommu (in) +Returns: 0 on success, -1 on error + +This creates a link between an IOMMU group and a hardware TCE (translation +control entry) table. This link lets the host kernel know what IOMMU +group (i.e. TCE table) to use for the LIOBN number passed with +H_PUT_TCE, H_PUT_TCE_INDIRECT, H_STUFF_TCE hypercalls. + +/* for KVM_CAP_SPAPR_TCE_IOMMU */ +struct kvm_create_spapr_tce_iommu { + __u64 liobn; + __u32 iommu_id; + __u32 flags; +}; + +No flags are supported at the moment. + +When the guest issues a TCE call on a liobn for which a TCE table has been +registered, the kernel will handle it in real mode, updating the hardware +TCE table.
TCE table calls for other liobns will cause a vm exit and must +be handled by userspace. + + 5. The kvm_run structure diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h index 85d8f26..ac0e2fe 100644 --- a/arch/powerpc/include/asm/kvm_host.h +++ b/arch/powerpc/include/asm/kvm_host.h @@ -180,6 +180,7 @@ struct kvmppc_spapr_tce_table { struct kvm *kvm; u64 liobn; u32 window_size; + struct iommu_group *grp;/* used for IOMMU groups */ struct page *pages[0]; }; @@ -611,6 +612,8 @@ struct kvm_vcpu_arch { u64 busy_preempt; unsigned long *tce_tmp;/* TCE cache for H_PUT_TCE_INDIRECT hcall */ + unsigned long tce_tmp_num; /* Number of handled TCEs in the cache */ + unsigned long tce_reason; /* The reason for switching to the virtmode */ #endif }; diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h index e852921b..934e01d 100644 --- a/arch/powerpc/include/asm/kvm_ppc.h +++ b/arch/powerpc/include/asm/kvm_ppc.h @@ -133,6 +133,8 @@ extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu); extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm, struct kvm_create_spapr_tce *args); +extern long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm, + struct kvm_create_spapr_tce_iommu *args); extern struct kvmppc_spapr_tce_table *kvmppc_find_tce_table( struct kvm_vcpu *vcpu, unsigned long liobn); extern long kvmppc_emulated_validate_tce(unsigned long tce); diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h index 0fb1a6e..cf82af4 100644 --- a/arch/powerpc/include/uapi/asm
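The real mode, then virtual mode, then userspace fallback described in this patch can be modelled with a couple of function pointers (the handler type and return codes are invented for the sketch; in KVM the real split is between the real mode hcall handlers, the virtmode_ helpers and an exit to QEMU):

```c
#include <assert.h>
#include <stddef.h>

enum h_result { H_HANDLED, H_TOO_HARD };	/* invented return codes */

typedef enum h_result (*tce_handler)(unsigned long tce);

/* Try handlers in order: real mode first (fast, MMU off), then the
 * virtual mode handler, and finally give up to userspace (QEMU).
 * Returns the index of the level that completed the request. */
static int handle_tce(tce_handler realmode, tce_handler virtmode,
		      unsigned long tce)
{
	if (realmode && realmode(tce) == H_HANDLED)
		return 0;	/* completed with the MMU off */
	if (virtmode && virtmode(tce) == H_HANDLED)
		return 1;	/* completed in the kernel, MMU on */
	return 2;		/* pass the hypercall to userspace */
}
```

The point of the ordering is that each level is strictly cheaper than the next: a real mode failure costs one extra in-kernel retry, while only a double failure pays the full exit to QEMU.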
[PATCH 4/4] KVM: PPC: Add hugepage support for IOMMU in-kernel handling
This adds special support for huge pages (16MB). The reference counting cannot be easily done for such pages in real mode (when MMU is off) so we added a list of huge pages. It is populated in virtual mode and get_page is called just once per huge page. Real mode handlers check whether the requested page is huge and already in the list; if so, no reference counting is done, otherwise an exit to virtual mode happens. The list is released at KVM exit. At the moment the fastest card available for tests uses up to 9 huge pages so walking through this list is not very expensive. However this can change and we may want to optimize this. Cc: David Gibson da...@gibson.dropbear.id.au Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru Signed-off-by: Paul Mackerras pau...@samba.org --- Changes: 2013/06/05: * fixed compile error when CONFIG_IOMMU_API=n 2013/05/20: * the real mode handler now searches for a huge page by gpa (used to be pte) * the virtual mode handler prints a warning if it is called twice for the same huge page as the real mode handler is expected to fail just once - when a huge page is not in the list yet. * the huge page is refcounted twice - when added to the hugepage list and when used in the virtual mode hcall handler (can be optimized but it will make the patch less nice).
--- arch/powerpc/include/asm/kvm_host.h |2 + arch/powerpc/include/asm/kvm_ppc.h | 22 + arch/powerpc/kvm/book3s_64_vio.c| 88 +-- arch/powerpc/kvm/book3s_64_vio_hv.c | 40 ++-- 4 files changed, 146 insertions(+), 6 deletions(-) diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h index ac0e2fe..4fc0865 100644 --- a/arch/powerpc/include/asm/kvm_host.h +++ b/arch/powerpc/include/asm/kvm_host.h @@ -181,6 +181,8 @@ struct kvmppc_spapr_tce_table { u64 liobn; u32 window_size; struct iommu_group *grp;/* used for IOMMU groups */ + struct list_head hugepages; /* used for IOMMU groups */ + spinlock_t hugepages_lock; /* used for IOMMU groups */ struct page *pages[0]; }; diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h index 934e01d..9054df0 100644 --- a/arch/powerpc/include/asm/kvm_ppc.h +++ b/arch/powerpc/include/asm/kvm_ppc.h @@ -149,6 +149,28 @@ extern long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu, extern long kvmppc_virtmode_h_stuff_tce(struct kvm_vcpu *vcpu, unsigned long liobn, unsigned long ioba, unsigned long tce_value, unsigned long npages); + +/* + * The KVM guest can be backed with 16MB pages (qemu switch + * -mem-path /var/lib/hugetlbfs/global/pagesize-16MB/). + * In this case, we cannot do page counting from the real mode + * as the compound pages are used - they are linked in a list + * with pointers as virtual addresses which are inaccessible + * in real mode. + * + * The code below keeps a 16MB pages list and uses page struct + * in real mode if it is already locked in RAM and inserted into + * the list or switches to the virtual mode where it can be + * handled in a usual manner. 
+ */ +struct kvmppc_iommu_hugepage { + struct list_head list; + pte_t pte; /* Huge page PTE */ + unsigned long gpa; /* Guest physical address */ + struct page *page; /* page struct of the very first subpage */ + unsigned long size; /* Huge page size (always 16MB at the moment) */ +}; + extern long kvm_vm_ioctl_allocate_rma(struct kvm *kvm, struct kvm_allocate_rma *rma); extern struct kvmppc_linear_info *kvm_alloc_rma(void); diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c index ffb4698..9e2ba4d 100644 --- a/arch/powerpc/kvm/book3s_64_vio.c +++ b/arch/powerpc/kvm/book3s_64_vio.c @@ -45,6 +45,71 @@ #define TCES_PER_PAGE (PAGE_SIZE / sizeof(u64)) #define ERROR_ADDR ((void *)~(unsigned long)0x0) +#ifdef CONFIG_IOMMU_API +/* Adds a new huge page descriptor to the list */ +static long kvmppc_iommu_hugepage_try_add( + struct kvmppc_spapr_tce_table *tt, + pte_t pte, unsigned long hva, unsigned long gpa, + unsigned long pg_size) +{ + long ret = 0; + struct kvmppc_iommu_hugepage *hp; + struct page *p; + + spin_lock(&tt->hugepages_lock); + list_for_each_entry(hp, &tt->hugepages, list) { + if (hp->pte == pte) + goto unlock_exit; + } + + hva = hva & ~(pg_size - 1); + ret = get_user_pages_fast(hva, 1, true/*write*/, &p); + if ((ret != 1) || !p) { + ret = -EFAULT; + goto unlock_exit; + } + ret = 0; + + hp = kzalloc(sizeof(*hp), GFP_KERNEL); + if (!hp) { + ret = -ENOMEM; + goto unlock_exit; + } + + hp->page = p; + hp->pte = pte; + hp->gpa = gpa & ~(pg_size - 1); + hp
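The hugepage list handling above can be modelled in userspace (pinning and locking are simplified away; the real code takes hugepages_lock, pins the page with get_user_pages_fast and can also key by PTE): entries are keyed by the 16MB-aligned address, and each huge page is referenced exactly once, on first insertion.

```c
#include <assert.h>
#include <stdlib.h>

#define HPAGE_SIZE (16UL << 20)	/* 16MB huge pages */

/* Userspace model of the hugepage list: one entry per 16MB page,
 * added (and reference-counted) only on first use. */
struct hugepage {
	struct hugepage *next;
	unsigned long gpa;	/* 16MB-aligned guest physical address */
	int refcount;
};

static struct hugepage *hp_find(struct hugepage *head, unsigned long gpa)
{
	unsigned long key = gpa & ~(HPAGE_SIZE - 1);

	for (struct hugepage *hp = head; hp; hp = hp->next)
		if (hp->gpa == key)
			return hp;
	return NULL;
}

/* Add the page containing gpa unless it is already tracked; the
 * get_page (modelled as refcount = 1) happens only once per huge page. */
static struct hugepage *hp_try_add(struct hugepage **head, unsigned long gpa)
{
	struct hugepage *hp = hp_find(*head, gpa);

	if (hp)
		return hp;	/* already in the list, no extra reference */
	hp = calloc(1, sizeof(*hp));
	if (!hp)
		return NULL;
	hp->gpa = gpa & ~(HPAGE_SIZE - 1);
	hp->refcount = 1;
	hp->next = *head;
	*head = hp;
	return hp;
}
```

Because every TCE inside the same 16MB page resolves to the same entry, the real mode handler can service repeat mappings with a plain list walk and no refcount traffic, which is why a handful of entries (the 9 pages mentioned above) keeps the walk cheap.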
Re: [PATCH 1/4] KVM: PPC: Add support for multiple-TCE hcalls
On 06/17/2013 08:06 AM, Alexander Graf wrote: On 05.06.2013, at 08:11, Alexey Kardashevskiy wrote: This adds real mode handlers for the H_PUT_TCE_INDIRECT and H_STUFF_TCE hypercalls for QEMU emulated devices such as IBMVIO devices or emulated PCI. These calls allow adding multiple entries (up to 512) into the TCE table in one call which saves time on transition to/from real mode. This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs (copied from user and verified) before writing the whole list into the TCE table. This cache will be utilized more in the upcoming VFIO/IOMMU support to continue TCE list processing in the virtual mode in case the real mode handler fails for some reason. This adds a guest physical to host real address converter and calls the existing H_PUT_TCE handler. The converting function is going to be fully utilized by upcoming VFIO supporting patches. This also implements the KVM_CAP_PPC_MULTITCE capability, so in order to support the functionality of this patch, QEMU needs to query for this capability and set the hcall-multi-tce hypertas property only if the capability is present, otherwise there will be serious performance degradation. Cc: David Gibson da...@gibson.dropbear.id.au Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru Signed-off-by: Paul Mackerras pau...@samba.org Only a few minor nits. Ben already commented on implementation details.
--- Changelog: 2013/06/05: * fixed typo about IBMVIO in the commit message * updated doc and moved it to another section * changed capability number 2013/05/21: * added kvm_vcpu_arch::tce_tmp * removed cleanup if put_indirect failed, instead we do not even start writing to TCE table if we cannot get TCEs from the user and they are invalid * kvmppc_emulated_h_put_tce is split to kvmppc_emulated_put_tce and kvmppc_emulated_validate_tce (for the previous item) * fixed bug with fallthrough for H_IPI * removed all get_user() from real mode handlers * kvmppc_lookup_pte() added (instead of making lookup_linux_pte public) --- Documentation/virtual/kvm/api.txt | 17 ++ arch/powerpc/include/asm/kvm_host.h |2 + arch/powerpc/include/asm/kvm_ppc.h | 16 +- arch/powerpc/kvm/book3s_64_vio.c| 118 ++ arch/powerpc/kvm/book3s_64_vio_hv.c | 266 +++ arch/powerpc/kvm/book3s_hv.c| 39 + arch/powerpc/kvm/book3s_hv_rmhandlers.S |6 + arch/powerpc/kvm/book3s_pr_papr.c | 37 - arch/powerpc/kvm/powerpc.c |3 + include/uapi/linux/kvm.h|1 + 10 files changed, 473 insertions(+), 32 deletions(-) diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt index 5f91eda..6c082ff 100644 --- a/Documentation/virtual/kvm/api.txt +++ b/Documentation/virtual/kvm/api.txt @@ -2362,6 +2362,23 @@ calls by the guest for that service will be passed to userspace to be handled. +4.83 KVM_CAP_PPC_MULTITCE + +Capability: KVM_CAP_PPC_MULTITCE +Architectures: ppc +Type: vm + +This capability tells the guest that multiple TCE entry add/remove hypercalls +handling is supported by the kernel. This significantly accelerates DMA +operations for PPC KVM guests. + +Unlike other capabilities in this section, this one does not have an ioctl. +Instead, when the capability is present, the H_PUT_TCE_INDIRECT and +H_STUFF_TCE hypercalls are to be handled in the host kernel and not passed to +user space.
Otherwise it might be better for the guest to continue using H_PUT_TCE +hypercall (if KVM_CAP_SPAPR_TCE or KVM_CAP_SPAPR_TCE_IOMMU are present). While this describes perfectly well what the consequences are of the patches, it does not describe properly what the CAP actually expresses. The CAP only says this kernel is able to handle H_PUT_TCE_INDIRECT and H_STUFF_TCE hypercalls directly. All other consequences are nice to document, but the semantics of the CAP are missing. ? It expresses ability to handle 2 hcalls. What is missing? We also usually try to keep KVM behavior unchanged with regards to older versions until a CAP is enabled. In this case I don't think it matters all that much, so I'm fine with declaring it as enabled by default. Please document that this is a change in behavior versus older KVM versions though. Ok! + + 5. The kvm_run structure diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h index af326cd..85d8f26 100644 --- a/arch/powerpc/include/asm/kvm_host.h +++ b/arch/powerpc/include/asm/kvm_host.h @@ -609,6 +609,8 @@ struct kvm_vcpu_arch { spinlock_t tbacct_lock; u64 busy_stolen; u64 busy_preempt; + +unsigned long *tce_tmp;/* TCE cache for TCE_PUT_INDIRECT hcall */ #endif }; [...] [...] diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c index 550f592
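The changelog point about not even starting to write to the TCE table when any entry from the guest's list is invalid (validate first, commit second) is easy to model outside the kernel. A minimal user-space sketch of that pattern; the helper names and TCE bit layout below are simplified stand-ins for kvmppc_emulated_validate_tce()/kvmppc_emulated_put_tce(), not the kernel code:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TCE_ENTRIES   512                    /* max list size per hcall */
#define TCE_PAGE_MASK 0xfffffffffffff000ULL  /* real page number bits */
#define TCE_READ      0x1ULL
#define TCE_WRITE     0x2ULL

/* Stand-in for kvmppc_emulated_validate_tce(): only the page address
 * and the R/W permission bits may be set in a valid TCE. */
static int tce_validate(uint64_t tce)
{
    return (tce & ~(TCE_PAGE_MASK | TCE_READ | TCE_WRITE)) == 0 ? 0 : -1;
}

/* H_PUT_TCE_INDIRECT-shaped helper: copy the guest's list into a
 * temporary cache (the tce_tmp idea), validate every entry, and only
 * then commit, so a bad entry leaves the table completely untouched. */
static int put_tce_indirect(uint64_t *table, size_t ioba_idx,
                            const uint64_t *list, size_t n)
{
    uint64_t tmp[TCE_ENTRIES];
    size_t i;

    if (n > TCE_ENTRIES)
        return -1;
    for (i = 0; i < n; i++) {
        if (tce_validate(list[i]))
            return -1;          /* nothing has been written yet */
        tmp[i] = list[i];
    }
    for (i = 0; i < n; i++)     /* all entries valid: commit */
        table[ioba_idx + i] = tmp[i];
    return 0;
}
```

With this shape a failing indirect put needs no cleanup pass, which is exactly why the cleanup code could be removed from the patch.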
Re: [PATCH 1/4] KVM: PPC: Add support for multiple-TCE hcalls
On 06/17/2013 06:40 PM, Alexander Graf wrote: On 17.06.2013, at 10:34, Alexey Kardashevskiy wrote: On 06/17/2013 06:02 PM, Alexander Graf wrote: On 17.06.2013, at 09:55, Alexey Kardashevskiy wrote: On 06/17/2013 08:06 AM, Alexander Graf wrote: On 05.06.2013, at 08:11, Alexey Kardashevskiy wrote: This adds real mode handlers for the H_PUT_TCE_INDIRECT and H_STUFF_TCE hypercalls for QEMU emulated devices such as IBMVIO devices or emulated PCI. These calls allow adding multiple entries (up to 512) into the TCE table in one call, which saves time on the transition to/from real mode. This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs (copied from user and verified) before writing the whole list into the TCE table. This cache will be utilized more in the upcoming VFIO/IOMMU support to continue TCE list processing in virtual mode in case the real mode handler fails for some reason. This adds a guest physical to host real address converter and calls the existing H_PUT_TCE handler. The converting function is going to be fully utilized by upcoming VFIO supporting patches. This also implements the KVM_CAP_PPC_MULTITCE capability, so in order to support the functionality of this patch, QEMU needs to query for this capability and set the hcall-multi-tce hypertas property only if the capability is present, otherwise there will be serious performance degradation. Cc: David Gibson da...@gibson.dropbear.id.au Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru Signed-off-by: Paul Mackerras pau...@samba.org Only a few minor nits. Ben already commented on implementation details. 
--- Changelog: 2013/06/05: * fixed a typo about IBMVIO in the commit message * updated doc and moved it to another section * changed capability number 2013/05/21: * added kvm_vcpu_arch::tce_tmp * removed cleanup if put_indirect failed, instead we do not even start writing to TCE table if we cannot get TCEs from the user and they are invalid * kvmppc_emulated_h_put_tce is split to kvmppc_emulated_put_tce and kvmppc_emulated_validate_tce (for the previous item) * fixed bug with fallthrough for H_IPI * removed all get_user() from real mode handlers * kvmppc_lookup_pte() added (instead of making lookup_linux_pte public) --- Documentation/virtual/kvm/api.txt | 17 ++ arch/powerpc/include/asm/kvm_host.h | 2 + arch/powerpc/include/asm/kvm_ppc.h | 16 +- arch/powerpc/kvm/book3s_64_vio.c | 118 ++ arch/powerpc/kvm/book3s_64_vio_hv.c | 266 +++ arch/powerpc/kvm/book3s_hv.c | 39 + arch/powerpc/kvm/book3s_hv_rmhandlers.S | 6 + arch/powerpc/kvm/book3s_pr_papr.c | 37 - arch/powerpc/kvm/powerpc.c | 3 + include/uapi/linux/kvm.h | 1 + 10 files changed, 473 insertions(+), 32 deletions(-) diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt index 5f91eda..6c082ff 100644 --- a/Documentation/virtual/kvm/api.txt +++ b/Documentation/virtual/kvm/api.txt @@ -2362,6 +2362,23 @@ calls by the guest for that service will be passed to userspace to be handled. +4.83 KVM_CAP_PPC_MULTITCE + +Capability: KVM_CAP_PPC_MULTITCE +Architectures: ppc +Type: vm + +This capability tells the guest that multiple TCE entry add/remove hypercalls +handling is supported by the kernel. This significantly accelerates DMA +operations for PPC KVM guests. + +Unlike other capabilities in this section, this one does not have an ioctl. +Instead, when the capability is present, the H_PUT_TCE_INDIRECT and +H_STUFF_TCE hypercalls are to be handled in the host kernel and not passed to +the guest. 
Otherwise it might be better for the guest to continue using H_PUT_TCE +hypercall (if KVM_CAP_SPAPR_TCE or KVM_CAP_SPAPR_TCE_IOMMU are present). While this describes perfectly well what the consequences are of the patches, it does not describe properly what the CAP actually expresses. The CAP only says this kernel is able to handle H_PUT_TCE_INDIRECT and H_STUFF_TCE hypercalls directly. All other consequences are nice to document, but the semantics of the CAP are missing. ? It expresses ability to handle 2 hcalls. What is missing? You don't describe the kvm <-> qemu interface. You describe some decisions qemu can take from this cap. This file does not mention qemu at all. And the interface is - qemu (or kvmtool could do that) just adds hcall-multi-tce to ibm,hypertas-functions, but this is for pseries linux, and AIX could always do it (no idea about it). Does it really have to be in this file? Ok, let's go back a step. What does this CAP describe? Don't look at the description you wrote above. Just write a new one. The CAP means the kernel is capable of handling hcalls A and B without passing those into the user space. That accelerates DMA. What exactly can user space expect when it finds
Re: [PATCH 2/4] powerpc: Prepare to support kernel handling of IOMMU map/unmap
On 06/16/2013 02:26 PM, Benjamin Herrenschmidt wrote: +#if defined(CONFIG_SPARSEMEM_VMEMMAP) || defined(CONFIG_FLATMEM) +int realmode_get_page(struct page *page) +{ +if (PageCompound(page)) +return -EAGAIN; + +get_page(page); + +return 0; +} +EXPORT_SYMBOL_GPL(realmode_get_page); + +int realmode_put_page(struct page *page) +{ +if (PageCompound(page)) +return -EAGAIN; + +if (!atomic_add_unless(&page->_count, -1, 1)) +return -EAGAIN; + +return 0; +} +EXPORT_SYMBOL_GPL(realmode_put_page); +#endif Several worries here, mostly that if the generic code ever changes (something gets added to get_page() that makes it no longer safe for use in real mode for example, or some other condition gets added to put_page()), we go out of sync and potentially end up with very hard and very subtle bugs. It might be worth making sure that: - This is reviewed by some generic VM people (and make sure they understand why we need to do that) - A comment is added to get_page() and put_page() to make sure that if they are changed in any way, double-check the impact on our realmode_get_page() (or ping us to make sure things are still ok). After changing get_page() to get_page_unless_zero(), the get_page API I use is: get_page_unless_zero() - basically atomic_inc_not_zero() atomic_add_unless() - just operates on the counter PageCompound() - checks if it is a huge page. No usage of get_page or put_page. If any of those changes, I would expect it to hit us immediately, no? So it may only make sense to add a comment to PageCompound(). But the comment says PageCompound is generally not used in hot code paths, and our path is hot. Heh. diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index 6d53675..c70a654 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -329,7 +329,8 @@ static inline void set_page_writeback(struct page *page) * System with lots of page flags available. 
This allows separate * flags for PageHead() and PageTail() checks of compound pages so that bit * tests can be used in performance sensitive paths. PageCompound is - * generally not used in hot code paths. + * generally not used in hot code paths except arch/powerpc/mm/init_64.c + * which uses it to detect huge pages and avoid handling those in real mode. */ __PAGEFLAG(Head, head) CLEARPAGEFLAG(Head, head) __PAGEFLAG(Tail, tail) So? -- Alexey -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
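The realmode_put_page() pattern discussed above — atomic_add_unless(&page->_count, -1, 1), i.e. drop a reference only if it is not the last one — can be sketched with C11 atomics. This is a user-space model of the semantics only, not the kernel implementation:

```c
#include <assert.h>
#include <stdatomic.h>

/* Decrement *v unless it currently equals `unless`; this mirrors the
 * kernel's atomic_add_unless(v, -1, 1) used by realmode_put_page(). */
static int atomic_dec_unless(atomic_int *v, int unless)
{
    int c = atomic_load(v);

    while (c != unless) {
        /* on failure the CAS reloads c, so the loop re-checks */
        if (atomic_compare_exchange_weak(v, &c, c - 1))
            return 1;   /* decremented */
    }
    return 0;           /* count was already `unless`: refuse */
}

/* realmode_put_page()-style helper: never drop the last reference in
 * real mode, where actually freeing the page would be unsafe. */
static int realmode_put(atomic_int *refcount)
{
    return atomic_dec_unless(refcount, 1) ? 0 : -11 /* -EAGAIN */;
}
```

The point of the pattern is that the slow "free the page" path can never be entered from real mode; the caller gets -EAGAIN and retries in virtual mode instead.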
Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
On 06/16/2013 02:39 PM, Benjamin Herrenschmidt wrote: static pte_t kvmppc_lookup_pte(pgd_t *pgdir, unsigned long hva, bool writing, -unsigned long *pte_sizep) +unsigned long *pte_sizep, bool do_get_page) { pte_t *ptep; unsigned int shift = 0; @@ -135,6 +136,14 @@ static pte_t kvmppc_lookup_pte(pgd_t *pgdir, unsigned long hva, bool writing, if (!pte_present(*ptep)) return __pte(0); +/* + * Put huge pages handling to the virtual mode. + * The only exception is for TCE list pages which we + * do need to call get_page() for. + */ +if ((*pte_sizep > PAGE_SIZE) && do_get_page) +return __pte(0); + /* wait until _PAGE_BUSY is clear then set it atomically */ __asm__ __volatile__ ( "1: ldarx %0,0,%3\n" @@ -148,6 +157,18 @@ static pte_t kvmppc_lookup_pte(pgd_t *pgdir, unsigned long hva, bool writing, : "cc"); ret = pte; +if (do_get_page && pte_present(pte) && (!writing || pte_write(pte))) { +struct page *pg = NULL; +pg = realmode_pfn_to_page(pte_pfn(pte)); +if (realmode_get_page(pg)) { +ret = __pte(0); +} else { +pte = pte_mkyoung(pte); +if (writing) +pte = pte_mkdirty(pte); +} +} +*ptep = pte;/* clears _PAGE_BUSY */ return ret; } So now you are adding the clearing of _PAGE_BUSY that was missing for your first patch, except that this is not enough since that means that in the emulated case (ie, !do_get_page) you will in essence return and then use a PTE that is not locked without any synchronization to ensure that the underlying page doesn't go away... then you'll dereference that page. So either make everything use speculative get_page, or make the emulated case use the MMU notifier to drop the operation in case of collision. The former looks easier. Also, any specific reason why you do: - Lock the PTE - get_page() - Unlock the PTE Instead of - Read the PTE - get_page_unless_zero - re-check PTE Like get_user_pages_fast() does ? The former will be two atomic ops, the latter only one (faster), but maybe you have a good reason why that can't work... 
If we want to set the dirty and young bits for the PTE, then I do not know how to avoid _PAGE_BUSY. -- Alexey
Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
On 06/20/2013 01:49 AM, Alex Williamson wrote: On Thu, 2013-06-20 at 00:50 +1000, Benjamin Herrenschmidt wrote: On Wed, 2013-06-19 at 11:58 +0200, Alexander Graf wrote: Alex, any objection ? Which Alex? :) Heh, mostly Williamson in this specific case but your input is still welcome :-) I think validate works, it keeps iteration logic out of the kernel which is a good thing. There still needs to be an interface for getting the iommu id in VFIO, but I suppose that one's for the other Alex and Jörg to comment on. I think getting the iommu fd is already covered by separate patches from Alexey. Do we need to make it a get/put interface instead ? vfio_validate_and_use_iommu(file, iommu_id); vfio_release_iommu(file, iommu_id); To ensure that the resource remains owned by the process until KVM is closed as well ? Or do we want to register with VFIO with a callback so that VFIO can call us if it needs us to give it up ? Can't we just register a handler on the fd and get notified when it closes? Can you kill VFIO access without closing the fd? That sounds actually harder :-) The question is basically: When we validate that relationship between a specific VFIO struct file with an iommu, what is the lifetime of that and how do we handle this lifetime properly. There's two ways for that sort of situation: The notification model where we get notified when the relationship is broken, and the refcount model where we become a user and thus delay the breaking of the relationship until we have been disposed of as well. In this specific case, it's hard to tell what is the right model from my perspective, which is why I would welcome Alex (W.) input. In the end, the solution will end up being in the form of APIs exposed by VFIO for use by KVM (via that symbol lookup mechanism) so Alex (W), as owner of VFIO at this stage, what do you want those to look like ? :-) My first thought is that we should use the same reference counting as we have for vfio devices (group-container_users). 
An interface for that might look like: int vfio_group_add_external_user(struct file *filep) { struct vfio_group *group = filep->private_data; if (filep->f_op != &vfio_group_fops) return -EINVAL; if (!atomic_inc_not_zero(&group->container_users)) return -EINVAL; return 0; } void vfio_group_del_external_user(struct file *filep) { struct vfio_group *group = filep->private_data; BUG_ON(filep->f_op != &vfio_group_fops); vfio_group_try_dissolve_container(group); } int vfio_group_iommu_id_from_file(struct file *filep) { struct vfio_group *group = filep->private_data; BUG_ON(filep->f_op != &vfio_group_fops); return iommu_group_id(group->iommu_group); } Would that work? Thanks, Just out of curiosity - would not get_file() and fput_atomic() on a group's file* do the right job instead of vfio_group_add_external_user() and vfio_group_del_external_user()? -- Alexey
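The inc-not-zero semantics of the proposed vfio_group_add_external_user() — become a user only while the group already has container users, so the container cannot be dissolved underneath KVM — boil down to a small atomic pattern. A toy user-space model, with invented names standing in for the vfio_group internals:

```c
#include <assert.h>
#include <stdatomic.h>

/* Invented stand-in for struct vfio_group; only the counter matters. */
struct toy_group {
    atomic_int container_users;  /* 0 = no container/iommu attached */
};

/* inc-not-zero: an external user may only join while the group already
 * has container users, i.e. while an IOMMU is attached. */
static int toy_add_external_user(struct toy_group *g)
{
    int c = atomic_load(&g->container_users);

    while (c != 0)
        if (atomic_compare_exchange_weak(&g->container_users, &c, c + 1))
            return 0;
    return -22;  /* -EINVAL: group not attached to an iommu */
}

static void toy_del_external_user(struct toy_group *g)
{
    atomic_fetch_sub(&g->container_users, 1);
}
```

This captures the difference from plain get_file(): holding the file only pins the group object, while incrementing container_users pins the group-to-IOMMU attachment itself.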
Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
On 06/20/2013 05:47 PM, Benjamin Herrenschmidt wrote: On Thu, 2013-06-20 at 15:28 +1000, David Gibson wrote: Just out of curiosity - would not get_file() and fput_atomic() on a group's file* do the right job instead of vfio_group_add_external_user() and vfio_group_del_external_user()? I was thinking that too. Grabbing a file reference would certainly be the usual way of handling this sort of thing. But that wouldn't prevent the group ownership from being returned to the kernel or another user, would it? Holding the file pointer does not let the group->container_users counter go to zero, and this is exactly what vfio_group_add_external_user() and vfio_group_del_external_user() do. The difference is only in absolute value - 2 vs. 3. No change in behaviour whether I use the new vfio API or simply hold the file* till KVM closes the fd created when the IOMMU was connected to the LIOBN. And while this counter is not zero, QEMU cannot take ownership over the group. I am definitely still missing the bigger picture... -- Alexey
Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
On 06/21/2013 12:55 AM, Alex Williamson wrote: On Thu, 2013-06-20 at 18:48 +1000, Alexey Kardashevskiy wrote: On 06/20/2013 05:47 PM, Benjamin Herrenschmidt wrote: On Thu, 2013-06-20 at 15:28 +1000, David Gibson wrote: Just out of curiosity - would not get_file() and fput_atomic() on a group's file* do the right job instead of vfio_group_add_external_user() and vfio_group_del_external_user()? I was thinking that too. Grabbing a file reference would certainly be the usual way of handling this sort of thing. But that wouldn't prevent the group ownership from being returned to the kernel or another user, would it? Holding the file pointer does not let the group->container_users counter go to zero How so? Holding the file pointer means the file won't go away, which means the group release function won't be called. That means the group won't go away, but that doesn't mean it's attached to an IOMMU. A user could call UNSET_CONTAINER. and this is exactly what vfio_group_add_external_user() and vfio_group_del_external_user() do. The difference is only in absolute value - 2 vs. 3. No change in behaviour whether I use the new vfio API or simply hold the file* till KVM closes the fd created when the IOMMU was connected to the LIOBN. By that notion you could open(/dev/vfio/$GROUP) and you're safe, right? But what about SET_CONTAINER and SET_IOMMU? All that you guarantee holding the file pointer is that the vfio_group exists. And while this counter is not zero, QEMU cannot take ownership over the group. I am definitely still missing the bigger picture... The bigger picture is that the group needs to exist AND it needs to be set up and maintained to have IOMMU protection. Actually, my first stab at add_external_user doesn't look sufficient, it needs to look more like vfio_group_get_device_fd, checking group->container->iommu and group_viable(). This makes sense. If you did this, that would be great. Without it, I really cannot see how the proposed inc/dec of container_users is better than simply holding the file*. 
Thanks. As written it would allow an external user after SET_CONTAINER without SET_IOMMU. It should also be part of the API that the external user must hold the file reference between add_external_user and del_external_user and do cleanup on any exit paths. Thanks, -- Alexey
[PATCH 0/8 v4] KVM: PPC: IOMMU in-kernel handling
The changes are: 1. rebased on v3.10-rc7 2. removed spinlocks from real mode 3. added security checks between KVM and VFIO More details are in the individual patch comments. Alexey Kardashevskiy (8): KVM: PPC: reserve a capability number for multitce support KVM: PPC: reserve a capability and ioctl numbers for realmode VFIO vfio: add external user support hashtable: add hash_for_each_possible_rcu_notrace() powerpc: Prepare to support kernel handling of IOMMU map/unmap KVM: PPC: Add support for multiple-TCE hcalls KVM: PPC: Add support for IOMMU in-kernel handling KVM: PPC: Add hugepage support for IOMMU in-kernel handling Documentation/virtual/kvm/api.txt | 51 +++ arch/powerpc/include/asm/kvm_host.h | 31 ++ arch/powerpc/include/asm/kvm_ppc.h | 18 +- arch/powerpc/include/asm/pgtable-ppc64.h | 4 + arch/powerpc/include/uapi/asm/kvm.h | 8 + arch/powerpc/kvm/book3s_64_vio.c | 506 +- arch/powerpc/kvm/book3s_64_vio_hv.c | 439 -- arch/powerpc/kvm/book3s_hv.c | 41 ++- arch/powerpc/kvm/book3s_hv_rmhandlers.S | 6 + arch/powerpc/kvm/book3s_pr_papr.c | 37 ++- arch/powerpc/kvm/powerpc.c | 15 + arch/powerpc/mm/init_64.c | 78 - drivers/vfio/vfio.c | 53 include/linux/hashtable.h | 15 + include/linux/page-flags.h | 4 +- include/uapi/linux/kvm.h | 3 + 16 files changed, 1279 insertions(+), 30 deletions(-) -- 1.7.10.4
[PATCH 7/8] KVM: PPC: Add support for IOMMU in-kernel handling
This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT and H_STUFF_TCE requests without passing them to QEMU, which saves time on switching to QEMU and back. Both real and virtual modes are supported. First the kernel tries to handle a TCE request in real mode; if that fails, it passes the request to virtual mode to complete the operation. If the virtual mode handler also fails, the request is passed to user mode. This adds a new KVM_CAP_SPAPR_TCE_IOMMU ioctl to associate a virtual PCI bus ID (LIOBN) with an IOMMU group, which enables in-kernel handling of IOMMU map/unmap. The external user API support in VFIO is required. Tests show that this patch increases transmission speed from 220MB/s to 750..1020MB/s on a 10Gb network (Chelsio CXGB3 10Gb ethernet card). Signed-off-by: Paul Mackerras pau...@samba.org Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru --- Changes: 2013/06/27: * tce_list page is referenced now in order to protect it from accidental invalidation during H_PUT_TCE_INDIRECT execution * added use of the external user VFIO API 2013/06/05: * changed capability number * changed ioctl number * update the doc article number 2013/05/20: * removed get_user() from real mode handlers * kvm_vcpu_arch::tce_tmp usage extended. 
Now real mode handler puts there translated TCEs, tries realmode_get_page() on those and if it fails, it passes control over the virtual mode handler which tries to finish the request handling * kvmppc_lookup_pte() now does realmode_get_page() protected by BUSY bit on a page * The only reason to pass the request to user mode now is when the user mode did not register TCE table in the kernel, in all other cases the virtual mode handler is expected to do the job --- Documentation/virtual/kvm/api.txt | 26 arch/powerpc/include/asm/kvm_host.h |4 + arch/powerpc/include/asm/kvm_ppc.h |2 + arch/powerpc/include/uapi/asm/kvm.h |8 + arch/powerpc/kvm/book3s_64_vio.c| 294 ++- arch/powerpc/kvm/book3s_64_vio_hv.c | 165 arch/powerpc/kvm/powerpc.c | 12 ++ 7 files changed, 509 insertions(+), 2 deletions(-) diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt index 762c703..01b0dc2 100644 --- a/Documentation/virtual/kvm/api.txt +++ b/Documentation/virtual/kvm/api.txt @@ -2387,6 +2387,32 @@ slows operations a lot. Unlike other capabilities of this section, this one is always enabled. +4.87 KVM_CREATE_SPAPR_TCE_IOMMU + +Capability: KVM_CAP_SPAPR_TCE_IOMMU +Architectures: powerpc +Type: vm ioctl +Parameters: struct kvm_create_spapr_tce_iommu (in) +Returns: 0 on success, -1 on error + +struct kvm_create_spapr_tce_iommu { + __u64 liobn; + __u32 iommu_id; + __u32 flags; +}; + +This creates a link between IOMMU group and a hardware TCE (translation +control entry) table. This link lets the host kernel know what IOMMU +group (i.e. TCE table) to use for the LIOBN number passed with +H_PUT_TCE, H_PUT_TCE_INDIRECT, H_STUFF_TCE hypercalls. + +In response to a TCE hypercall, the kernel looks for a TCE table descriptor +in the list and handles the hypercall in real or virtual modes if +the descriptor is found. Otherwise the hypercall is passed to the user mode. + +No flag is supported at the moment. + + 5. 
The kvm_run structure diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h index 3bf407b..716ab18 100644 --- a/arch/powerpc/include/asm/kvm_host.h +++ b/arch/powerpc/include/asm/kvm_host.h @@ -180,6 +180,8 @@ struct kvmppc_spapr_tce_table { struct kvm *kvm; u64 liobn; u32 window_size; + struct iommu_group *grp;/* used for IOMMU groups */ + struct file *vfio_filp; /* used for IOMMU groups */ struct page *pages[0]; }; @@ -611,6 +613,8 @@ struct kvm_vcpu_arch { u64 busy_preempt; unsigned long *tce_tmp;/* TCE cache for TCE_PUT_INDIRECT hcall */ + unsigned long tce_tmp_num; /* Number of handled TCEs in the cache */ + unsigned long tce_reason; /* The reason of switching to the virtmode */ #endif }; diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h index e852921b..934e01d 100644 --- a/arch/powerpc/include/asm/kvm_ppc.h +++ b/arch/powerpc/include/asm/kvm_ppc.h @@ -133,6 +133,8 @@ extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu); extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm, struct kvm_create_spapr_tce *args); +extern long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm, + struct kvm_create_spapr_tce_iommu *args); extern struct kvmppc_spapr_tce_table *kvmppc_find_tce_table( struct kvm_vcpu *vcpu, unsigned long liobn); extern long kvmppc_emulated_validate_tce(unsigned long tce); diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch
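One property of the kvm_create_spapr_tce_iommu argument struct documented above is worth noting: it packs to 16 bytes with no implicit padding (one u64 followed by two u32s), so the ioctl ABI is identical for 32-bit and 64-bit user space. A quick self-contained check, using stdint types in place of the kernel's __u64/__u32:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mirror of the uapi struct from the patch, with stdint types standing
 * in for the kernel's __u64/__u32. */
struct kvm_create_spapr_tce_iommu {
    uint64_t liobn;     /* virtual PCI bus ID */
    uint32_t iommu_id;  /* IOMMU group id */
    uint32_t flags;     /* no flags defined yet, must be 0 */
};
```

Putting the u64 first is what avoids padding; with the u32 pair first, some ABIs would insert 4 bytes of tail padding before a trailing u64.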
[PATCH 8/8] KVM: PPC: Add hugepage support for IOMMU in-kernel handling
This adds special support for huge pages (16MB). The reference counting cannot be easily done for such pages in real mode (when MMU is off) so we added a list of huge pages. It is populated in virtual mode and get_page is called just once per huge page. Real mode handlers check if the requested page is huge and in the list; if so, no reference counting is done, otherwise an exit to virtual mode happens. The list is released at KVM exit. At the moment the fastest card available for tests uses up to 9 huge pages so walking through this list is not very expensive. However this can change and we may want to optimize this. Signed-off-by: Paul Mackerras pau...@samba.org Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru --- Changes: 2013/06/27: * list of huge pages replaced with a hashtable for better performance * spinlock removed from real mode and only protects insertion of new huge page descriptors into the hashtable 2013/06/05: * fixed compile error when CONFIG_IOMMU_API=n 2013/05/20: * the real mode handler now searches for a huge page by gpa (used to be pte) * the virtual mode handler prints a warning if it is called twice for the same huge page as the real mode handler is expected to fail just once - when a huge page is not in the list yet. * the huge page is refcounted twice - when added to the hugepage list and when used in the virtual mode hcall handler (can be optimized but it will make the patch less nice). 
--- arch/powerpc/include/asm/kvm_host.h | 25 + arch/powerpc/kvm/book3s_64_vio.c | 95 +-- arch/powerpc/kvm/book3s_64_vio_hv.c | 24 +++-- 3 files changed, 138 insertions(+), 6 deletions(-) diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h index 716ab18..0ad6189 100644 --- a/arch/powerpc/include/asm/kvm_host.h +++ b/arch/powerpc/include/asm/kvm_host.h @@ -30,6 +30,7 @@ #include <linux/kvm_para.h> #include <linux/list.h> #include <linux/atomic.h> +#include <linux/hashtable.h> #include <asm/kvm_asm.h> #include <asm/processor.h> #include <asm/page.h> @@ -182,9 +183,33 @@ struct kvmppc_spapr_tce_table { u32 window_size; struct iommu_group *grp;/* used for IOMMU groups */ struct file *vfio_filp; /* used for IOMMU groups */ + DECLARE_HASHTABLE(hash_tab, ilog2(64)); /* used for IOMMU groups */ + spinlock_t hugepages_write_lock;/* used for IOMMU groups */ struct page *pages[0]; }; +/* + * The KVM guest can be backed with 16MB pages. + * In this case, we cannot do page counting from the real mode + * as the compound pages are used - they are linked in a list + * with pointers as virtual addresses which are inaccessible + * in real mode. + * + * The code below keeps a 16MB pages list and uses page struct + * in real mode if it is already locked in RAM and inserted into + * the list or switches to the virtual mode where it can be + * handled in a usual manner. 
+ */ +#define KVMPPC_HUGEPAGE_HASH(gpa) hash_32(gpa >> 24, 32) + +struct kvmppc_iommu_hugepage { + struct hlist_node hash_node; + unsigned long gpa; /* Guest physical address */ + unsigned long hpa; /* Host physical address */ + struct page *page; /* page struct of the very first subpage */ + unsigned long size; /* Huge page size (always 16MB at the moment) */ +}; + struct kvmppc_linear_info { void *base_virt; unsigned long base_pfn; diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c index a5d0195..6cedfe9 100644 --- a/arch/powerpc/kvm/book3s_64_vio.c +++ b/arch/powerpc/kvm/book3s_64_vio.c @@ -47,6 +47,78 @@ #define TCES_PER_PAGE (PAGE_SIZE / sizeof(u64)) #define ERROR_ADDR ((void *)~(unsigned long)0x0) +#ifdef CONFIG_IOMMU_API +/* Adds a new huge page descriptor to the hashtable */ +static long kvmppc_iommu_hugepage_try_add( + struct kvmppc_spapr_tce_table *tt, + pte_t pte, unsigned long hva, unsigned long gpa, + unsigned long pg_size) +{ + long ret = 0; + struct kvmppc_iommu_hugepage *hp; + struct page *pg; + unsigned key = KVMPPC_HUGEPAGE_HASH(gpa); + + spin_lock(&tt->hugepages_write_lock); + hash_for_each_possible_rcu(tt->hash_tab, hp, hash_node, key) { + if (KVMPPC_HUGEPAGE_HASH(hp->gpa) != key) + continue; + if ((gpa < hp->gpa) || (gpa >= hp->gpa + hp->size)) + continue; + goto unlock_exit; + } + + hva = hva & ~(pg_size - 1); + ret = get_user_pages_fast(hva, 1, true/*write*/, &pg); + if ((ret != 1) || !pg) { + ret = -EFAULT; + goto unlock_exit; + } + ret = 0; + + hp = kzalloc(sizeof(*hp), GFP_KERNEL); + if (!hp) { + ret = -ENOMEM; + goto unlock_exit; + } + + hp->page = pg; + hp
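KVMPPC_HUGEPAGE_HASH keys the table on gpa >> 24, the 16MB frame number, so every guest physical address inside one huge page hashes to the same bucket and the real mode lookup is a short chain walk. A user-space sketch; toy_hash_32() and its multiplier are stand-ins for the kernel's hash_32() of that era (the exact constant is an assumption, not a guarantee):

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the kernel's hash_32(); 0x9e370001 was
 * GOLDEN_RATIO_PRIME_32 in kernels of this vintage. */
static uint32_t toy_hash_32(uint32_t val, unsigned int bits)
{
    return (uint32_t)(val * 0x9e370001u) >> (32 - bits);
}

#define HUGEPAGE_SHIFT 24  /* 16MB huge pages */
#define HUGEPAGE_HASH(gpa) \
    toy_hash_32((uint32_t)((gpa) >> HUGEPAGE_SHIFT), 32)
```

Because the bucket depends only on the huge page frame, the real mode handler can find a descriptor for any address inside the page without walking unrelated entries.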
[PATCH 5/8] powerpc: Prepare to support kernel handling of IOMMU map/unmap
The current VFIO-on-POWER implementation supports only user mode driven mapping, i.e. QEMU is sending requests to map/unmap pages. However this approach is really slow, so we want to move that to KVM. Since H_PUT_TCE can be extremely performance sensitive (especially with network adapters where each packet needs to be mapped/unmapped) we chose to implement that as a fast hypercall directly in real mode (processor still in the guest context but MMU off). To be able to do that, we need to provide some facilities to access the struct page count within that real mode environment as things like the sparsemem vmemmap mappings aren't accessible. This adds an API to increment/decrement page counter as get_user_pages API used for user mode mapping does not work in the real mode. CONFIG_SPARSEMEM_VMEMMAP and CONFIG_FLATMEM are supported. Reviewed-by: Paul Mackerras pau...@samba.org Cc: David Gibson da...@gibson.dropbear.id.au Signed-off-by: Paul Mackerras pau...@samba.org Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru --- Changes: 2013/06/27: * realmode_get_page() fixed to use get_page_unless_zero(). If failed, the call will be passed from real to virtual mode and safely handled. * added comment to PageCompound() in include/linux/page-flags.h. 
2013/05/20: * PageTail() is replaced by PageCompound() in order to have the same checks for whether the page is huge in realmode_get_page() and realmode_put_page() --- arch/powerpc/include/asm/pgtable-ppc64.h | 4 ++ arch/powerpc/mm/init_64.c | 78 +- include/linux/page-flags.h | 4 +- 3 files changed, 84 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h b/arch/powerpc/include/asm/pgtable-ppc64.h index e3d55f6f..7b46e5f 100644 --- a/arch/powerpc/include/asm/pgtable-ppc64.h +++ b/arch/powerpc/include/asm/pgtable-ppc64.h @@ -376,6 +376,10 @@ static inline pte_t *find_linux_pte_or_hugepte(pgd_t *pgdir, unsigned long ea, } #endif /* !CONFIG_HUGETLB_PAGE */ +struct page *realmode_pfn_to_page(unsigned long pfn); +int realmode_get_page(struct page *page); +int realmode_put_page(struct page *page); + #endif /* __ASSEMBLY__ */ #endif /* _ASM_POWERPC_PGTABLE_PPC64_H_ */ diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c index a90b9c4..7031be3 100644 --- a/arch/powerpc/mm/init_64.c +++ b/arch/powerpc/mm/init_64.c @@ -297,5 +297,81 @@ void vmemmap_free(unsigned long start, unsigned long end) { } -#endif /* CONFIG_SPARSEMEM_VMEMMAP */ +/* + * We do not have access to the sparsemem vmemmap, so we fall back to + * walking the list of sparsemem blocks which we already maintain for + * the sake of crashdump. In the long run, we might want to maintain + * a tree if performance of that linear walk becomes a problem. + * + * Any of realmode_ functions can fail due to: + * 1) As real sparsemem blocks do not lie in RAM contiguously (they + * are in virtual address space which is not available in the real mode), + * the requested page struct can be split between blocks so get_page/put_page + * may fail. + * 2) When huge pages are used, the get_page/put_page API will fail + * in real mode as the linked addresses in the page struct are virtual + * too. 
+ * When 1) or 2) takes place, the API returns an error code to cause + * an exit to kernel virtual mode where the operation will be completed. + */ +struct page *realmode_pfn_to_page(unsigned long pfn) +{ + struct vmemmap_backing *vmem_back; + struct page *page; + unsigned long page_size = 1 << mmu_psize_defs[mmu_vmemmap_psize].shift; + unsigned long pg_va = (unsigned long) pfn_to_page(pfn); + + for (vmem_back = vmemmap_list; vmem_back; vmem_back = vmem_back->list) { + if (pg_va < vmem_back->virt_addr) + continue; + /* Check that page struct is not split between real pages */ + if ((pg_va + sizeof(struct page)) > (vmem_back->virt_addr + page_size)) + return NULL; + + page = (struct page *) (vmem_back->phys + pg_va - + vmem_back->virt_addr); + return page; + } + + return NULL; +} +EXPORT_SYMBOL_GPL(realmode_pfn_to_page); + +#elif defined(CONFIG_FLATMEM) + +struct page *realmode_pfn_to_page(unsigned long pfn) +{ + struct page *page = pfn_to_page(pfn); + return page; +} +EXPORT_SYMBOL_GPL(realmode_pfn_to_page); + +#endif /* CONFIG_SPARSEMEM_VMEMMAP/CONFIG_FLATMEM */ + +#if defined(CONFIG_SPARSEMEM_VMEMMAP) || defined(CONFIG_FLATMEM) +int realmode_get_page(struct page *page) +{ + if (PageCompound(page)) + return -EAGAIN; + + if (!get_page_unless_zero(page)) + return -EAGAIN; + + return 0; +} +EXPORT_SYMBOL_GPL(realmode_get_page); + +int realmode_put_page(struct page *page) +{ + if (PageCompound(page)) + return -EAGAIN
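realmode_pfn_to_page() above is a linear walk over the vmemmap_list blocks that translates the virtual address of a page struct into its physical backing, refusing any page struct that straddles a block boundary. The same logic can be modeled in user space; the block list, addresses, and sizes here are invented for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Invented model of a vmemmap_backing block: a virtually mapped chunk
 * of page structs with a known physical backing. */
struct toy_vmem_block {
    uint64_t virt_addr;
    uint64_t phys;
    struct toy_vmem_block *next;
};

#define TOY_BLOCK_SIZE  0x10000ULL  /* stand-in for the vmemmap page size */
#define TOY_STRUCT_PAGE 64ULL       /* assumed sizeof(struct page) */

/* Models realmode_pfn_to_page(): walk the block list, refuse a page
 * struct split across two blocks, translate virtual to physical. */
static uint64_t toy_realmode_lookup(const struct toy_vmem_block *list,
                                    uint64_t pg_va)
{
    const struct toy_vmem_block *b;

    for (b = list; b; b = b->next) {
        if (pg_va < b->virt_addr ||
            pg_va >= b->virt_addr + TOY_BLOCK_SIZE)
            continue;
        if (pg_va + TOY_STRUCT_PAGE > b->virt_addr + TOY_BLOCK_SIZE)
            return 0;  /* struct page split between real pages */
        return b->phys + (pg_va - b->virt_addr);
    }
    return 0;          /* not found: exit to virtual mode */
}
```

The split check is the interesting part: a page struct whose bytes cross two non-contiguous real blocks cannot be touched in real mode, so the lookup fails and the hcall is retried in virtual mode.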
[PATCH 6/8] KVM: PPC: Add support for multiple-TCE hcalls
This adds real mode handlers for the H_PUT_TCE_INDIRECT and H_STUFF_TCE hypercalls for QEMU emulated devices such as IBMVIO devices or emulated PCI. These calls allow adding multiple entries (up to 512) to the TCE table in one call, which saves time on the transition to/from real mode. This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs (copied from user space and verified) before writing the whole list into the TCE table. This cache will be utilized more in the upcoming VFIO/IOMMU support to continue TCE list processing in virtual mode in case the real mode handler fails for some reason. This adds a guest physical to host real address converter and calls the existing H_PUT_TCE handler. The converting function is going to be fully utilized by upcoming VFIO support patches. This also implements the KVM_CAP_PPC_MULTITCE capability, so in order to support the functionality of this patch, QEMU needs to query for this capability and set the hcall-multi-tce hypertas property only if the capability is present; otherwise there will be serious performance degradation. 
Signed-off-by: Paul Mackerras pau...@samba.org Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru --- Changelog: 2013/06/27: * fixed clear of BUSY bit in kvmppc_lookup_pte() * H_PUT_TCE_INDIRECT does realmode_get_page() now * KVM_CAP_SPAPR_MULTITCE now depends on CONFIG_PPC_BOOK3S_64 * updated doc 2013/06/05: * fixed typo about IBMVIO in the commit message * updated doc and moved it to another section * changed capability number 2013/05/21: * added kvm_vcpu_arch::tce_tmp * removed cleanup if put_indirect failed; instead we do not even start writing to the TCE table if we cannot get TCEs from the user or they are invalid * kvmppc_emulated_h_put_tce is split into kvmppc_emulated_put_tce and kvmppc_emulated_validate_tce (for the previous item) * fixed bug with fallthrough for H_IPI * removed all get_user() from real mode handlers * kvmppc_lookup_pte() added (instead of making lookup_linux_pte public) --- Documentation/virtual/kvm/api.txt | 25 +++ arch/powerpc/include/asm/kvm_host.h |2 + arch/powerpc/include/asm/kvm_ppc.h | 16 +- arch/powerpc/kvm/book3s_64_vio.c| 123 ++ arch/powerpc/kvm/book3s_64_vio_hv.c | 270 +++ arch/powerpc/kvm/book3s_hv.c| 41 - arch/powerpc/kvm/book3s_hv_rmhandlers.S |6 + arch/powerpc/kvm/book3s_pr_papr.c | 37 - arch/powerpc/kvm/powerpc.c |3 + 9 files changed, 490 insertions(+), 33 deletions(-) diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt index 6365fef..762c703 100644 --- a/Documentation/virtual/kvm/api.txt +++ b/Documentation/virtual/kvm/api.txt @@ -2362,6 +2362,31 @@ calls by the guest for that service will be passed to userspace to be handled. +4.86 KVM_CAP_PPC_MULTITCE + +Capability: KVM_CAP_PPC_MULTITCE +Architectures: ppc +Type: vm + +This capability means the kernel is capable of handling hypercalls +H_PUT_TCE_INDIRECT and H_STUFF_TCE without passing those into the user +space. This significantly accelerates DMA operations for PPC KVM guests. 
+The user space should expect that its handlers for these hypercalls +are not going to be called. + +In order to enable H_PUT_TCE_INDIRECT and H_STUFF_TCE use in the guest, +the user space might have to advertise it for the guest. For example, +an IBM pSeries guest starts using them if hcall-multi-tce is present in +the ibm,hypertas-functions device-tree property. + +Without this capability, only H_PUT_TCE is handled by the kernel and +therefore the use of H_PUT_TCE_INDIRECT and H_STUFF_TCE is not recommended +unless the capability is present, as passing hypercalls to the userspace +slows operations a lot. + +Unlike other capabilities of this section, this one is always enabled. + + 5. The kvm_run structure diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h index af326cd..3bf407b 100644 --- a/arch/powerpc/include/asm/kvm_host.h +++ b/arch/powerpc/include/asm/kvm_host.h @@ -609,6 +609,8 @@ struct kvm_vcpu_arch { spinlock_t tbacct_lock; u64 busy_stolen; u64 busy_preempt; + + unsigned long *tce_tmp;/* TCE cache for TCE_PUT_INDIRECT hcall */ #endif }; diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h index a5287fe..e852921b 100644 --- a/arch/powerpc/include/asm/kvm_ppc.h +++ b/arch/powerpc/include/asm/kvm_ppc.h @@ -133,8 +133,20 @@ extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu); extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm, struct kvm_create_spapr_tce *args); -extern long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn, -unsigned long ioba, unsigned long tce); +extern struct kvmppc_spapr_tce_table *kvmppc_find_tce_table( + struct kvm_vcpu *vcpu
[PATCH 4/8] hashtable: add hash_for_each_possible_rcu_notrace()
This adds hash_for_each_possible_rcu_notrace(), which is basically a notrace clone of hash_for_each_possible_rcu(); the latter cannot be used in real mode due to its tracing/debugging capability. Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru --- include/linux/hashtable.h | 15 +++ 1 file changed, 15 insertions(+) diff --git a/include/linux/hashtable.h b/include/linux/hashtable.h index a9df51f..af8b169 100644 --- a/include/linux/hashtable.h +++ b/include/linux/hashtable.h @@ -174,6 +174,21 @@ static inline void hash_del_rcu(struct hlist_node *node) member) /** + * hash_for_each_possible_rcu_notrace - iterate over all possible objects hashing + * to the same bucket in an RCU-enabled hashtable + * @name: hashtable to iterate + * @obj: the type * to use as a loop cursor for each entry + * @member: the name of the hlist_node within the struct + * @key: the key of the objects to iterate over + * + * This is the same as hash_for_each_possible_rcu() except that it does + * not do any RCU debugging or tracing. + */ +#define hash_for_each_possible_rcu_notrace(name, obj, member, key) \ + hlist_for_each_entry_rcu_notrace(obj, &name[hash_min(key, HASH_BITS(name))],\ + member) + +/** * hash_for_each_possible_safe - iterate over all possible objects hashing to the * same bucket safe against removals * @name: hashtable to iterate -- 1.7.10.4 -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/8] KVM: PPC: reserve a capability number for multitce support
Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru --- include/uapi/linux/kvm.h |1 + 1 file changed, 1 insertion(+) diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index d88c8ee..970b1f5 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -666,6 +666,7 @@ struct kvm_ppc_smmu_info { #define KVM_CAP_IRQ_MPIC 90 #define KVM_CAP_PPC_RTAS 91 #define KVM_CAP_IRQ_XICS 92 +#define KVM_CAP_SPAPR_MULTITCE 93 #ifdef KVM_CAP_IRQ_ROUTING -- 1.7.10.4 -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3/8] vfio: add external user support
VFIO is designed to be used via ioctls on file descriptors returned by VFIO. However, in some situations support for an external user is required. The first user is KVM on PPC64 (SPAPR TCE protocol) which is going to use the existing VFIO groups for exclusive access in real/virtual mode in the host kernel to avoid passing map/unmap requests to the user space, which would make things pretty slow. The proposed protocol includes: 1. do normal VFIO init stuff such as opening a new container, attaching group(s) to it, setting an IOMMU driver for a container. When IOMMU is set for a container, all groups in it are considered ready to use by an external user. 2. pass a fd of the group we want to accelerate to KVM. KVM calls vfio_group_iommu_id_from_file() to verify if the group is initialized and IOMMU is set for it. The current TCE IOMMU driver marks the whole IOMMU table as busy when IOMMU is set for a container, which prevents other DMA users from allocating from it, so it is safe to pass the group to the user space. 3. KVM increases the container users counter via vfio_group_add_external_user(). This prevents the VFIO group from being disposed of prior to exiting KVM. 4. When KVM is finished and doing cleanup, it releases the group file and decrements the container users counter. Everything gets released. 5. KVM also keeps the group file open; otherwise its fd might have been closed by the time KVM finishes, making the vfio_group_del_external_user() call impossible. The vfio: Limit group opens patch is also required for consistency. Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru --- drivers/vfio/vfio.c | 53 +++ 1 file changed, 53 insertions(+) diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c index c488da5..54192b2 100644 --- a/drivers/vfio/vfio.c +++ b/drivers/vfio/vfio.c @@ -1370,6 +1370,59 @@ static const struct file_operations vfio_device_fops = { }; /** + * External user API, exported by symbols to be linked dynamically. 
+ */ + +/* Allows an external user (for example, KVM) to lock an IOMMU group */ +static int vfio_group_add_external_user(struct file *filep) +{ + struct vfio_group *group = filep->private_data; + + if (filep->f_op != &vfio_group_fops) + return -EINVAL; + + if (!atomic_inc_not_zero(&group->container_users)) + return -EINVAL; + + return 0; +} +EXPORT_SYMBOL_GPL(vfio_group_add_external_user); + +/* Allows an external user (for example, KVM) to unlock an IOMMU group */ +static void vfio_group_del_external_user(struct file *filep) +{ + struct vfio_group *group = filep->private_data; + + BUG_ON(filep->f_op != &vfio_group_fops); + + vfio_group_try_dissolve_container(group); +} +EXPORT_SYMBOL_GPL(vfio_group_del_external_user); + +/* + * Checks if a group for the specified file can be used by + * an external user and returns the IOMMU ID if external use is possible. + */ +static int vfio_group_iommu_id_from_file(struct file *filep) +{ + int ret; + struct vfio_group *group = filep->private_data; + + if (WARN_ON(filep->f_op != &vfio_group_fops)) + return -EINVAL; + + if (0 == atomic_read(&group->container_users) || + !group->container->iommu_driver || + !vfio_group_viable(group)) + return -EINVAL; + + ret = iommu_group_id(group->iommu_group); + + return ret; +} +EXPORT_SYMBOL_GPL(vfio_group_iommu_id_from_file); + +/** * Module/class support */ static char *vfio_devnode(struct device *dev, umode_t *mode) -- 1.7.10.4 -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/8] vfio: add external user support
On 06/27/2013 07:42 PM, Benjamin Herrenschmidt wrote: On Thu, 2013-06-27 at 16:59 +1000, Stephen Rothwell wrote: +/* Allows an external user (for example, KVM) to unlock an IOMMU group */ +static void vfio_group_del_external_user(struct file *filep) +{ + struct vfio_group *group = filep->private_data; + + BUG_ON(filep->f_op != &vfio_group_fops); We usually reserve BUG_ON for situations where there is no way to continue running or continuing will corrupt the running kernel. Maybe WARN_ON() and return? Not even that. This is a user space provided fd, we shouldn't oops the kernel because we passed a wrong argument, just return -EINVAL or something like that (add a return code). I'll change to WARN_ON but... This is going to be called on KVM exit on a file pointer previously verified for correctness. If it is a wrong file*, then something went terribly wrong. -- Alexey -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 0/8 v5] KVM: PPC: IOMMU in-kernel handling
The changes are: 1. rebased on v3.10 2. added arch_spin_locks to protect TCE table in real mode 3. reworked VFIO external API 4. added missing bits for real mode handling of TCE requests on p7ioc More details are in the individual patch comments. Depends on "hashtable: add hash_for_each_possible_rcu_notrace()", posted earlier today. Alexey Kardashevskiy (8): KVM: PPC: reserve a capability number for multitce support KVM: PPC: reserve a capability and ioctl numbers for realmode VFIO vfio: add external user support powerpc: Prepare to support kernel handling of IOMMU map/unmap powerpc: add real mode support for dma operations on powernv KVM: PPC: Add support for multiple-TCE hcalls KVM: PPC: Add support for IOMMU in-kernel handling KVM: PPC: Add hugepage support for IOMMU in-kernel handling Documentation/virtual/kvm/api.txt | 51 +++ arch/powerpc/include/asm/iommu.h | 9 +- arch/powerpc/include/asm/kvm_host.h | 37 ++ arch/powerpc/include/asm/kvm_ppc.h| 18 +- arch/powerpc/include/asm/machdep.h| 12 + arch/powerpc/include/asm/pgtable-ppc64.h | 4 + arch/powerpc/include/uapi/asm/kvm.h | 7 + arch/powerpc/kernel/iommu.c | 200 +++ arch/powerpc/kvm/book3s_64_vio.c | 541 +- arch/powerpc/kvm/book3s_64_vio_hv.c | 404 -- arch/powerpc/kvm/book3s_hv.c | 41 ++- arch/powerpc/kvm/book3s_hv_rmhandlers.S | 6 + arch/powerpc/kvm/book3s_pr_papr.c | 37 +- arch/powerpc/kvm/powerpc.c| 15 + arch/powerpc/mm/init_64.c | 78 - arch/powerpc/platforms/powernv/pci-ioda.c | 26 +- arch/powerpc/platforms/powernv/pci.c | 38 ++- arch/powerpc/platforms/powernv/pci.h | 2 +- drivers/vfio/vfio.c | 35 ++ include/linux/page-flags.h| 4 +- include/linux/vfio.h | 7 + include/uapi/linux/kvm.h | 3 + 22 files changed, 1453 insertions(+), 122 deletions(-) -- 1.8.3.2 -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 4/8] powerpc: Prepare to support kernel handling of IOMMU map/unmap
The current VFIO-on-POWER implementation supports only user mode driven mapping, i.e. QEMU is sending requests to map/unmap pages. However this approach is really slow, so we want to move that to KVM. Since H_PUT_TCE can be extremely performance sensitive (especially with network adapters where each packet needs to be mapped/unmapped) we chose to implement that as a fast hypercall directly in real mode (processor still in the guest context but MMU off). To be able to do that, we need to provide some facilities to access the struct page count within that real mode environment, as things like the sparsemem vmemmap mappings aren't accessible. This adds an API to increment/decrement the page counter, as the get_user_pages API used for user mode mapping does not work in real mode. CONFIG_SPARSEMEM_VMEMMAP and CONFIG_FLATMEM are supported. Reviewed-by: Paul Mackerras pau...@samba.org Signed-off-by: Paul Mackerras pau...@samba.org Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru --- Changes: 2013/06/27: * realmode_get_page() fixed to use get_page_unless_zero(). If that fails, the call will be passed from real to virtual mode and safely handled. * added comment to PageCompound() in include/linux/page-flags.h. 
2013/05/20: * PageTail() is replaced by PageCompound() in order to have the same checks for whether the page is huge in realmode_get_page() and realmode_put_page() Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru --- arch/powerpc/include/asm/pgtable-ppc64.h | 4 ++ arch/powerpc/mm/init_64.c| 78 +++- include/linux/page-flags.h | 4 +- 3 files changed, 84 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h b/arch/powerpc/include/asm/pgtable-ppc64.h index e3d55f6f..7b46e5f 100644 --- a/arch/powerpc/include/asm/pgtable-ppc64.h +++ b/arch/powerpc/include/asm/pgtable-ppc64.h @@ -376,6 +376,10 @@ static inline pte_t *find_linux_pte_or_hugepte(pgd_t *pgdir, unsigned long ea, } #endif /* !CONFIG_HUGETLB_PAGE */ +struct page *realmode_pfn_to_page(unsigned long pfn); +int realmode_get_page(struct page *page); +int realmode_put_page(struct page *page); + #endif /* __ASSEMBLY__ */ #endif /* _ASM_POWERPC_PGTABLE_PPC64_H_ */ diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c index a90b9c4..7031be3 100644 --- a/arch/powerpc/mm/init_64.c +++ b/arch/powerpc/mm/init_64.c @@ -297,5 +297,81 @@ void vmemmap_free(unsigned long start, unsigned long end) { } -#endif /* CONFIG_SPARSEMEM_VMEMMAP */ +/* + * We do not have access to the sparsemem vmemmap, so we fall back to + * walking the list of sparsemem blocks which we already maintain for + * the sake of crashdump. In the long run, we might want to maintain + * a tree if performance of that linear walk becomes a problem. + * + * Any of the realmode_ functions can fail due to: + * 1) As real sparsemem blocks are not laid out contiguously in RAM + * (they are in virtual address space, which is not available in real mode), + * the requested page struct can be split between blocks so get_page/put_page + * may fail. + * 2) When huge pages are used, the get_page/put_page API will fail + * in real mode as the linked addresses in the page struct are virtual + * too. 
+ * When 1) or 2) takes place, the API returns an error code to cause + * an exit to kernel virtual mode where the operation will be completed. + */ +struct page *realmode_pfn_to_page(unsigned long pfn) +{ + struct vmemmap_backing *vmem_back; + struct page *page; + unsigned long page_size = 1 << mmu_psize_defs[mmu_vmemmap_psize].shift; + unsigned long pg_va = (unsigned long) pfn_to_page(pfn); + + for (vmem_back = vmemmap_list; vmem_back; vmem_back = vmem_back->list) { + if (pg_va < vmem_back->virt_addr) + continue; + + /* Check that page struct is not split between real pages */ + if ((pg_va + sizeof(struct page)) > (vmem_back->virt_addr + page_size)) + return NULL; + + page = (struct page *) (vmem_back->phys + pg_va - + vmem_back->virt_addr); + return page; + } + + return NULL; +} +EXPORT_SYMBOL_GPL(realmode_pfn_to_page); + +#elif defined(CONFIG_FLATMEM) + +struct page *realmode_pfn_to_page(unsigned long pfn) +{ + struct page *page = pfn_to_page(pfn); + return page; +} +EXPORT_SYMBOL_GPL(realmode_pfn_to_page); + +#endif /* CONFIG_SPARSEMEM_VMEMMAP/CONFIG_FLATMEM */ + +#if defined(CONFIG_SPARSEMEM_VMEMMAP) || defined(CONFIG_FLATMEM) +int realmode_get_page(struct page *page) +{ + if (PageCompound(page)) + return -EAGAIN; + + if (!get_page_unless_zero(page)) + return -EAGAIN; + + return 0; +} +EXPORT_SYMBOL_GPL(realmode_get_page); + +int realmode_put_page(struct page *page) +{ + if (PageCompound(page)) + return -EAGAIN
[PATCH 6/8] KVM: PPC: Add support for multiple-TCE hcalls
This adds real mode handlers for the H_PUT_TCE_INDIRECT and H_STUFF_TCE hypercalls for QEMU emulated devices such as IBMVIO devices or emulated PCI. These calls allow adding multiple entries (up to 512) to the TCE table in one call, which saves time on the transition to/from real mode. This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs (copied from user space and verified) before writing the whole list into the TCE table. This cache will be utilized more in the upcoming VFIO/IOMMU support to continue TCE list processing in virtual mode in case the real mode handler fails for some reason. This adds a guest physical to host real address converter and calls the existing H_PUT_TCE handler. The converting function is going to be fully utilized by upcoming VFIO support patches. This also implements the KVM_CAP_PPC_MULTITCE capability, so in order to support the functionality of this patch, QEMU needs to query for this capability and set the hcall-multi-tce hypertas property only if the capability is present; otherwise there will be serious performance degradation. 
Signed-off-by: Paul Mackerras pau...@samba.org Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru --- Changelog: 2013/07/06: * fixed a number of wrong get_page()/put_page() calls 2013/06/27: * fixed clear of BUSY bit in kvmppc_lookup_pte() * H_PUT_TCE_INDIRECT does realmode_get_page() now * KVM_CAP_SPAPR_MULTITCE now depends on CONFIG_PPC_BOOK3S_64 * updated doc 2013/06/05: * fixed typo about IBMVIO in the commit message * updated doc and moved it to another section * changed capability number 2013/05/21: * added kvm_vcpu_arch::tce_tmp * removed cleanup if put_indirect failed; instead we do not even start writing to the TCE table if we cannot get TCEs from the user or they are invalid * kvmppc_emulated_h_put_tce is split into kvmppc_emulated_put_tce and kvmppc_emulated_validate_tce (for the previous item) * fixed bug with fallthrough for H_IPI * removed all get_user() from real mode handlers * kvmppc_lookup_pte() added (instead of making lookup_linux_pte public) Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru --- Documentation/virtual/kvm/api.txt | 25 +++ arch/powerpc/include/asm/kvm_host.h | 9 ++ arch/powerpc/include/asm/kvm_ppc.h | 16 +- arch/powerpc/kvm/book3s_64_vio.c| 154 ++- arch/powerpc/kvm/book3s_64_vio_hv.c | 260 arch/powerpc/kvm/book3s_hv.c| 41 - arch/powerpc/kvm/book3s_hv_rmhandlers.S | 6 + arch/powerpc/kvm/book3s_pr_papr.c | 37 - arch/powerpc/kvm/powerpc.c | 3 + 9 files changed, 517 insertions(+), 34 deletions(-) diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt index 6365fef..762c703 100644 --- a/Documentation/virtual/kvm/api.txt +++ b/Documentation/virtual/kvm/api.txt @@ -2362,6 +2362,31 @@ calls by the guest for that service will be passed to userspace to be handled. +4.86 KVM_CAP_PPC_MULTITCE + +Capability: KVM_CAP_PPC_MULTITCE +Architectures: ppc +Type: vm + +This capability means the kernel is capable of handling hypercalls +H_PUT_TCE_INDIRECT and H_STUFF_TCE without passing those into the user +space. 
This significantly accelerates DMA operations for PPC KVM guests. +The user space should expect that its handlers for these hypercalls +are not going to be called. + +In order to enable H_PUT_TCE_INDIRECT and H_STUFF_TCE use in the guest, +the user space might have to advertise it for the guest. For example, +an IBM pSeries guest starts using them if hcall-multi-tce is present in +the ibm,hypertas-functions device-tree property. + +Without this capability, only H_PUT_TCE is handled by the kernel and +therefore the use of H_PUT_TCE_INDIRECT and H_STUFF_TCE is not recommended +unless the capability is present, as passing hypercalls to the userspace +slows operations a lot. + +Unlike other capabilities of this section, this one is always enabled. + + 5. The kvm_run structure diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h index af326cd..20d04bd 100644 --- a/arch/powerpc/include/asm/kvm_host.h +++ b/arch/powerpc/include/asm/kvm_host.h @@ -180,6 +180,7 @@ struct kvmppc_spapr_tce_table { struct kvm *kvm; u64 liobn; u32 window_size; + struct { struct { unsigned long put, indir, stuff; } rm, vm; } stat; struct page *pages[0]; }; @@ -609,6 +610,14 @@ struct kvm_vcpu_arch { spinlock_t tbacct_lock; u64 busy_stolen; u64 busy_preempt; + + unsigned long *tce_tmp_hpas;/* TCE cache for TCE_PUT_INDIRECT hcall */ + enum { + TCERM_NONE, + TCERM_GETPAGE, + TCERM_PUTTCE, + TCERM_PUTLIST, + } tce_rm_fail; /* failed stage of request processing */ #endif }; diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm
[PATCH 3/8] vfio: add external user support
VFIO is designed to be used via ioctls on file descriptors returned by VFIO. However, in some situations support for an external user is required. The first user is KVM on PPC64 (SPAPR TCE protocol) which is going to use the existing VFIO groups for exclusive access in real/virtual mode on a host to avoid passing map/unmap requests to the user space, which would make things pretty slow. The proposed protocol includes: 1. do normal VFIO init stuff such as opening a new container, attaching group(s) to it, setting an IOMMU driver for a container. When IOMMU is set for a container, all groups in it are considered ready to use by an external user. 2. pass a fd of the group we want to accelerate to KVM. KVM calls vfio_group_get_external_user() to verify that the group is initialized and IOMMU is set for it, and to increment the container user counter to prevent the VFIO group from disposal prior to KVM exit. The current TCE IOMMU driver marks the whole IOMMU table as busy when IOMMU is set for a container, which prevents other DMA users from allocating from it, so it is safe to grant user space access to it. 3. KVM calls vfio_external_user_iommu_id() to obtain an IOMMU ID which KVM uses to get an iommu_group struct for later use. 4. When KVM is finished, it calls vfio_group_put_external_user() to release the VFIO group by decrementing the container user counter. Everything gets released. The vfio: Limit group opens patch is also required for consistency. Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru --- diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c index c488da5..57aa191 100644 --- a/drivers/vfio/vfio.c +++ b/drivers/vfio/vfio.c @@ -1370,6 +1370,62 @@ static const struct file_operations vfio_device_fops = { }; /** + * External user API, exported by symbols to be linked dynamically. + * + * The protocol includes: + * 1. do normal VFIO init operation: + * - opening a new container; + * - attaching group(s) to it; + * - setting an IOMMU driver for a container. 
+ * When IOMMU is set for a container, all groups in it are + * considered ready to use by an external user. + * + * 2. The user space passes a group fd which we want to accelerate in + * KVM. KVM uses vfio_group_get_external_user() to verify that: + * - the group is initialized; + * - IOMMU is set for it. + * Then vfio_group_get_external_user() increments the container user + * counter to prevent the VFIO group from disposal prior to KVM exit. + * + * 3. KVM calls vfio_external_user_iommu_id() to know an IOMMU ID which + * KVM uses to get an iommu_group struct for later use. + * + * 4. When KVM is finished, it calls vfio_group_put_external_user() to + * release the VFIO group by decrementing the container user counter. + */ +struct vfio_group *vfio_group_get_external_user(struct file *filep) +{ + struct vfio_group *group = filep->private_data; + + if (filep->f_op != &vfio_group_fops) + return NULL; + + if (!atomic_inc_not_zero(&group->container_users)) + return NULL; + + if (!group->container->iommu_driver || + !vfio_group_viable(group)) { + atomic_dec(&group->container_users); + return NULL; + } + + return group; +} +EXPORT_SYMBOL_GPL(vfio_group_get_external_user); + +void vfio_group_put_external_user(struct vfio_group *group) +{ + vfio_group_try_dissolve_container(group); +} +EXPORT_SYMBOL_GPL(vfio_group_put_external_user); + +int vfio_external_user_iommu_id(struct vfio_group *group) +{ + return iommu_group_id(group->iommu_group); +} +EXPORT_SYMBOL_GPL(vfio_external_user_iommu_id); + +/** * Module/class support */ static char *vfio_devnode(struct device *dev, umode_t *mode) diff --git a/include/linux/vfio.h b/include/linux/vfio.h index ac8d488..24579a0 100644 --- a/include/linux/vfio.h +++ b/include/linux/vfio.h @@ -90,4 +90,11 @@ extern void vfio_unregister_iommu_driver( TYPE tmp; \ offsetof(TYPE, MEMBER) + sizeof(tmp.MEMBER); }) \ +/* + * External user API + */ +extern struct vfio_group *vfio_group_get_external_user(struct file *filep); +extern void 
vfio_group_put_external_user(struct vfio_group *group); +extern int vfio_external_user_iommu_id(struct vfio_group *group); + #endif /* VFIO_H */ -- 1.8.3.2 -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/8] KVM: PPC: reserve a capability and ioctl numbers for realmode VFIO
This is to reserve a capability number for upcoming support of VFIO-IOMMU DMA operations in real mode. Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru --- include/uapi/linux/kvm.h | 2 ++ 1 file changed, 2 insertions(+) diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index 970b1f5..0865c01 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -667,6 +667,7 @@ struct kvm_ppc_smmu_info { #define KVM_CAP_PPC_RTAS 91 #define KVM_CAP_IRQ_XICS 92 #define KVM_CAP_SPAPR_MULTITCE 93 +#define KVM_CAP_SPAPR_TCE_IOMMU 94 #ifdef KVM_CAP_IRQ_ROUTING @@ -923,6 +924,7 @@ struct kvm_s390_ucas_mapping { /* Available with KVM_CAP_PPC_ALLOC_HTAB */ #define KVM_PPC_ALLOCATE_HTAB _IOWR(KVMIO, 0xa7, __u32) #define KVM_CREATE_SPAPR_TCE _IOW(KVMIO, 0xa8, struct kvm_create_spapr_tce) +#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO, 0xaf, struct kvm_create_spapr_tce_iommu) /* Available with KVM_CAP_RMA */ #define KVM_ALLOCATE_RMA _IOR(KVMIO, 0xa9, struct kvm_allocate_rma) /* Available with KVM_CAP_PPC_HTAB_FD */ -- 1.8.3.2 -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 7/8] KVM: PPC: Add support for IOMMU in-kernel handling
This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT and H_STUFF_TCE requests without passing them to QEMU, which saves time on switching to QEMU and back. Both real and virtual modes are supported. First the kernel tries to handle a TCE request in real mode; if that fails, it passes the request to virtual mode to complete the operation. If the virtual mode handler fails, the request is passed to user mode. This adds a new KVM_CAP_SPAPR_TCE_IOMMU ioctl to associate a virtual PCI bus ID (LIOBN) with an IOMMU group, which enables in-kernel handling of IOMMU map/unmap. The external user API support in VFIO is required. Tests show that this patch increases transmission speed from 220MB/s to 750..1020MB/s on a 10Gb network (Chelsio CXGB3 10Gb Ethernet card). Signed-off-by: Paul Mackerras pau...@samba.org Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru --- Changes: 2013/07/06: * added realmode arch_spin_lock to protect TCE table from races in real and virtual modes * POWERPC IOMMU API is changed to support real mode * iommu_take_ownership and iommu_release_ownership are protected by iommu_table's locks * VFIO external user API usage rewritten * multiple small fixes 2013/06/27: * tce_list page is referenced now in order to protect it from accidental invalidation during H_PUT_TCE_INDIRECT execution * added use of the external user VFIO API 2013/06/05: * changed capability number * changed ioctl number * update the doc article number 2013/05/20: * removed get_user() from real mode handlers * kvm_vcpu_arch::tce_tmp usage extended. 
Now the real mode handler puts the translated TCEs there, tries realmode_get_page() on them, and if that fails, passes control over to the virtual mode handler, which tries to finish the request handling * kvmppc_lookup_pte() now does realmode_get_page() protected by the BUSY bit on a page * The only reason to pass the request to user mode now is when the user mode did not register a TCE table in the kernel; in all other cases the virtual mode handler is expected to do the job Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru --- Documentation/virtual/kvm/api.txt | 26 arch/powerpc/include/asm/iommu.h| 9 +- arch/powerpc/include/asm/kvm_host.h | 3 + arch/powerpc/include/asm/kvm_ppc.h | 2 + arch/powerpc/include/uapi/asm/kvm.h | 7 + arch/powerpc/kernel/iommu.c | 196 +++ arch/powerpc/kvm/book3s_64_vio.c| 299 +++- arch/powerpc/kvm/book3s_64_vio_hv.c | 129 arch/powerpc/kvm/powerpc.c | 12 ++ 9 files changed, 609 insertions(+), 74 deletions(-) diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt index 762c703..01b0dc2 100644 --- a/Documentation/virtual/kvm/api.txt +++ b/Documentation/virtual/kvm/api.txt @@ -2387,6 +2387,32 @@ slows operations a lot. Unlike other capabilities of this section, this one is always enabled. +4.87 KVM_CREATE_SPAPR_TCE_IOMMU + +Capability: KVM_CAP_SPAPR_TCE_IOMMU +Architectures: powerpc +Type: vm ioctl +Parameters: struct kvm_create_spapr_tce_iommu (in) +Returns: 0 on success, -1 on error + +struct kvm_create_spapr_tce_iommu { + __u64 liobn; + __u32 iommu_id; + __u32 flags; +}; + +This creates a link between an IOMMU group and a hardware TCE (translation +control entry) table. This link lets the host kernel know what IOMMU +group (i.e. TCE table) to use for the LIOBN number passed with +H_PUT_TCE, H_PUT_TCE_INDIRECT, H_STUFF_TCE hypercalls. + +In response to a TCE hypercall, the kernel looks for a TCE table descriptor +in the list and handles the hypercall in real or virtual modes if +the descriptor is found. 
Otherwise the hypercall is passed to the user mode. + +No flag is supported at the moment. + + 5. The kvm_run structure diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h index 98d1422..0845505 100644 --- a/arch/powerpc/include/asm/iommu.h +++ b/arch/powerpc/include/asm/iommu.h @@ -78,6 +78,7 @@ struct iommu_table { unsigned long *it_map; /* A simple allocation bitmap for now */ #ifdef CONFIG_IOMMU_API struct iommu_group *it_group; + arch_spinlock_t it_rm_lock; #endif }; @@ -159,9 +160,9 @@ extern int iommu_tce_clear_param_check(struct iommu_table *tbl, extern int iommu_tce_put_param_check(struct iommu_table *tbl, unsigned long ioba, unsigned long tce); extern int iommu_tce_build(struct iommu_table *tbl, unsigned long entry, - unsigned long hwaddr, enum dma_data_direction direction); -extern unsigned long iommu_clear_tce(struct iommu_table *tbl, - unsigned long entry); + unsigned long *hpas, unsigned long npages, bool rm); +extern int iommu_free_tces(struct iommu_table *tbl, unsigned long entry, + unsigned long npages, bool rm); extern int iommu_clear_tces_and_put_pages(struct iommu_table *tbl, unsigned long
[PATCH 1/8] KVM: PPC: reserve a capability number for multitce support
This is to reserve a capability number for upcoming support of the H_PUT_TCE_INDIRECT and H_STUFF_TCE pseries hypercalls which support multiple DMA map/unmap operations per call. Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru --- include/uapi/linux/kvm.h | 1 + 1 file changed, 1 insertion(+) diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index d88c8ee..970b1f5 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -666,6 +666,7 @@ struct kvm_ppc_smmu_info { #define KVM_CAP_IRQ_MPIC 90 #define KVM_CAP_PPC_RTAS 91 #define KVM_CAP_IRQ_XICS 92 +#define KVM_CAP_SPAPR_MULTITCE 93 #ifdef KVM_CAP_IRQ_ROUTING -- 1.8.3.2 -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 8/8] KVM: PPC: Add hugepage support for IOMMU in-kernel handling
This adds special support for huge pages (16MB). The reference counting cannot be easily done for such pages in real mode (when the MMU is off) so we added a list of huge pages. It is populated in virtual mode and get_page is called just once per huge page. Real mode handlers check whether the requested page is huge and already in the list; if so, no reference counting is done, otherwise an exit to virtual mode happens. The list is released at KVM exit. At the moment the fastest card available for tests uses up to 9 huge pages so walking through this list is not very expensive. However this can change and we may want to optimize this. Signed-off-by: Paul Mackerras pau...@samba.org Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru --- Changes: 2013/06/27: * list of huge pages replaced with a hashtable for better performance * spinlock removed from real mode and only protects insertion of new huge page descriptors into the hashtable 2013/06/05: * fixed compile error when CONFIG_IOMMU_API=n 2013/05/20: * the real mode handler now searches for a huge page by gpa (used to be pte) * the virtual mode handler prints a warning if it is called twice for the same huge page as the real mode handler is expected to fail just once - when a huge page is not in the list yet. * the huge page is refcounted twice - when added to the hugepage list and when used in the virtual mode hcall handler (can be optimized but it will make the patch less nice). 
Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru --- arch/powerpc/include/asm/kvm_host.h | 25 + arch/powerpc/kernel/iommu.c | 6 ++- arch/powerpc/kvm/book3s_64_vio.c| 104 +--- arch/powerpc/kvm/book3s_64_vio_hv.c | 21 ++-- 4 files changed, 146 insertions(+), 10 deletions(-) diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h index 53e61b2..a7508cf 100644 --- a/arch/powerpc/include/asm/kvm_host.h +++ b/arch/powerpc/include/asm/kvm_host.h @@ -30,6 +30,7 @@ #include linux/kvm_para.h #include linux/list.h #include linux/atomic.h +#include linux/hashtable.h #include asm/kvm_asm.h #include asm/processor.h #include asm/page.h @@ -182,10 +183,34 @@ struct kvmppc_spapr_tce_table { u32 window_size; struct iommu_group *grp;/* used for IOMMU groups */ struct vfio_group *vfio_grp;/* used for IOMMU groups */ + DECLARE_HASHTABLE(hash_tab, ilog2(64)); /* used for IOMMU groups */ + spinlock_t hugepages_write_lock;/* used for IOMMU groups */ struct { struct { unsigned long put, indir, stuff; } rm, vm; } stat; struct page *pages[0]; }; +/* + * The KVM guest can be backed with 16MB pages. + * In this case, we cannot do page counting from the real mode + * as the compound pages are used - they are linked in a list + * with pointers as virtual addresses which are inaccessible + * in real mode. + * + * The code below keeps a 16MB pages list and uses page struct + * in real mode if it is already locked in RAM and inserted into + * the list or switches to the virtual mode where it can be + * handled in a usual manner. 
+ */ +#define KVMPPC_SPAPR_HUGEPAGE_HASH(gpa)hash_32(gpa 24, 32) + +struct kvmppc_spapr_iommu_hugepage { + struct hlist_node hash_node; + unsigned long gpa; /* Guest physical address */ + unsigned long hpa; /* Host physical address */ + struct page *page; /* page struct of the very first subpage */ + unsigned long size; /* Huge page size (always 16MB at the moment) */ +}; + struct kvmppc_linear_info { void*base_virt; unsigned longbase_pfn; diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c index 51678ec..e0b6eca 100644 --- a/arch/powerpc/kernel/iommu.c +++ b/arch/powerpc/kernel/iommu.c @@ -999,7 +999,8 @@ int iommu_free_tces(struct iommu_table *tbl, unsigned long entry, if (!pg) { ret = -EAGAIN; } else if (PageCompound(pg)) { - ret = -EAGAIN; + /* Hugepages will be released at KVM exit */ + ret = 0; } else { if (oldtce TCE_PCI_WRITE) SetPageDirty(pg); @@ -1009,6 +1010,9 @@ int iommu_free_tces(struct iommu_table *tbl, unsigned long entry, struct page *pg = pfn_to_page(oldtce PAGE_SHIFT); if (!pg) { ret = -EAGAIN; + } else if (PageCompound(pg)) { + /* Hugepages will be released at KVM exit */ + ret = 0; } else { if (oldtce TCE_PCI_WRITE) SetPageDirty(pg); diff --git a/arch
[PATCH 5/8] powerpc: add real mode support for dma operations on powernv
The existing TCE machine calls (tce_build and tce_free) only support virtual mode as they call __raw_writeq for TCE invalidation, which fails in real mode. This introduces tce_build_rm and tce_free_rm real mode versions which do mostly the same but use the Store Doubleword Caching Inhibited Indexed (stdcix) instruction for TCE invalidation. This new feature is going to be utilized by real mode support of VFIO. Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru --- arch/powerpc/include/asm/machdep.h| 12 ++ arch/powerpc/platforms/powernv/pci-ioda.c | 26 +++-- arch/powerpc/platforms/powernv/pci.c | 38 ++- arch/powerpc/platforms/powernv/pci.h | 2 +- 4 files changed, 64 insertions(+), 14 deletions(-) diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h index 92386fc..0c19eef 100644 --- a/arch/powerpc/include/asm/machdep.h +++ b/arch/powerpc/include/asm/machdep.h @@ -75,6 +75,18 @@ struct machdep_calls { long index); void(*tce_flush)(struct iommu_table *tbl); + /* _rm versions are for real mode use only */ + int (*tce_build_rm)(struct iommu_table *tbl, +long index, +long npages, +unsigned long uaddr, +enum dma_data_direction direction, +struct dma_attrs *attrs); + void(*tce_free_rm)(struct iommu_table *tbl, + long index, + long npages); + void(*tce_flush_rm)(struct iommu_table *tbl); + void __iomem * (*ioremap)(phys_addr_t addr, unsigned long size, unsigned long flags, void *caller); void(*iounmap)(volatile void __iomem *token); diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index 2931d97..2797dec 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -68,6 +68,12 @@ define_pe_printk_level(pe_err, KERN_ERR); define_pe_printk_level(pe_warn, KERN_WARNING); define_pe_printk_level(pe_info, KERN_INFO); +static inline void rm_writed(unsigned long paddr, u64 val) +{ + __asm__ __volatile__("sync; stdcix %0,0,%1" + : : "r" (val), "r" (paddr) : "memory"); +} + static int 
pnv_ioda_alloc_pe(struct pnv_phb *phb) { unsigned long pe; @@ -442,7 +448,7 @@ static void pnv_pci_ioda_dma_dev_setup(struct pnv_phb *phb, struct pci_dev *pdev } static void pnv_pci_ioda1_tce_invalidate(struct iommu_table *tbl, -u64 *startp, u64 *endp) +u64 *startp, u64 *endp, bool rm) { u64 __iomem *invalidate = (u64 __iomem *)tbl-it_index; unsigned long start, end, inc; @@ -471,7 +477,10 @@ static void pnv_pci_ioda1_tce_invalidate(struct iommu_table *tbl, mb(); /* Ensure above stores are visible */ while (start = end) { -__raw_writeq(start, invalidate); + if (rm) + rm_writed((unsigned long) invalidate, start); + else + __raw_writeq(start, invalidate); start += inc; } @@ -483,7 +492,7 @@ static void pnv_pci_ioda1_tce_invalidate(struct iommu_table *tbl, static void pnv_pci_ioda2_tce_invalidate(struct pnv_ioda_pe *pe, struct iommu_table *tbl, -u64 *startp, u64 *endp) +u64 *startp, u64 *endp, bool rm) { unsigned long start, end, inc; u64 __iomem *invalidate = (u64 __iomem *)tbl-it_index; @@ -502,22 +511,25 @@ static void pnv_pci_ioda2_tce_invalidate(struct pnv_ioda_pe *pe, mb(); while (start = end) { - __raw_writeq(start, invalidate); + if (rm) + rm_writed((unsigned long) invalidate, start); + else + __raw_writeq(start, invalidate); start += inc; } } void pnv_pci_ioda_tce_invalidate(struct iommu_table *tbl, -u64 *startp, u64 *endp) +u64 *startp, u64 *endp, bool rm) { struct pnv_ioda_pe *pe = container_of(tbl, struct pnv_ioda_pe, tce32_table); struct pnv_phb *phb = pe-phb; if (phb-type == PNV_PHB_IODA1) - pnv_pci_ioda1_tce_invalidate(tbl, startp, endp); + pnv_pci_ioda1_tce_invalidate(tbl, startp, endp, rm); else - pnv_pci_ioda2_tce_invalidate(pe, tbl, startp, endp); + pnv_pci_ioda2_tce_invalidate(pe, tbl
Re: [PATCH 3/8] vfio: add external user support
On 07/09/2013 07:52 AM, Alex Williamson wrote: On Sun, 2013-07-07 at 01:07 +1000, Alexey Kardashevskiy wrote: VFIO is designed to be used via ioctls on file descriptors returned by VFIO. However in some situations support for an external user is required. The first user is KVM on PPC64 (SPAPR TCE protocol) which is going to use the existing VFIO groups for exclusive access in real/virtual mode on a host to avoid passing map/unmap requests to the user space which would make things pretty slow. The proposed protocol includes: 1. do normal VFIO init stuff such as opening a new container, attaching group(s) to it, setting an IOMMU driver for a container. When IOMMU is set for a container, all groups in it are considered ready to use by an external user. 2. pass a fd of the group we want to accelerate to KVM. KVM calls vfio_group_get_external_user() to verify that the group is initialized and IOMMU is set for it, and to increment the container user counter to prevent the VFIO group from disposal prior to KVM exit. The current TCE IOMMU driver marks the whole IOMMU table as busy when IOMMU is set for a container, which prevents other DMA users from allocating from it, so it is safe to grant user space access to it. 3. KVM calls vfio_external_user_iommu_id() to obtain an IOMMU ID which KVM uses to get an iommu_group struct for later use. 4. When KVM is finished, it calls vfio_group_put_external_user() to release the VFIO group by decrementing the container user counter. Everything gets released. The "vfio: Limit group opens" patch is also required for consistency. Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru --- diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c index c488da5..57aa191 100644 --- a/drivers/vfio/vfio.c +++ b/drivers/vfio/vfio.c @@ -1370,6 +1370,62 @@ static const struct file_operations vfio_device_fops = { }; /** + * External user API, exported by symbols to be linked dynamically. + * + * The protocol includes: + * 1. 
do normal VFIO init operation: + * - opening a new container; + * - attaching group(s) to it; + * - setting an IOMMU driver for a container. + * When IOMMU is set for a container, all groups in it are + * considered ready to use by an external user. + * + * 2. The user space passed a group fd which we want to accelerate in + * KVM. KVM uses vfio_group_get_external_user() to verify that: + * - the group is initialized; + * - IOMMU is set for it. + * Then vfio_group_get_external_user() increments the container user + * counter to prevent the VFIO group from disposal prior to KVM exit. + * + * 3. KVM calls vfio_external_user_iommu_id() to know an IOMMU ID which + * KVM uses to get an iommu_group struct for later use. + * + * 4. When KVM is finished, it calls vfio_group_put_external_user() to + * release the VFIO group by decrementing the container user counter. nit, the interface is for any external user, not just kvm. s/KVM/An external user/ ? Or add the description below uses KVM just as an example of an external user? + */ +struct vfio_group *vfio_group_get_external_user(struct file *filep) +{ +struct vfio_group *group = filep-private_data; + +if (filep-f_op != vfio_group_fops) +return NULL; ERR_PTR(-EINVAL) There also needs to be a vfio_group_get(group) here and put in error cases. Is that because I do not hold a reference to the file anymore? 
+ +if (!atomic_inc_not_zero(group-container_users)) +return NULL; ERR_PTR(-EINVAL) + +if (!group-container-iommu_driver || +!vfio_group_viable(group)) { +atomic_dec(group-container_users); +return NULL; ERR_PTR(-EINVAL) +} + +return group; +} +EXPORT_SYMBOL_GPL(vfio_group_get_external_user); + +void vfio_group_put_external_user(struct vfio_group *group) +{ +vfio_group_try_dissolve_container(group); And a vfio_group_put(group) here +} +EXPORT_SYMBOL_GPL(vfio_group_put_external_user); + +int vfio_external_user_iommu_id(struct vfio_group *group) +{ +return iommu_group_id(group-iommu_group); +} +EXPORT_SYMBOL_GPL(vfio_external_user_iommu_id); + +/** * Module/class support */ static char *vfio_devnode(struct device *dev, umode_t *mode) diff --git a/include/linux/vfio.h b/include/linux/vfio.h index ac8d488..24579a0 100644 --- a/include/linux/vfio.h +++ b/include/linux/vfio.h @@ -90,4 +90,11 @@ extern void vfio_unregister_iommu_driver( TYPE tmp; \ offsetof(TYPE, MEMBER) + sizeof(tmp.MEMBER); }) \ +/* + * External user API + */ +extern struct vfio_group *vfio_group_get_external_user(struct file *filep); +extern void vfio_group_put_external_user(struct vfio_group *group); +extern int vfio_external_user_iommu_id(struct vfio_group *group); + #endif /* VFIO_H */ -- Alexey -- To unsubscribe from this list: send
Re: [PATCH 8/8] KVM: PPC: Add hugepage support for IOMMU in-kernel handling
On 07/10/2013 03:32 AM, Alexander Graf wrote: On 07/06/2013 05:07 PM, Alexey Kardashevskiy wrote: This adds special support for huge pages (16MB). The reference counting cannot be easily done for such pages in real mode (when MMU is off) so we added a list of huge pages. It is populated in virtual mode and get_page is called just once per a huge page. Real mode handlers check if the requested page is huge and in the list, then no reference counting is done, otherwise an exit to virtual mode happens. The list is released at KVM exit. At the moment the fastest card available for tests uses up to 9 huge pages so walking through this list is not very expensive. However this can change and we may want to optimize this. Signed-off-by: Paul Mackerraspau...@samba.org Signed-off-by: Alexey Kardashevskiya...@ozlabs.ru --- Changes: 2013/06/27: * list of huge pages replaces with hashtable for better performance So the only thing your patch description really talks about is not true anymore? * spinlock removed from real mode and only protects insertion of new huge [ages descriptors into the hashtable 2013/06/05: * fixed compile error when CONFIG_IOMMU_API=n 2013/05/20: * the real mode handler now searches for a huge page by gpa (used to be pte) * the virtual mode handler prints warning if it is called twice for the same huge page as the real mode handler is expected to fail just once - when a huge page is not in the list yet. * the huge page is refcounted twice - when added to the hugepage list and when used in the virtual mode hcall handler (can be optimized but it will make the patch less nice). 
Signed-off-by: Alexey Kardashevskiya...@ozlabs.ru --- arch/powerpc/include/asm/kvm_host.h | 25 + arch/powerpc/kernel/iommu.c | 6 ++- arch/powerpc/kvm/book3s_64_vio.c| 104 +--- arch/powerpc/kvm/book3s_64_vio_hv.c | 21 ++-- 4 files changed, 146 insertions(+), 10 deletions(-) diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h index 53e61b2..a7508cf 100644 --- a/arch/powerpc/include/asm/kvm_host.h +++ b/arch/powerpc/include/asm/kvm_host.h @@ -30,6 +30,7 @@ #includelinux/kvm_para.h #includelinux/list.h #includelinux/atomic.h +#includelinux/hashtable.h #includeasm/kvm_asm.h #includeasm/processor.h #includeasm/page.h @@ -182,10 +183,34 @@ struct kvmppc_spapr_tce_table { u32 window_size; struct iommu_group *grp;/* used for IOMMU groups */ struct vfio_group *vfio_grp;/* used for IOMMU groups */ +DECLARE_HASHTABLE(hash_tab, ilog2(64));/* used for IOMMU groups */ +spinlock_t hugepages_write_lock;/* used for IOMMU groups */ struct { struct { unsigned long put, indir, stuff; } rm, vm; } stat; struct page *pages[0]; }; +/* + * The KVM guest can be backed with 16MB pages. + * In this case, we cannot do page counting from the real mode + * as the compound pages are used - they are linked in a list + * with pointers as virtual addresses which are inaccessible + * in real mode. + * + * The code below keeps a 16MB pages list and uses page struct + * in real mode if it is already locked in RAM and inserted into + * the list or switches to the virtual mode where it can be + * handled in a usual manner. 
+ */ +#define KVMPPC_SPAPR_HUGEPAGE_HASH(gpa)hash_32(gpa 24, 32) + +struct kvmppc_spapr_iommu_hugepage { +struct hlist_node hash_node; +unsigned long gpa;/* Guest physical address */ +unsigned long hpa;/* Host physical address */ +struct page *page;/* page struct of the very first subpage */ +unsigned long size;/* Huge page size (always 16MB at the moment) */ +}; + struct kvmppc_linear_info { void*base_virt; unsigned long base_pfn; diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c index 51678ec..e0b6eca 100644 --- a/arch/powerpc/kernel/iommu.c +++ b/arch/powerpc/kernel/iommu.c @@ -999,7 +999,8 @@ int iommu_free_tces(struct iommu_table *tbl, unsigned long entry, if (!pg) { ret = -EAGAIN; } else if (PageCompound(pg)) { -ret = -EAGAIN; +/* Hugepages will be released at KVM exit */ +ret = 0; } else { if (oldtce TCE_PCI_WRITE) SetPageDirty(pg); @@ -1009,6 +1010,9 @@ int iommu_free_tces(struct iommu_table *tbl, unsigned long entry, struct page *pg = pfn_to_page(oldtce PAGE_SHIFT); if (!pg) { ret = -EAGAIN; +} else if (PageCompound(pg)) { +/* Hugepages will be released at KVM exit */ +ret = 0; } else { if (oldtce TCE_PCI_WRITE) SetPageDirty(pg); diff --git a/arch/powerpc
Re: [PATCH 2/8] KVM: PPC: reserve a capability and ioctl numbers for realmode VFIO
On 07/10/2013 01:35 AM, Alexander Graf wrote: On 06/27/2013 07:02 AM, Alexey Kardashevskiy wrote: Signed-off-by: Alexey Kardashevskiya...@ozlabs.ru --- include/uapi/linux/kvm.h |2 ++ 1 file changed, 2 insertions(+) diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index 970b1f5..0865c01 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -667,6 +667,7 @@ struct kvm_ppc_smmu_info { #define KVM_CAP_PPC_RTAS 91 #define KVM_CAP_IRQ_XICS 92 #define KVM_CAP_SPAPR_MULTITCE 93 +#define KVM_CAP_SPAPR_TCE_IOMMU 94 #ifdef KVM_CAP_IRQ_ROUTING @@ -923,6 +924,7 @@ struct kvm_s390_ucas_mapping { /* Available with KVM_CAP_PPC_ALLOC_HTAB */ #define KVM_PPC_ALLOCATE_HTAB _IOWR(KVMIO, 0xa7, __u32) #define KVM_CREATE_SPAPR_TCE _IOW(KVMIO, 0xa8, struct kvm_create_spapr_tce) +#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO, 0xaf, struct kvm_create_spapr_tce_iommu) Please order them by number. Oh. Again :( We have had this discussion with Scott Wood here already. Where _exactly_ do you want me to put it? Many sections, not really ordered. Thank you. -- Alexey -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 6/8] KVM: PPC: Add support for multiple-TCE hcalls
On 07/10/2013 03:02 AM, Alexander Graf wrote: On 07/06/2013 05:07 PM, Alexey Kardashevskiy wrote: This adds real mode handlers for the H_PUT_TCE_INDIRECT and H_STUFF_TCE hypercalls for QEMU emulated devices such as IBMVIO devices or emulated PCI. These calls allow adding multiple entries (up to 512) into the TCE table in one call which saves time on transition to/from real mode. We don't mention QEMU explicitly in KVM code usually. This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs (copied from user and verified) before writing the whole list into the TCE table. This cache will be utilized more in the upcoming VFIO/IOMMU support to continue TCE list processing in the virtual mode in the case if the real mode handler failed for some reason. This adds a guest physical to host real address converter and calls the existing H_PUT_TCE handler. The converting function is going to be fully utilized by upcoming VFIO supporting patches. This also implements the KVM_CAP_PPC_MULTITCE capability, so in order to support the functionality of this patch, QEMU needs to query for this capability and set the hcall-multi-tce hypertas property only if the capability is present, otherwise there will be serious performance degradation. Same as above. But really you're only giving recommendations here. What's the point? Please describe what the benefit of this patch is, not what some other random subsystem might do with the benefits it brings. 
Signed-off-by: Paul Mackerraspau...@samba.org Signed-off-by: Alexey Kardashevskiya...@ozlabs.ru --- Changelog: 2013/07/06: * fixed number of wrong get_page()/put_page() calls 2013/06/27: * fixed clear of BUSY bit in kvmppc_lookup_pte() * H_PUT_TCE_INDIRECT does realmode_get_page() now * KVM_CAP_SPAPR_MULTITCE now depends on CONFIG_PPC_BOOK3S_64 * updated doc 2013/06/05: * fixed mistype about IBMVIO in the commit message * updated doc and moved it to another section * changed capability number 2013/05/21: * added kvm_vcpu_arch::tce_tmp * removed cleanup if put_indirect failed, instead we do not even start writing to TCE table if we cannot get TCEs from the user and they are invalid * kvmppc_emulated_h_put_tce is split to kvmppc_emulated_put_tce and kvmppc_emulated_validate_tce (for the previous item) * fixed bug with failthrough for H_IPI * removed all get_user() from real mode handlers * kvmppc_lookup_pte() added (instead of making lookup_linux_pte public) Signed-off-by: Alexey Kardashevskiya...@ozlabs.ru --- Documentation/virtual/kvm/api.txt | 25 +++ arch/powerpc/include/asm/kvm_host.h | 9 ++ arch/powerpc/include/asm/kvm_ppc.h | 16 +- arch/powerpc/kvm/book3s_64_vio.c| 154 ++- arch/powerpc/kvm/book3s_64_vio_hv.c | 260 arch/powerpc/kvm/book3s_hv.c| 41 - arch/powerpc/kvm/book3s_hv_rmhandlers.S | 6 + arch/powerpc/kvm/book3s_pr_papr.c | 37 - arch/powerpc/kvm/powerpc.c | 3 + 9 files changed, 517 insertions(+), 34 deletions(-) diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt index 6365fef..762c703 100644 --- a/Documentation/virtual/kvm/api.txt +++ b/Documentation/virtual/kvm/api.txt @@ -2362,6 +2362,31 @@ calls by the guest for that service will be passed to userspace to be handled. +4.86 KVM_CAP_PPC_MULTITCE + +Capability: KVM_CAP_PPC_MULTITCE +Architectures: ppc +Type: vm + +This capability means the kernel is capable of handling hypercalls +H_PUT_TCE_INDIRECT and H_STUFF_TCE without passing those into the user +space. 
This significanly accelerates DMA operations for PPC KVM guests. significanly? Please run this through a spell checker. +The user space should expect that its handlers for these hypercalls s/The// +are not going to be called. Is user space guaranteed they will not be called? Or can it still happen? ... if user space previously registered LIOBN in KVM (via KVM_CREATE_SPAPR_TCE or similar calls). ok? There is also KVM_CREATE_SPAPR_TCE_IOMMU but it is not in the kernel yet and may never get there. +In order to enable H_PUT_TCE_INDIRECT and H_STUFF_TCE use in the guest, +the user space might have to advertise it for the guest. For example, +IBM pSeries guest starts using them if hcall-multi-tce is present in +the ibm,hypertas-functions device-tree property. This paragraph describes sPAPR. That's fine, but please document it as such. Also please check your grammar. + +Without this capability, only H_PUT_TCE is handled by the kernel and +therefore the use of H_PUT_TCE_INDIRECT and H_STUFF_TCE is not recommended +unless the capability is present as passing hypercalls to the userspace +slows operations a lot. + +Unlike other capabilities of this section, this one is always enabled. Why? Wouldn't that confuse older user space? How? Old user space won't check
Re: [PATCH 2/8] KVM: PPC: reserve a capability and ioctl numbers for realmode VFIO
On 07/10/2013 08:27 PM, Alexander Graf wrote: On 10.07.2013, at 01:35, Alexey Kardashevskiy wrote: On 07/10/2013 01:35 AM, Alexander Graf wrote: On 06/27/2013 07:02 AM, Alexey Kardashevskiy wrote: Signed-off-by: Alexey Kardashevskiya...@ozlabs.ru --- include/uapi/linux/kvm.h |2 ++ 1 file changed, 2 insertions(+) diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index 970b1f5..0865c01 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -667,6 +667,7 @@ struct kvm_ppc_smmu_info { #define KVM_CAP_PPC_RTAS 91 #define KVM_CAP_IRQ_XICS 92 #define KVM_CAP_SPAPR_MULTITCE 93 +#define KVM_CAP_SPAPR_TCE_IOMMU 94 #ifdef KVM_CAP_IRQ_ROUTING @@ -923,6 +924,7 @@ struct kvm_s390_ucas_mapping { /* Available with KVM_CAP_PPC_ALLOC_HTAB */ #define KVM_PPC_ALLOCATE_HTAB _IOWR(KVMIO, 0xa7, __u32) #define KVM_CREATE_SPAPR_TCE _IOW(KVMIO, 0xa8, struct kvm_create_spapr_tce) +#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO, 0xaf, struct kvm_create_spapr_tce_iommu) Please order them by number. Oh. Again :( We have had this discussion with Scott Wood here already. Where _exactly_ do you want me to put it? 8 lines further down. With a comment saying when it's available. Also why is it af, not ad? 0xad and 0xae are taken. Where should I have commented this? In the commit message? Or in the patch itself? Many sections, not really ordered. Thank you. They should all be ordered inside of their own categories. -- Alexey -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 6/8] KVM: PPC: Add support for multiple-TCE hcalls
On 07/10/2013 08:05 PM, Alexander Graf wrote: On 10.07.2013, at 07:00, Alexey Kardashevskiy wrote: On 07/10/2013 03:02 AM, Alexander Graf wrote: On 07/06/2013 05:07 PM, Alexey Kardashevskiy wrote: This adds real mode handlers for the H_PUT_TCE_INDIRECT and H_STUFF_TCE hypercalls for QEMU emulated devices such as IBMVIO devices or emulated PCI. These calls allow adding multiple entries (up to 512) into the TCE table in one call which saves time on transition to/from real mode. We don't mention QEMU explicitly in KVM code usually. This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs (copied from user and verified) before writing the whole list into the TCE table. This cache will be utilized more in the upcoming VFIO/IOMMU support to continue TCE list processing in the virtual mode in the case if the real mode handler failed for some reason. This adds a guest physical to host real address converter and calls the existing H_PUT_TCE handler. The converting function is going to be fully utilized by upcoming VFIO supporting patches. This also implements the KVM_CAP_PPC_MULTITCE capability, so in order to support the functionality of this patch, QEMU needs to query for this capability and set the hcall-multi-tce hypertas property only if the capability is present, otherwise there will be serious performance degradation. Same as above. But really you're only giving recommendations here. What's the point? Please describe what the benefit of this patch is, not what some other random subsystem might do with the benefits it brings. 
Signed-off-by: Paul Mackerraspau...@samba.org Signed-off-by: Alexey Kardashevskiya...@ozlabs.ru --- Changelog: 2013/07/06: * fixed number of wrong get_page()/put_page() calls 2013/06/27: * fixed clear of BUSY bit in kvmppc_lookup_pte() * H_PUT_TCE_INDIRECT does realmode_get_page() now * KVM_CAP_SPAPR_MULTITCE now depends on CONFIG_PPC_BOOK3S_64 * updated doc 2013/06/05: * fixed mistype about IBMVIO in the commit message * updated doc and moved it to another section * changed capability number 2013/05/21: * added kvm_vcpu_arch::tce_tmp * removed cleanup if put_indirect failed, instead we do not even start writing to TCE table if we cannot get TCEs from the user and they are invalid * kvmppc_emulated_h_put_tce is split to kvmppc_emulated_put_tce and kvmppc_emulated_validate_tce (for the previous item) * fixed bug with failthrough for H_IPI * removed all get_user() from real mode handlers * kvmppc_lookup_pte() added (instead of making lookup_linux_pte public) Signed-off-by: Alexey Kardashevskiya...@ozlabs.ru --- Documentation/virtual/kvm/api.txt | 25 +++ arch/powerpc/include/asm/kvm_host.h | 9 ++ arch/powerpc/include/asm/kvm_ppc.h | 16 +- arch/powerpc/kvm/book3s_64_vio.c| 154 ++- arch/powerpc/kvm/book3s_64_vio_hv.c | 260 arch/powerpc/kvm/book3s_hv.c| 41 - arch/powerpc/kvm/book3s_hv_rmhandlers.S | 6 + arch/powerpc/kvm/book3s_pr_papr.c | 37 - arch/powerpc/kvm/powerpc.c | 3 + 9 files changed, 517 insertions(+), 34 deletions(-) diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt index 6365fef..762c703 100644 --- a/Documentation/virtual/kvm/api.txt +++ b/Documentation/virtual/kvm/api.txt @@ -2362,6 +2362,31 @@ calls by the guest for that service will be passed to userspace to be handled. +4.86 KVM_CAP_PPC_MULTITCE + +Capability: KVM_CAP_PPC_MULTITCE +Architectures: ppc +Type: vm + +This capability means the kernel is capable of handling hypercalls +H_PUT_TCE_INDIRECT and H_STUFF_TCE without passing those into the user +space. 
This significanly accelerates DMA operations for PPC KVM guests. significanly? Please run this through a spell checker. +The user space should expect that its handlers for these hypercalls s/The// +are not going to be called. Is user space guaranteed they will not be called? Or can it still happen? ... if user space previously registered LIOBN in KVM (via KVM_CREATE_SPAPR_TCE or similar calls). ok? How about this? The hypercalls mentioned above may or may not be processed successfully in the kernel based fast path. If they can not be handled by the kernel, they will get passed on to user space. So user space still has to have an implementation for these despite the in kernel acceleration. --- The target audience for this documentation is user space KVM API users. Someone developing kvm tool for example. They want to know implications specific CAPs have. There is also KVM_CREATE_SPAPR_TCE_IOMMU but it is not in the kernel yet and may never get there. +In order to enable H_PUT_TCE_INDIRECT and H_STUFF_TCE use in the guest, +the user space might have to advertise it for the guest. For example, +IBM pSeries guest starts using them if hcall-multi-tce is present
Re: [PATCH 6/8] KVM: PPC: Add support for multiple-TCE hcalls
On 07/11/2013 10:51 PM, Alexander Graf wrote: On 11.07.2013, at 14:39, Benjamin Herrenschmidt wrote: On Thu, 2013-07-11 at 13:15 +0200, Alexander Graf wrote: And that's bad. Jeez, seriously. Don't argue this case. We enable new features individually unless we're 100% sure we can keep everything working. In this case an ENABLE_CAP doesn't hurt at all, because user space still needs to handle the hypercalls if it wants them anyways. But you get debugging for free for example. An ENABLE_CAP is utterly pointless. More bloat. But you seem to like it :-) I don't like bloat usually. But Alexey even had an #ifdef DEBUG in there to selectively disable in-kernel handling of multi-TCE. Not calling ENABLE_CAP would give him exactly that without ugly #ifdefs in the kernel. No, it would not give me anything. My ugly debug was to disable realmode only and still leave virtual mode on, not to disable both real and virtual modes. It is a lot easier to disable in-kernel handling in QEMU. -- Alexey -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 6/8] KVM: PPC: Add support for multiple-TCE hcalls
On 07/11/2013 10:58 PM, Benjamin Herrenschmidt wrote: On Thu, 2013-07-11 at 14:51 +0200, Alexander Graf wrote: I don't like bloat usually. But Alexey even had an #ifdef DEBUG in there to selectively disable in-kernel handling of multi-TCE. Not calling ENABLE_CAP would give him exactly that without ugly #ifdefs in the kernel. I don't see much point in disabling it... but ok, if that's a valuable feature, then shoot some VM level ENABLE_CAP (please don't iterate all VCPUs, that's gross). No use for me whatsoever as I only want to disable real mode handlers and keep virtual mode handlers enabled (sometimes, for debugging only) and this capability is not about that - I can easily just not enable it in QEMU with exactly the same effect. So please, fellas, decide whether I should iterate vcpus or add ENABLE_CAP per KVM. Thanks. -- Alexey
Re: [PATCH 8/8] KVM: PPC: Add hugepage support for IOMMU in-kernel handling
On 07/11/2013 11:41 PM, chandrashekar shastri wrote: Hi All, I compiled the latest kernel 3.10.0+ pulled from the git on top of 3.10.0-rc5+ by enabling the new Virtualization features. The compilation was successful; when I rebooted the machine it failed to boot with an error from systemd [1]: Failed to mount /dev : no such device. Is it a problem with the KVM module? Wrong thread actually, it would be better if you started a new one. And you may want to try this - http://patchwork.ozlabs.org/patch/256027/ -- Alexey
[PATCH 08/10] powerpc/iommu: rework to support realmode
TCE table handling may differ between real and virtual modes, so additional ppc_md.tce_build_rm/ppc_md.tce_free_rm/ppc_md.tce_flush_rm handlers were introduced earlier. So this adds the following: 1. support for the new ppc_md calls; 2. ability for iommu_tce_build to process multiple entries per call; 3. arch_spin_lock to protect the TCE table from races in both real and virtual modes; 4. proper TCE table protection from races with the existing IOMMU code in iommu_take_ownership/iommu_release_ownership; 5. hwaddr variable renamed to hpa as it better describes what it actually represents; 6. iommu_tce_direction is static now as it is not called from anywhere else. This will be used by upcoming real mode support of VFIO on POWER. Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru --- arch/powerpc/include/asm/iommu.h | 9 +- arch/powerpc/kernel/iommu.c | 197 ++- 2 files changed, 135 insertions(+), 71 deletions(-) diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h index c34656a..b01bde1 100644 --- a/arch/powerpc/include/asm/iommu.h +++ b/arch/powerpc/include/asm/iommu.h @@ -78,6 +78,7 @@ struct iommu_table { unsigned long *it_map; /* A simple allocation bitmap for now */ #ifdef CONFIG_IOMMU_API struct iommu_group *it_group; + arch_spinlock_t it_rm_lock; #endif }; @@ -152,9 +153,9 @@ extern int iommu_tce_clear_param_check(struct iommu_table *tbl, extern int iommu_tce_put_param_check(struct iommu_table *tbl, unsigned long ioba, unsigned long tce); extern int iommu_tce_build(struct iommu_table *tbl, unsigned long entry, - unsigned long hwaddr, enum dma_data_direction direction); -extern unsigned long iommu_clear_tce(struct iommu_table *tbl, - unsigned long entry); + unsigned long *hpas, unsigned long npages, bool rm); +extern int iommu_free_tces(struct iommu_table *tbl, unsigned long entry, + unsigned long npages, bool rm); extern int iommu_clear_tces_and_put_pages(struct iommu_table *tbl, unsigned long entry, unsigned long pages); extern int 
iommu_put_tce_user_mode(struct iommu_table *tbl, @@ -164,7 +165,5 @@ extern void iommu_flush_tce(struct iommu_table *tbl); extern int iommu_take_ownership(struct iommu_table *tbl); extern void iommu_release_ownership(struct iommu_table *tbl); -extern enum dma_data_direction iommu_tce_direction(unsigned long tce); - #endif /* __KERNEL__ */ #endif /* _ASM_IOMMU_H */ diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c index b20ff17..0f56cac 100644 --- a/arch/powerpc/kernel/iommu.c +++ b/arch/powerpc/kernel/iommu.c @@ -903,7 +903,7 @@ void iommu_register_group(struct iommu_table *tbl, kfree(name); } -enum dma_data_direction iommu_tce_direction(unsigned long tce) +static enum dma_data_direction iommu_tce_direction(unsigned long tce) { if ((tce TCE_PCI_READ) (tce TCE_PCI_WRITE)) return DMA_BIDIRECTIONAL; @@ -914,7 +914,6 @@ enum dma_data_direction iommu_tce_direction(unsigned long tce) else return DMA_NONE; } -EXPORT_SYMBOL_GPL(iommu_tce_direction); void iommu_flush_tce(struct iommu_table *tbl) { @@ -972,73 +971,116 @@ int iommu_tce_put_param_check(struct iommu_table *tbl, } EXPORT_SYMBOL_GPL(iommu_tce_put_param_check); -unsigned long iommu_clear_tce(struct iommu_table *tbl, unsigned long entry) -{ - unsigned long oldtce; - struct iommu_pool *pool = get_pool(tbl, entry); - - spin_lock((pool-lock)); - - oldtce = ppc_md.tce_get(tbl, entry); - if (oldtce (TCE_PCI_WRITE | TCE_PCI_READ)) - ppc_md.tce_free(tbl, entry, 1); - else - oldtce = 0; - - spin_unlock((pool-lock)); - - return oldtce; -} -EXPORT_SYMBOL_GPL(iommu_clear_tce); - int iommu_clear_tces_and_put_pages(struct iommu_table *tbl, unsigned long entry, unsigned long pages) { - unsigned long oldtce; - struct page *page; - - for ( ; pages; --pages, ++entry) { - oldtce = iommu_clear_tce(tbl, entry); - if (!oldtce) - continue; - - page = pfn_to_page(oldtce PAGE_SHIFT); - WARN_ON(!page); - if (page) { - if (oldtce TCE_PCI_WRITE) - SetPageDirty(page); - put_page(page); - } - } - - return 0; + return 
iommu_free_tces(tbl, entry, pages, false); } EXPORT_SYMBOL_GPL(iommu_clear_tces_and_put_pages); -/* - * hwaddr is a kernel virtual address here (0xc... bazillion), - * tce_build converts it to a physical address. - */ +int iommu_free_tces(struct iommu_table *tbl, unsigned long entry, + unsigned long npages, bool rm) +{ + int i, ret = 0, to_free = 0; + + if (rm !ppc_md.tce_free_rm) + return
Re: [PATCH 00/10 v6] KVM: PPC: IOMMU in-kernel handling
On 07/16/2013 10:53 AM, Alexey Kardashevskiy wrote: The changes are: 1. rebased on v3.11-rc1 so the capability numbers changed again 2. fixed multiple comments from maintainers 3. KVM: PPC: Add support for IOMMU in-kernel handling is split into 2 patches, the new one is powerpc/iommu: rework to support realmode. 4. IOMMU_API is now always enabled for KVM_BOOK3S_64. More details in the individual patch comments. Depends on hashtable: add hash_for_each_possible_rcu_notrace(), posted a while ago. Alexey Kardashevskiy (10): KVM: PPC: reserve a capability number for multitce support KVM: PPC: reserve a capability and ioctl numbers for realmode VFIO Alex, could you please pull these 2 patches or tell me what is wrong with them? Having them sooner in the kernel would let me ask for a headers update for QEMU and then I would try pushing multiple TCE and VFIO support in QEMU. Thanks. -- Alexey
Re: [PATCH 04/10] powerpc: Prepare to support kernel handling of IOMMU map/unmap
Ping, anyone, please? Ben needs an ack from one of the MM people before proceeding with this patch. Thanks! On 07/16/2013 10:53 AM, Alexey Kardashevskiy wrote: The current VFIO-on-POWER implementation supports only user mode driven mapping, i.e. QEMU is sending requests to map/unmap pages. However this approach is really slow, so we want to move that to KVM. Since H_PUT_TCE can be extremely performance sensitive (especially with network adapters where each packet needs to be mapped/unmapped) we chose to implement that as a fast hypercall directly in real mode (processor still in the guest context but MMU off). To be able to do that, we need to provide some facilities to access the struct page count within that real mode environment as things like the sparsemem vmemmap mappings aren't accessible. This adds an API to increment/decrement the page counter, as the get_user_pages API used for user mode mapping does not work in real mode. CONFIG_SPARSEMEM_VMEMMAP and CONFIG_FLATMEM are supported. Cc: linux...@kvack.org Reviewed-by: Paul Mackerras pau...@samba.org Signed-off-by: Paul Mackerras pau...@samba.org Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru --- Changes: 2013/07/10: * adjusted comment (removed sentence about virtual mode) * get_page_unless_zero replaced with atomic_inc_not_zero to minimize effect of a possible get_page_unless_zero() rework (if it ever happens). 2013/06/27: * realmode_get_page() fixed to use get_page_unless_zero(). If it fails, the call will be passed from real to virtual mode and safely handled. * added comment to PageCompound() in include/linux/page-flags.h. 
2013/05/20: * PageTail() is replaced by PageCompound() in order to have the same checks for whether the page is huge in realmode_get_page() and realmode_put_page() Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru --- arch/powerpc/include/asm/pgtable-ppc64.h | 4 ++ arch/powerpc/mm/init_64.c| 76 +++- include/linux/page-flags.h | 4 +- 3 files changed, 82 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h b/arch/powerpc/include/asm/pgtable-ppc64.h index 46db094..aa7b169 100644 --- a/arch/powerpc/include/asm/pgtable-ppc64.h +++ b/arch/powerpc/include/asm/pgtable-ppc64.h @@ -394,6 +394,10 @@ static inline void mark_hpte_slot_valid(unsigned char *hpte_slot_array, hpte_slot_array[index] = hidx 4 | 0x1 3; } +struct page *realmode_pfn_to_page(unsigned long pfn); +int realmode_get_page(struct page *page); +int realmode_put_page(struct page *page); + static inline char *get_hpte_slot_array(pmd_t *pmdp) { /* diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c index d0cd9e4..dcbb806 100644 --- a/arch/powerpc/mm/init_64.c +++ b/arch/powerpc/mm/init_64.c @@ -300,5 +300,79 @@ void vmemmap_free(unsigned long start, unsigned long end) { } -#endif /* CONFIG_SPARSEMEM_VMEMMAP */ +/* + * We do not have access to the sparsemem vmemmap, so we fallback to + * walking the list of sparsemem blocks which we already maintain for + * the sake of crashdump. In the long run, we might want to maintain + * a tree if performance of that linear walk becomes a problem. + * + * Any of realmode_ functions can fail due to: + * 1) As real sparsemem blocks do not lay in RAM continously (they + * are in virtual address space which is not available in the real mode), + * the requested page struct can be split between blocks so get_page/put_page + * may fail. + * 2) When huge pages are used, the get_page/put_page API will fail + * in real mode as the linked addresses in the page struct are virtual + * too. 
+ */ +struct page *realmode_pfn_to_page(unsigned long pfn) +{ + struct vmemmap_backing *vmem_back; + struct page *page; + unsigned long page_size = 1 mmu_psize_defs[mmu_vmemmap_psize].shift; + unsigned long pg_va = (unsigned long) pfn_to_page(pfn); + for (vmem_back = vmemmap_list; vmem_back; vmem_back = vmem_back-list) { + if (pg_va vmem_back-virt_addr) + continue; + + /* Check that page struct is not split between real pages */ + if ((pg_va + sizeof(struct page)) + (vmem_back-virt_addr + page_size)) + return NULL; + + page = (struct page *) (vmem_back-phys + pg_va - + vmem_back-virt_addr); + return page; + } + + return NULL; +} +EXPORT_SYMBOL_GPL(realmode_pfn_to_page); + +#elif defined(CONFIG_FLATMEM) + +struct page *realmode_pfn_to_page(unsigned long pfn) +{ + struct page *page = pfn_to_page(pfn); + return page; +} +EXPORT_SYMBOL_GPL(realmode_pfn_to_page); + +#endif /* CONFIG_SPARSEMEM_VMEMMAP/CONFIG_FLATMEM */ + +#if defined(CONFIG_SPARSEMEM_VMEMMAP) || defined(CONFIG_FLATMEM) +int realmode_get_page
Re: [PATCH 03/10] vfio: add external user support
On 07/23/2013 12:23 PM, Alex Williamson wrote: On Tue, 2013-07-16 at 10:53 +1000, Alexey Kardashevskiy wrote: VFIO is designed to be used via ioctls on file descriptors returned by VFIO. However in some situations support for an external user is required. The first user is KVM on PPC64 (SPAPR TCE protocol) which is going to use the existing VFIO groups for exclusive access in real/virtual mode on a host to avoid passing map/unmap requests to the user space which would make things pretty slow. The protocol includes: 1. do normal VFIO init operation: - opening a new container; - attaching group(s) to it; - setting an IOMMU driver for a container. When IOMMU is set for a container, all groups in it are considered ready to use by an external user. 2. User space passes a group fd to an external user. The external user calls vfio_group_get_external_user() to verify that: - the group is initialized; - IOMMU is set for it. If both checks pass, vfio_group_get_external_user() increments the container user counter to prevent the VFIO group from being disposed of before KVM exits. 3. The external user calls vfio_external_user_iommu_id() to obtain the IOMMU ID. PPC64 KVM uses it to link the logical bus number (LIOBN) with the IOMMU ID. 4. When the external KVM finishes, it calls vfio_group_put_external_user() to release the VFIO group. This call decrements the container user counter. Everything gets released. The vfio: Limit group opens patch is also required for consistency. Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru This looks fine to me. Is the plan to add this through the ppc tree again? Thanks, Nope, better to add this through your tree. And faster for sure :) Thanks! -- Alexey
[PATCH 00/10 v7] KVM: PPC: IOMMU in-kernel handling
This accelerates VFIO DMA operations on POWER by moving them into kernel. The changes in this series are: 1. rebased on v3.11-rc3. 2. VFIO external user API will go through VFIO tree so it is excluded from this series. 3. As nobody ever reacted on hashtable: add hash_for_each_possible_rcu_notrace(), Ben suggested to push it via his tree so I included it to the series. 4. realmode_(get|put)_page is reworked. More details in the individual patch comments. Alexey Kardashevskiy (10): hashtable: add hash_for_each_possible_rcu_notrace() KVM: PPC: reserve a capability number for multitce support KVM: PPC: reserve a capability and ioctl numbers for realmode VFIO powerpc: Prepare to support kernel handling of IOMMU map/unmap powerpc: add real mode support for dma operations on powernv KVM: PPC: enable IOMMU_API for KVM_BOOK3S_64 permanently KVM: PPC: Add support for multiple-TCE hcalls powerpc/iommu: rework to support realmode KVM: PPC: Add support for IOMMU in-kernel handling KVM: PPC: Add hugepage support for IOMMU in-kernel handling Documentation/virtual/kvm/api.txt | 52 +++ arch/powerpc/include/asm/iommu.h | 9 +- arch/powerpc/include/asm/kvm_host.h | 37 +++ arch/powerpc/include/asm/kvm_ppc.h| 18 +- arch/powerpc/include/asm/machdep.h| 12 + arch/powerpc/include/asm/pgtable-ppc64.h | 2 + arch/powerpc/include/uapi/asm/kvm.h | 7 + arch/powerpc/kernel/iommu.c | 202 +++ arch/powerpc/kvm/Kconfig | 1 + arch/powerpc/kvm/book3s_64_vio.c | 533 +- arch/powerpc/kvm/book3s_64_vio_hv.c | 405 +-- arch/powerpc/kvm/book3s_hv.c | 41 ++- arch/powerpc/kvm/book3s_hv_rmhandlers.S | 8 +- arch/powerpc/kvm/book3s_pr_papr.c | 35 ++ arch/powerpc/kvm/powerpc.c| 15 + arch/powerpc/mm/init_64.c | 50 ++- arch/powerpc/platforms/powernv/pci-ioda.c | 47 ++- arch/powerpc/platforms/powernv/pci.c | 38 ++- arch/powerpc/platforms/powernv/pci.h | 3 +- include/linux/hashtable.h | 15 + include/linux/mm.h| 14 + include/linux/page-flags.h| 4 +- include/uapi/linux/kvm.h | 5 + 23 files changed, 1430 insertions(+), 123 
deletions(-) -- 1.8.3.2
[PATCH 10/10] KVM: PPC: Add hugepage support for IOMMU in-kernel handling
This adds special support for huge pages (16MB) in real mode. The reference counting cannot be easily done for such pages in real mode (when MMU is off) so we added a hash table of huge pages. It is populated in virtual mode and get_page is called just once per huge page. Real mode handlers check if the requested page is in the hash table; if so, no reference counting is done, otherwise an exit to virtual mode happens. The hash table is released at KVM exit. At the moment the fastest card available for tests uses up to 9 huge pages so walking through this hash table does not cost much. However this can change and we may want to optimize this. Signed-off-by: Paul Mackerras pau...@samba.org Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru --- Changes: 2013/07/12: * removed multiple #ifdef IOMMU_API as IOMMU_API is always enabled for KVM_BOOK3S_64 2013/06/27: * list of huge pages replaced with a hashtable for better performance * spinlock removed from real mode and only protects insertion of new huge pages descriptors into the hashtable 2013/06/05: * fixed compile error when CONFIG_IOMMU_API=n 2013/05/20: * the real mode handler now searches for a huge page by gpa (used to be pte) * the virtual mode handler prints a warning if it is called twice for the same huge page as the real mode handler is expected to fail just once - when a huge page is not in the list yet. * the huge page is refcounted twice - when added to the hugepage list and when used in the virtual mode hcall handler (can be optimized but it will make the patch less nice). 
Conflicts: arch/powerpc/kernel/iommu.c Conflicts: arch/powerpc/kernel/iommu.c Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru --- arch/powerpc/include/asm/kvm_host.h | 25 arch/powerpc/kernel/iommu.c | 6 +- arch/powerpc/kvm/book3s_64_vio.c| 121 ++-- arch/powerpc/kvm/book3s_64_vio_hv.c | 32 +- 4 files changed, 175 insertions(+), 9 deletions(-) diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h index 4eeaf7d..c57b25a 100644 --- a/arch/powerpc/include/asm/kvm_host.h +++ b/arch/powerpc/include/asm/kvm_host.h @@ -31,6 +31,7 @@ #include linux/list.h #include linux/atomic.h #include linux/tracepoint.h +#include linux/hashtable.h #include asm/kvm_asm.h #include asm/processor.h #include asm/page.h @@ -183,9 +184,33 @@ struct kvmppc_spapr_tce_table { u32 window_size; struct iommu_group *grp;/* used for IOMMU groups */ struct vfio_group *vfio_grp;/* used for IOMMU groups */ + DECLARE_HASHTABLE(hash_tab, ilog2(64)); /* used for IOMMU groups */ + spinlock_t hugepages_write_lock;/* used for IOMMU groups */ struct page *pages[0]; }; +/* + * The KVM guest can be backed with 16MB pages. + * In this case, we cannot do page counting from the real mode + * as the compound pages are used - they are linked in a list + * with pointers as virtual addresses which are inaccessible + * in real mode. + * + * The code below keeps a 16MB pages list and uses page struct + * in real mode if it is already locked in RAM and inserted into + * the list or switches to the virtual mode where it can be + * handled in a usual manner. 
+ */ +#define KVMPPC_SPAPR_HUGEPAGE_HASH(gpa)hash_32(gpa 24, 32) + +struct kvmppc_spapr_iommu_hugepage { + struct hlist_node hash_node; + unsigned long gpa; /* Guest physical address */ + unsigned long hpa; /* Host physical address */ + struct page *page; /* page struct of the very first subpage */ + unsigned long size; /* Huge page size (always 16MB at the moment) */ +}; + struct kvmppc_linear_info { void*base_virt; unsigned longbase_pfn; diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c index 8314c80..e4a8135 100644 --- a/arch/powerpc/kernel/iommu.c +++ b/arch/powerpc/kernel/iommu.c @@ -999,7 +999,8 @@ int iommu_free_tces(struct iommu_table *tbl, unsigned long entry, if (!pg) { ret = -EAGAIN; } else if (PageCompound(pg)) { - ret = -EAGAIN; + /* Hugepages will be released at KVM exit */ + ret = 0; } else { if (oldtce TCE_PCI_WRITE) SetPageDirty(pg); @@ -1010,6 +1011,9 @@ int iommu_free_tces(struct iommu_table *tbl, unsigned long entry, struct page *pg = pfn_to_page(oldtce PAGE_SHIFT); if (!pg) { ret = -EAGAIN; + } else if (PageCompound(pg)) { + /* Hugepages will be released at KVM exit */ + ret = 0; } else
[PATCH 09/10] KVM: PPC: Add support for IOMMU in-kernel handling
This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT and H_STUFF_TCE requests targeted at an IOMMU TCE table without passing them to user space, which saves time on switching to user space and back. Both real and virtual modes are supported. The kernel tries to handle a TCE request in real mode; if that fails, it passes the request to virtual mode to complete the operation. If the virtual mode handler fails, the request is passed to user space. The first user of this is VFIO on POWER. The external user API in VFIO is required for this patch. The patch adds a new KVM_CAP_SPAPR_TCE_IOMMU ioctl to associate a virtual PCI bus number (LIOBN) with a VFIO IOMMU group fd and enable in-kernel handling of map/unmap requests. Tests show that this patch increases transmission speed from 220MB/s to 750..1020MB/s on a 10Gb network (Chelsio CXGB3 10Gb ethernet card). Signed-off-by: Paul Mackerras pau...@samba.org Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru --- Changes: 2013/07/11: * removed multiple #ifdef IOMMU_API as IOMMU_API is always enabled for KVM_BOOK3S_64 * kvmppc_gpa_to_hva_and_get also returns host phys address. Not much sense for this here but the next patch for hugepages support will use it more. 2013/07/06: * added realmode arch_spin_lock to protect TCE table from races in real and virtual modes * POWERPC IOMMU API is changed to support real mode * iommu_take_ownership and iommu_release_ownership are protected by iommu_table's locks * VFIO external user API use rewritten * multiple small fixes 2013/06/27: * tce_list page is referenced now in order to protect it from accidental invalidation during H_PUT_TCE_INDIRECT execution * added use of the external user VFIO API 2013/06/05: * changed capability number * changed ioctl number * updated the doc article number 2013/05/20: * removed get_user() from real mode handlers * kvm_vcpu_arch::tce_tmp usage extended. 
Now real mode handler puts there translated TCEs, tries realmode_get_page() on those and if it fails, it passes control over the virtual mode handler which tries to finish the request handling * kvmppc_lookup_pte() now does realmode_get_page() protected by BUSY bit on a page * The only reason to pass the request to user mode now is when the user mode did not register TCE table in the kernel, in all other cases the virtual mode handler is expected to do the job Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru --- Documentation/virtual/kvm/api.txt | 26 arch/powerpc/include/asm/kvm_host.h | 3 + arch/powerpc/include/asm/kvm_ppc.h | 2 + arch/powerpc/include/uapi/asm/kvm.h | 7 + arch/powerpc/kvm/book3s_64_vio.c| 296 +++- arch/powerpc/kvm/book3s_64_vio_hv.c | 122 +++ arch/powerpc/kvm/powerpc.c | 12 ++ 7 files changed, 463 insertions(+), 5 deletions(-) diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt index 1c8942a..6ae65bd 100644 --- a/Documentation/virtual/kvm/api.txt +++ b/Documentation/virtual/kvm/api.txt @@ -2408,6 +2408,32 @@ an implementation for these despite the in kernel acceleration. This capability is always enabled. +4.87 KVM_CREATE_SPAPR_TCE_IOMMU + +Capability: KVM_CAP_SPAPR_TCE_IOMMU +Architectures: powerpc +Type: vm ioctl +Parameters: struct kvm_create_spapr_tce_iommu (in) +Returns: 0 on success, -1 on error + +struct kvm_create_spapr_tce_iommu { + __u64 liobn; + __u32 fd; + __u32 flags; +}; + +This creates a link between IOMMU group and a hardware TCE (translation +control entry) table. This link lets the host kernel know what IOMMU +group (i.e. TCE table) to use for the LIOBN number passed with +H_PUT_TCE, H_PUT_TCE_INDIRECT, H_STUFF_TCE hypercalls. + +User space passes VFIO group fd. Using the external user VFIO API, +KVM tries gets IOMMU id from passed fd. If succeeded, acceleration +turns on. If failed, map/unmap requests are passed to user space. + +No flag is supported at the moment. + + 5. 
The kvm_run structure diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h index b8fe3de..4eeaf7d 100644 --- a/arch/powerpc/include/asm/kvm_host.h +++ b/arch/powerpc/include/asm/kvm_host.h @@ -181,6 +181,8 @@ struct kvmppc_spapr_tce_table { struct kvm *kvm; u64 liobn; u32 window_size; + struct iommu_group *grp;/* used for IOMMU groups */ + struct vfio_group *vfio_grp;/* used for IOMMU groups */ struct page *pages[0]; }; @@ -612,6 +614,7 @@ struct kvm_vcpu_arch { u64 busy_preempt; unsigned long *tce_tmp_hpas;/* TCE cache for TCE_PUT_INDIRECT hcall */ + unsigned long tce_tmp_num; /* Number of handled TCEs in the cache */ enum { TCERM_NONE, TCERM_GETPAGE, diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h index 0ce4691..297cab5 100644
[PATCH 06/10] KVM: PPC: enable IOMMU_API for KVM_BOOK3S_64 permanently
It does not make much sense to have KVM in book3s-64bit and not to have IOMMU bits for PCI pass through support as it costs little and allows VFIO to function on book3s-kvm. Having IOMMU_API always enabled makes it unnecessary to have a lot of #ifdef IOMMU_API in arch/powerpc/kvm/book3s_64_vio*. With those ifdefs we could have only user space emulated devices accelerated (but not VFIO) which does not seem very useful. Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru --- arch/powerpc/kvm/Kconfig | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig index c55c538..3b2b761 100644 --- a/arch/powerpc/kvm/Kconfig +++ b/arch/powerpc/kvm/Kconfig @@ -59,6 +59,7 @@ config KVM_BOOK3S_64 depends on PPC_BOOK3S_64 select KVM_BOOK3S_64_HANDLER select KVM + select SPAPR_TCE_IOMMU ---help--- Support running unmodified book3s_64 and book3s_32 guest kernels in virtual machines on book3s_64 host processors. -- 1.8.3.2
[PATCH 08/10] powerpc/iommu: rework to support realmode
TCE table handling may differ between real and virtual modes, so additional ppc_md.tce_build_rm/ppc_md.tce_free_rm/ppc_md.tce_flush_rm handlers were introduced earlier. So this adds the following: 1. support for the new ppc_md calls; 2. ability for iommu_tce_build to process multiple entries per call; 3. arch_spin_lock to protect the TCE table from races in both real and virtual modes; 4. proper TCE table protection from races with the existing IOMMU code in iommu_take_ownership/iommu_release_ownership; 5. hwaddr variable renamed to hpa as it better describes what it actually represents; 6. iommu_tce_direction is static now as it is not called from anywhere else. This will be used by upcoming real mode support of VFIO on POWER. Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru --- arch/powerpc/include/asm/iommu.h | 9 +- arch/powerpc/kernel/iommu.c | 198 ++- 2 files changed, 136 insertions(+), 71 deletions(-) diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h index c34656a..b01bde1 100644 --- a/arch/powerpc/include/asm/iommu.h +++ b/arch/powerpc/include/asm/iommu.h @@ -78,6 +78,7 @@ struct iommu_table { unsigned long *it_map; /* A simple allocation bitmap for now */ #ifdef CONFIG_IOMMU_API struct iommu_group *it_group; + arch_spinlock_t it_rm_lock; #endif }; @@ -152,9 +153,9 @@ extern int iommu_tce_clear_param_check(struct iommu_table *tbl, extern int iommu_tce_put_param_check(struct iommu_table *tbl, unsigned long ioba, unsigned long tce); extern int iommu_tce_build(struct iommu_table *tbl, unsigned long entry, - unsigned long hwaddr, enum dma_data_direction direction); -extern unsigned long iommu_clear_tce(struct iommu_table *tbl, - unsigned long entry); + unsigned long *hpas, unsigned long npages, bool rm); +extern int iommu_free_tces(struct iommu_table *tbl, unsigned long entry, + unsigned long npages, bool rm); extern int iommu_clear_tces_and_put_pages(struct iommu_table *tbl, unsigned long entry, unsigned long pages); extern int 
iommu_put_tce_user_mode(struct iommu_table *tbl, @@ -164,7 +165,5 @@ extern void iommu_flush_tce(struct iommu_table *tbl); extern int iommu_take_ownership(struct iommu_table *tbl); extern void iommu_release_ownership(struct iommu_table *tbl); -extern enum dma_data_direction iommu_tce_direction(unsigned long tce); - #endif /* __KERNEL__ */ #endif /* _ASM_IOMMU_H */ diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c index b20ff17..8314c80 100644 --- a/arch/powerpc/kernel/iommu.c +++ b/arch/powerpc/kernel/iommu.c @@ -903,7 +903,7 @@ void iommu_register_group(struct iommu_table *tbl, kfree(name); } -enum dma_data_direction iommu_tce_direction(unsigned long tce) +static enum dma_data_direction iommu_tce_direction(unsigned long tce) { if ((tce TCE_PCI_READ) (tce TCE_PCI_WRITE)) return DMA_BIDIRECTIONAL; @@ -914,7 +914,6 @@ enum dma_data_direction iommu_tce_direction(unsigned long tce) else return DMA_NONE; } -EXPORT_SYMBOL_GPL(iommu_tce_direction); void iommu_flush_tce(struct iommu_table *tbl) { @@ -972,73 +971,117 @@ int iommu_tce_put_param_check(struct iommu_table *tbl, } EXPORT_SYMBOL_GPL(iommu_tce_put_param_check); -unsigned long iommu_clear_tce(struct iommu_table *tbl, unsigned long entry) -{ - unsigned long oldtce; - struct iommu_pool *pool = get_pool(tbl, entry); - - spin_lock((pool-lock)); - - oldtce = ppc_md.tce_get(tbl, entry); - if (oldtce (TCE_PCI_WRITE | TCE_PCI_READ)) - ppc_md.tce_free(tbl, entry, 1); - else - oldtce = 0; - - spin_unlock((pool-lock)); - - return oldtce; -} -EXPORT_SYMBOL_GPL(iommu_clear_tce); - int iommu_clear_tces_and_put_pages(struct iommu_table *tbl, unsigned long entry, unsigned long pages) { - unsigned long oldtce; - struct page *page; - - for ( ; pages; --pages, ++entry) { - oldtce = iommu_clear_tce(tbl, entry); - if (!oldtce) - continue; - - page = pfn_to_page(oldtce PAGE_SHIFT); - WARN_ON(!page); - if (page) { - if (oldtce TCE_PCI_WRITE) - SetPageDirty(page); - put_page(page); - } - } - - return 0; + return 
iommu_free_tces(tbl, entry, pages, false); } EXPORT_SYMBOL_GPL(iommu_clear_tces_and_put_pages); -/* - * hwaddr is a kernel virtual address here (0xc... bazillion), - * tce_build converts it to a physical address. - */ +int iommu_free_tces(struct iommu_table *tbl, unsigned long entry, + unsigned long npages, bool rm) +{ + int i, ret = 0, to_free = 0; + + if (rm !ppc_md.tce_free_rm) + return
[PATCH 05/10] powerpc: add real mode support for dma operations on powernv
The existing TCE machine calls (tce_build and tce_free) only support virtual mode as they call __raw_writeq for TCE invalidation, which fails in real mode.

This introduces tce_build_rm and tce_free_rm real mode versions which do mostly the same but use the Store Doubleword Caching Inhibited Indexed (stdcix) instruction for TCE invalidation.

This new feature is going to be utilized by real mode support of VFIO.

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
---
Changes:
2013/11/07:
* added comment why stdcix cannot be used in virtual mode

2013/08/07:
* tested on p7ioc and fixed a bug with realmode addresses

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
---
 arch/powerpc/include/asm/machdep.h        | 12
 arch/powerpc/platforms/powernv/pci-ioda.c | 47 +++
 arch/powerpc/platforms/powernv/pci.c      | 38 +
 arch/powerpc/platforms/powernv/pci.h      |  3 +-
 4 files changed, 81 insertions(+), 19 deletions(-)

diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h
index 8b48090..07dd3b1 100644
--- a/arch/powerpc/include/asm/machdep.h
+++ b/arch/powerpc/include/asm/machdep.h
@@ -78,6 +78,18 @@ struct machdep_calls {
 				    long index);
 	void		(*tce_flush)(struct iommu_table *tbl);

+	/* _rm versions are for real mode use only */
+	int		(*tce_build_rm)(struct iommu_table *tbl,
+				    long index,
+				    long npages,
+				    unsigned long uaddr,
+				    enum dma_data_direction direction,
+				    struct dma_attrs *attrs);
+	void		(*tce_free_rm)(struct iommu_table *tbl,
+				    long index,
+				    long npages);
+	void		(*tce_flush_rm)(struct iommu_table *tbl);
+
 	void __iomem *	(*ioremap)(phys_addr_t addr, unsigned long size,
 				   unsigned long flags, void *caller);
 	void		(*iounmap)(volatile void __iomem *token);
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index d8140b1..5815f1d 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -70,6 +70,16 @@ define_pe_printk_level(pe_err, KERN_ERR);
 define_pe_printk_level(pe_warn, KERN_WARNING);
 define_pe_printk_level(pe_info, KERN_INFO);

+/*
+ * stdcix is only supposed to be used in hypervisor real mode as per
+ * the architecture spec
+ */
+static inline void __raw_rm_writeq(u64 val, volatile void __iomem *paddr)
+{
+	__asm__ __volatile__("stdcix %0,0,%1"
+		: : "r" (val), "r" (paddr) : "memory");
+}
+
 static int pnv_ioda_alloc_pe(struct pnv_phb *phb)
 {
 	unsigned long pe;
@@ -454,10 +464,13 @@ static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe, struct pci_bus *bus)
 	}
 }

-static void pnv_pci_ioda1_tce_invalidate(struct iommu_table *tbl,
-					 u64 *startp, u64 *endp)
+static void pnv_pci_ioda1_tce_invalidate(struct pnv_ioda_pe *pe,
+					 struct iommu_table *tbl,
+					 u64 *startp, u64 *endp, bool rm)
 {
-	u64 __iomem *invalidate = (u64 __iomem *)tbl->it_index;
+	u64 __iomem *invalidate = rm ?
+		(u64 __iomem *)pe->tce_inval_reg_phys :
+		(u64 __iomem *)tbl->it_index;
 	unsigned long start, end, inc;

 	start = __pa(startp);
@@ -484,7 +497,10 @@ static void pnv_pci_ioda1_tce_invalidate(struct iommu_table *tbl,
 	mb(); /* Ensure above stores are visible */
 	while (start <= end) {
-		__raw_writeq(start, invalidate);
+		if (rm)
+			__raw_rm_writeq(start, invalidate);
+		else
+			__raw_writeq(start, invalidate);
 		start += inc;
 	}
@@ -496,10 +512,12 @@ static void pnv_pci_ioda1_tce_invalidate(struct iommu_table *tbl,

 static void pnv_pci_ioda2_tce_invalidate(struct pnv_ioda_pe *pe,
 					 struct iommu_table *tbl,
-					 u64 *startp, u64 *endp)
+					 u64 *startp, u64 *endp, bool rm)
 {
 	unsigned long start, end, inc;
-	u64 __iomem *invalidate = (u64 __iomem *)tbl->it_index;
+	u64 __iomem *invalidate = rm ?
+		(u64 __iomem *)pe->tce_inval_reg_phys :
+		(u64 __iomem *)tbl->it_index;

 	/* We'll invalidate DMA address in PE scope */
 	start = 0x2ul << 60;
@@ -515,22 +533,25 @@ static void pnv_pci_ioda2_tce_invalidate(struct pnv_ioda_pe *pe,
 	mb();
 	while (start <= end) {
-		__raw_writeq(start, invalidate);
+		if (rm
[PATCH 07/10] KVM: PPC: Add support for multiple-TCE hcalls
This adds real mode handlers for the H_PUT_TCE_INDIRECT and H_STUFF_TCE hypercalls for user space emulated devices such as IBMVIO devices or emulated PCI. These calls allow adding multiple entries (up to 512) into the TCE table in one call, which saves time on transitions to/from real mode.

This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs (copied from user and verified) before writing the whole list into the TCE table. This cache will be utilized more in the upcoming VFIO/IOMMU support to continue TCE list processing in virtual mode in case the real mode handler fails for some reason.

This adds a function to convert a guest physical address to a host virtual address in order to parse a TCE list from H_PUT_TCE_INDIRECT.

This also implements the KVM_CAP_PPC_MULTITCE capability. When present, the hypercalls mentioned above may or may not be processed successfully in the kernel based fast path. If they cannot be handled by the kernel, they will get passed on to user space. So user space still has to have an implementation for these despite the in-kernel acceleration.
Signed-off-by: Paul Mackerras pau...@samba.org
Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
---
Changelog:
2013/08/01 (v7):
* realmode_get_page/realmode_put_page use was replaced with get_page_unless_zero/put_page_unless_one

2013/07/11:
* addressed many, many comments from maintainers

2013/07/06:
* fixed number of wrong get_page()/put_page() calls

2013/06/27:
* fixed clear of BUSY bit in kvmppc_lookup_pte()
* H_PUT_TCE_INDIRECT does realmode_get_page() now
* KVM_CAP_SPAPR_MULTITCE now depends on CONFIG_PPC_BOOK3S_64
* updated doc

2013/06/05:
* fixed mistype about IBMVIO in the commit message
* updated doc and moved it to another section
* changed capability number

2013/05/21:
* added kvm_vcpu_arch::tce_tmp
* removed cleanup if put_indirect failed, instead we do not even start writing to TCE table if we cannot get TCEs from the user and they are invalid
* kvmppc_emulated_h_put_tce is split to kvmppc_emulated_put_tce and kvmppc_emulated_validate_tce (for the previous item)
* fixed bug with fallthrough for H_IPI
* removed all get_user() from real mode handlers
* kvmppc_lookup_pte() added (instead of making lookup_linux_pte public)

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
---
 Documentation/virtual/kvm/api.txt       |  26
 arch/powerpc/include/asm/kvm_host.h     |   9 ++
 arch/powerpc/include/asm/kvm_ppc.h      |  16 +-
 arch/powerpc/kvm/book3s_64_vio.c        | 132 +++-
 arch/powerpc/kvm/book3s_64_vio_hv.c     | 267
 arch/powerpc/kvm/book3s_hv.c            |  41 -
 arch/powerpc/kvm/book3s_hv_rmhandlers.S |   8 +-
 arch/powerpc/kvm/book3s_pr_papr.c       |  35 +
 arch/powerpc/kvm/powerpc.c              |   3 +
 9 files changed, 503 insertions(+), 34 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index ef925ea..1c8942a 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -2382,6 +2382,32 @@ calls by the guest for that service will be passed to userspace to be handled.

+4.86 KVM_CAP_PPC_MULTITCE
+
+Capability: KVM_CAP_PPC_MULTITCE
+Architectures: ppc
+Type: vm
+
+This capability means the kernel is capable of handling hypercalls
+H_PUT_TCE_INDIRECT and H_STUFF_TCE without passing those into the user
+space. This significantly accelerates DMA operations for PPC KVM guests.
+User space should expect that its handlers for these hypercalls
+are not going to be called if user space previously registered LIOBN
+in KVM (via KVM_CREATE_SPAPR_TCE or similar calls).
+
+In order to enable H_PUT_TCE_INDIRECT and H_STUFF_TCE use in the guest,
+user space might have to advertise it for the guest. For example,
+IBM pSeries (sPAPR) guest starts using them if "hcall-multi-tce" is
+present in the "ibm,hypertas-functions" device-tree property.
+
+The hypercalls mentioned above may or may not be processed successfully
+in the kernel based fast path. If they can not be handled by the kernel,
+they will get passed on to user space. So user space still has to have
+an implementation for these despite the in kernel acceleration.
+
+This capability is always enabled.
+
+
 5. The kvm_run structure
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index af326cd..b8fe3de 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -30,6 +30,7 @@
 #include <linux/kvm_para.h>
 #include <linux/list.h>
 #include <linux/atomic.h>
+#include <linux/tracepoint.h>
 #include <asm/kvm_asm.h>
 #include <asm/processor.h>
 #include <asm/page.h>
@@ -609,6 +610,14 @@ struct kvm_vcpu_arch {
 	spinlock_t tbacct_lock;
 	u64 busy_stolen;
 	u64 busy_preempt;
+
+	unsigned long *tce_tmp_hpas;	/* TCE cache for TCE_PUT_INDIRECT hcall */
+	enum {
+		TCERM_NONE
[PATCH 02/10] KVM: PPC: reserve a capability number for multitce support
This is to reserve a capability number for upcoming support of the H_PUT_TCE_INDIRECT and H_STUFF_TCE pseries hypercalls which support multiple DMA map/unmap operations per one call.

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
---
Changes:
2013/07/16:
* changed the number

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
---
 include/uapi/linux/kvm.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index acccd08..99c2533 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -667,6 +667,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_PPC_RTAS 91
 #define KVM_CAP_IRQ_XICS 92
 #define KVM_CAP_ARM_EL1_32BIT 93
+#define KVM_CAP_SPAPR_MULTITCE 94

 #ifdef KVM_CAP_IRQ_ROUTING
--
1.8.3.2
--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 03/10] KVM: PPC: reserve a capability and ioctl numbers for realmode VFIO
This is to reserve a capability number for upcoming support of VFIO-IOMMU DMA operations in real mode.

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
---
Changes:
2013/07/16:
* changed the number

2013/07/11:
* changed order in a file, added comment about a gap in ioctl number

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
---
 include/uapi/linux/kvm.h | 4
 1 file changed, 4 insertions(+)

diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 99c2533..53c3f1f 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -668,6 +668,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_IRQ_XICS 92
 #define KVM_CAP_ARM_EL1_32BIT 93
 #define KVM_CAP_SPAPR_MULTITCE 94
+#define KVM_CAP_SPAPR_TCE_IOMMU 95

 #ifdef KVM_CAP_IRQ_ROUTING
@@ -933,6 +934,9 @@ struct kvm_s390_ucas_mapping {
 #define KVM_ARM_SET_DEVICE_ADDR _IOW(KVMIO, 0xab, struct kvm_arm_device_addr)
 /* Available with KVM_CAP_PPC_RTAS */
 #define KVM_PPC_RTAS_DEFINE_TOKEN _IOW(KVMIO, 0xac, struct kvm_rtas_token_args)
+/* 0xad and 0xaf are already taken */
+/* Available with KVM_CAP_SPAPR_TCE_IOMMU */
+#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO, 0xaf, struct kvm_create_spapr_tce_iommu)

 /* ioctl for vm fd */
 #define KVM_CREATE_DEVICE	_IOWR(KVMIO, 0xe0, struct kvm_create_device)
--
1.8.3.2
[PATCH 01/10] hashtable: add hash_for_each_possible_rcu_notrace()
This adds hash_for_each_possible_rcu_notrace() which is basically a notrace clone of hash_for_each_possible_rcu() which cannot be used in real mode due to its tracing/debugging capability.

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
---
 include/linux/hashtable.h | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/include/linux/hashtable.h b/include/linux/hashtable.h
index a9df51f..af8b169 100644
--- a/include/linux/hashtable.h
+++ b/include/linux/hashtable.h
@@ -174,6 +174,21 @@ static inline void hash_del_rcu(struct hlist_node *node)
 	member)

 /**
+ * hash_for_each_possible_rcu_notrace - iterate over all possible objects hashing
+ * to the same bucket in an rcu enabled hashtable
+ * @name: hashtable to iterate
+ * @obj: the type * to use as a loop cursor for each entry
+ * @member: the name of the hlist_node within the struct
+ * @key: the key of the objects to iterate over
+ *
+ * This is the same as hash_for_each_possible_rcu() except that it does
+ * not do any RCU debugging or tracing.
+ */
+#define hash_for_each_possible_rcu_notrace(name, obj, member, key) \
+	hlist_for_each_entry_rcu_notrace(obj, &name[hash_min(key, HASH_BITS(name))],\
+		member)
+
+/**
  * hash_for_each_possible_safe - iterate over all possible objects hashing to the
  * same bucket safe against removals
  * @name: hashtable to iterate
--
1.8.3.2