Re: [PATCH kernel 9/9] KVM: PPC: Add in-kernel acceleration for VFIO
On 09/12/16 04:55, Alex Williamson wrote: > On Thu, 8 Dec 2016 19:19:56 +1100 > Alexey Kardashevskiy wrote: > >> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT >> and H_STUFF_TCE requests targeted at an IOMMU TCE table used for VFIO >> without passing them to user space, which saves time on switching >> to user space and back. >> >> This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM. >> KVM tries to handle a TCE request in real mode; if that fails, >> it passes the request to the virtual mode to complete the operation. >> If the virtual mode handler fails, the request is passed to >> user space; this is not expected to happen though. >> >> To avoid dealing with page use counters (which is tricky in real mode), >> this only accelerates SPAPR TCE IOMMU v2 clients which are required >> to pre-register the userspace memory. The very first TCE request will >> be handled in the VFIO SPAPR TCE driver anyway as the userspace view >> of the TCE table (iommu_table::it_userspace) is not allocated till >> the very first mapping happens and we cannot call vmalloc in real mode. >> >> This adds a new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to >> the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd >> and associates a physical IOMMU table with the SPAPR TCE table (which >> is a guest view of the hardware IOMMU table). The iommu_table object >> is referenced so we do not have to retrieve it in real mode when a hypercall >> happens. >> >> This does not implement the UNSET counterpart as there is no use for it - >> once the acceleration is enabled, the existing userspace won't >> disable it unless a VFIO container is destroyed - so this adds the necessary >> cleanup to the KVM_DEV_VFIO_GROUP_DEL handler. >> >> This uses the kvm->lock mutex to protect against a race between >> the VFIO KVM device's kvm_vfio_destroy() and the SPAPR TCE table fd's >> release() callback.
>> >> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user >> space. >> >> This finally makes use of vfio_external_user_iommu_id() which was >> introduced quite some time ago and was considered for removal. >> >> Tests show that this patch increases transmission speed from 220MB/s >> to 750..1020MB/s on a 10Gb network (Chelsio CXGB3 10Gb ethernet card). >> >> Signed-off-by: Alexey Kardashevskiy >> --- >> Documentation/virtual/kvm/devices/vfio.txt | 21 +- >> arch/powerpc/include/asm/kvm_host.h | 8 + >> arch/powerpc/include/asm/kvm_ppc.h | 5 + >> include/uapi/linux/kvm.h | 8 + >> arch/powerpc/kvm/book3s_64_vio.c | 302 >> + >> arch/powerpc/kvm/book3s_64_vio_hv.c | 178 + >> arch/powerpc/kvm/powerpc.c | 2 + >> virt/kvm/vfio.c | 108 +++ >> 8 files changed, 630 insertions(+), 2 deletions(-) >> >> diff --git a/Documentation/virtual/kvm/devices/vfio.txt >> b/Documentation/virtual/kvm/devices/vfio.txt >> index ef51740c67ca..ddb5a6512ab3 100644 >> --- a/Documentation/virtual/kvm/devices/vfio.txt >> +++ b/Documentation/virtual/kvm/devices/vfio.txt >> @@ -16,7 +16,24 @@ Groups: >> >> KVM_DEV_VFIO_GROUP attributes: >> KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking >> +kvm_device_attr.addr points to an int32_t file descriptor >> +for the VFIO group. >> KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device tracking >> +kvm_device_attr.addr points to an int32_t file descriptor >> +for the VFIO group. >> + KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: attaches a guest visible TCE table >> +allocated by sPAPR KVM. >> +kvm_device_attr.addr points to a struct: >> >> -For each, kvm_device_attr.addr points to an int32_t file descriptor >> -for the VFIO group.
>> +struct kvm_vfio_spapr_tce { >> +__u32 argsz; >> +__s32 groupfd; >> +__s32 tablefd; >> +__u8 pad[4]; >> +}; >> + >> +where >> +@argsz is the size of struct kvm_vfio_spapr_tce; >> +@groupfd is a file descriptor for a VFIO group; >> +@tablefd is a file descriptor for a TCE table allocated via >> +KVM_CREATE_SPAPR_TCE. >> diff --git a/arch/powerpc/include/asm/kvm_host.h >> b/arch/powerpc/include/asm/kvm_host.h >> index 28350a294b1e..94774503c70d 100644 >> --- a/arch/powerpc/include/asm/kvm_host.h >> +++ b/arch/powerpc/include/asm/kvm_host.h >> @@ -191,6 +191,13 @@ struct kvmppc_pginfo { >> atomic_t refcnt; >> }; >> >> +struct kvmppc_spapr_tce_iommu_table { >> +struct rcu_head rcu; >> +struct list_head next; >> +struct iommu_table *tbl; >> +atomic_t refs; >> +}; >> + >> struct kvmppc_spapr_tce_table { >> struct list_head list; >> struct kvm *kvm; >> @@ -199,6 +206,7 @@ struct kvmppc_spapr_tce_table { >> u32 page_shift; >> u64 offset; /* in pages */ >> u64 size;
[PATCH v2] cxl: prevent read/write to AFU config space while AFU not configured
During EEH recovery, we deconfigure all AFUs whilst leaving the corresponding vPHB and virtual PCI device in place. If something attempts to interact with the AFU's PCI config space (e.g. running lspci) after the AFU has been deconfigured and before it's reconfigured, cxl_pcie_{read,write}_config() will read invalid values from the deconfigured struct cxl_afu and proceed to Oops when they try to dereference pointers that have been set to NULL during deconfiguration. Add a rwsem to struct cxl_afu so we can prevent interaction with config space while the AFU is deconfigured. Reported-by: Pradipta Ghosh Suggested-by: Frederic Barrat Cc: sta...@vger.kernel.org # v4.9+ Signed-off-by: Andrew Donnellan Signed-off-by: Vaibhav Jain --- v1 -> v2: * Refactored to avoid locking over function boundaries - we now both lock and unlock in cxl_pcie_{read,write}_config(), rather than locking in cxl_pcie_config_info() and unlocking from the caller. Thanks Vaibhav. * Changed the stable tag to 4.9 rather than 4.4 - by the time this is merged, 4.9 will have landed, and I'll need to manually backport this for 4.4.
--- drivers/misc/cxl/cxl.h | 2 ++ drivers/misc/cxl/main.c | 3 ++- drivers/misc/cxl/pci.c | 2 ++ drivers/misc/cxl/vphb.c | 51 - 4 files changed, 35 insertions(+), 23 deletions(-) diff --git a/drivers/misc/cxl/cxl.h b/drivers/misc/cxl/cxl.h index a144073..379c463 100644 --- a/drivers/misc/cxl/cxl.h +++ b/drivers/misc/cxl/cxl.h @@ -418,6 +418,8 @@ struct cxl_afu { struct dentry *debugfs; struct mutex contexts_lock; spinlock_t afu_cntl_lock; + /* Used to block access to AFU config space while deconfigured */ + struct rw_semaphore configured_rwsem; /* AFU error buffer fields and bin attribute for sysfs */ u64 eb_len, eb_offset; diff --git a/drivers/misc/cxl/main.c b/drivers/misc/cxl/main.c index 62e0dfb..2a6bf1d 100644 --- a/drivers/misc/cxl/main.c +++ b/drivers/misc/cxl/main.c @@ -268,7 +268,8 @@ struct cxl_afu *cxl_alloc_afu(struct cxl *adapter, int slice) idr_init(&afu->contexts_idr); mutex_init(&afu->contexts_lock); spin_lock_init(&afu->afu_cntl_lock); - + init_rwsem(&afu->configured_rwsem); + down_write(&afu->configured_rwsem); afu->prefault_mode = CXL_PREFAULT_NONE; afu->irqs_max = afu->adapter->user_irqs; diff --git a/drivers/misc/cxl/pci.c b/drivers/misc/cxl/pci.c index c4d79b5d..c7b2121 100644 --- a/drivers/misc/cxl/pci.c +++ b/drivers/misc/cxl/pci.c @@ -1129,6 +1129,7 @@ static int pci_configure_afu(struct cxl_afu *afu, struct cxl *adapter, struct pc if ((rc = cxl_native_register_psl_irq(afu))) goto err2; + up_write(&afu->configured_rwsem); return 0; err2: @@ -1141,6 +1142,7 @@ static int pci_configure_afu(struct cxl_afu *afu, struct cxl *adapter, struct pc static void pci_deconfigure_afu(struct cxl_afu *afu) { + down_write(&afu->configured_rwsem); cxl_native_release_psl_irq(afu); if (afu->adapter->native->sl_ops->release_serr_irq) afu->adapter->native->sl_ops->release_serr_irq(afu); diff --git a/drivers/misc/cxl/vphb.c b/drivers/misc/cxl/vphb.c index 3519ace..639a343 100644 --- a/drivers/misc/cxl/vphb.c +++ b/drivers/misc/cxl/vphb.c @@ -76,23 +76,22 @@ static int cxl_pcie_cfg_record(u8 bus, u8 
devfn) return (bus << 8) + devfn; } -static int cxl_pcie_config_info(struct pci_bus *bus, unsigned int devfn, - struct cxl_afu **_afu, int *_record) +static inline struct cxl_afu *pci_bus_to_afu(struct pci_bus *bus) { - struct pci_controller *phb; - struct cxl_afu *afu; - int record; + struct pci_controller *phb = bus ? pci_bus_to_host(bus) : NULL; - phb = pci_bus_to_host(bus); - if (phb == NULL) - return PCIBIOS_DEVICE_NOT_FOUND; + return phb ? phb->private_data : NULL; +} + +static inline int cxl_pcie_config_info(struct pci_bus *bus, unsigned int devfn, + struct cxl_afu *afu, int *_record) +{ + int record; - afu = (struct cxl_afu *)phb->private_data; record = cxl_pcie_cfg_record(bus->number, devfn); if (record > afu->crs_num) return PCIBIOS_DEVICE_NOT_FOUND; - *_afu = afu; *_record = record; return 0; } @@ -106,9 +105,14 @@ static int cxl_pcie_read_config(struct pci_bus *bus, unsigned int devfn, u16 val16; u32 val32; - rc = cxl_pcie_config_info(bus, devfn, &afu, &record); + afu = pci_bus_to_afu(bus); + /* Grab a reader lock on afu. */ + if (afu == NULL || !down_read_trylock(&afu->configured_rwsem)) + return PCIBIOS_DEVICE_NOT_FOUND; + + rc = cxl_pcie_config_info(bus, devfn, afu, &record); if (rc) - return rc; +
Re: [PATCH] ibmvscsi: add write memory barrier to CRQ processing
On 12/08/2016 01:06 AM, Johannes Thumshirn wrote: > On Wed, Dec 07, 2016 at 05:31:26PM -0600, Tyrel Datwyler wrote: >> The first byte of each CRQ entry is used to indicate whether an entry is >> a valid response or free for the VIOS to use. After processing a >> response the driver sets the valid byte to zero to indicate the entry is >> now free to be reused. Add a memory barrier after this write to ensure >> no other stores are reordered when updating the valid byte. >> >> Signed-off-by: Tyrel Datwyler >> --- >> drivers/scsi/ibmvscsi/ibmvscsi.c | 2 ++ >> 1 file changed, 2 insertions(+) >> >> diff --git a/drivers/scsi/ibmvscsi/ibmvscsi.c >> b/drivers/scsi/ibmvscsi/ibmvscsi.c >> index d9534ee..2f5b07e 100644 >> --- a/drivers/scsi/ibmvscsi/ibmvscsi.c >> +++ b/drivers/scsi/ibmvscsi/ibmvscsi.c >> @@ -232,6 +232,7 @@ static void ibmvscsi_task(void *data) >> while ((crq = crq_queue_next_crq(&hostdata->queue)) != NULL) { >> ibmvscsi_handle_crq(crq, hostdata); >> crq->valid = VIOSRP_CRQ_FREE; >> +wmb(); >> } >> >> vio_enable_interrupts(vdev); >> @@ -240,6 +241,7 @@ static void ibmvscsi_task(void *data) >> vio_disable_interrupts(vdev); >> ibmvscsi_handle_crq(crq, hostdata); >> crq->valid = VIOSRP_CRQ_FREE; >> +wmb(); >> } else { >> done = 1; >> } > > Is this something you have seen in the wild or just a "better safe than sorry" > barrier? I myself have not observed or heard of anybody hitting an issue here. However, based on conversation with the VIOS developers, who have indicated it is required, this is a "better safe than sorry" scenario. Further, it matches what we already do in the ibmvfc driver for the CRQ processing logic. -Tyrel > > Thanks, > Johannes >
Re: [PATCH] ibmvscsi: add write memory barrier to CRQ processing
On 12/08/2016 03:29 PM, Paolo Bonzini wrote: > > > On 08/12/2016 00:31, Tyrel Datwyler wrote: >> The first byte of each CRQ entry is used to indicate whether an entry is >> a valid response or free for the VIOS to use. After processing a >> response the driver sets the valid byte to zero to indicate the entry is >> now free to be reused. Add a memory barrier after this write to ensure >> no other stores are reordered when updating the valid byte. >> >> Signed-off-by: Tyrel Datwyler >> --- >> drivers/scsi/ibmvscsi/ibmvscsi.c | 2 ++ >> 1 file changed, 2 insertions(+) >> >> diff --git a/drivers/scsi/ibmvscsi/ibmvscsi.c >> b/drivers/scsi/ibmvscsi/ibmvscsi.c >> index d9534ee..2f5b07e 100644 >> --- a/drivers/scsi/ibmvscsi/ibmvscsi.c >> +++ b/drivers/scsi/ibmvscsi/ibmvscsi.c >> @@ -232,6 +232,7 @@ static void ibmvscsi_task(void *data) >> while ((crq = crq_queue_next_crq(&hostdata->queue)) != NULL) { >> ibmvscsi_handle_crq(crq, hostdata); >> crq->valid = VIOSRP_CRQ_FREE; >> +wmb(); >> } >> >> vio_enable_interrupts(vdev); >> @@ -240,6 +241,7 @@ static void ibmvscsi_task(void *data) >> vio_disable_interrupts(vdev); >> ibmvscsi_handle_crq(crq, hostdata); >> crq->valid = VIOSRP_CRQ_FREE; >> +wmb(); > > Should this driver use virt_wmb instead? Both virt_wmb and wmb reduce to a lwsync instruction on PowerPC. -Tyrel > > Paolo > >> } else { >> done = 1; >> } >>
Re: [PATCH 3/3] powerpc: enable support for GCC plugins
On 09/12/16 05:06, Kees Cook wrote:

>> i don't think that this is the right approach. there's a general and a special issue here, both of which need different handling. the general problem is to detect problems related to gcc plugin headers and notify the users about solutions. emitting various messages from a Makefile is certainly not a scalable approach, just imagine how it will look when the other 30+ archs begin to add their own special cases... if anything, they should be documented in Documentation/gcc-plugins.txt (or a new doc if it grows too big) and the Makefile message should just point at it.

I think I agree in principle - Makefiles are already unreadable enough without a million special cases.

>> as for the solutions, the general advice should enable the use of otherwise failing gcc versions instead of forcing updating to new ones (though the latter is advisable for other reasons but not everyone's in the position to do so easily). in my experience all one needs to do is manually install the missing files from the gcc sources (ideally distros would take care of it).

If someone else is willing to write up that advice, then great.

>> the specific problem addressed here can (and IMHO should) be solved in another way: remove the inclusion of the offending headers in gcc-common.h as neither tm.h nor c-common.h are needed by existing plugins. for background,

We can't build without tm.h: http://pastebin.com/W0azfCr0 And we get warnings without c-common.h: http://pastebin.com/Aw8CAj10

>> as for the location of c-common.h, upstream gcc moved it under c-family in 2010 after the release of 4.5, so it should be where gcc-common.h expects it and i'm not sure how it ended up at its old location for you.

> That is rather odd. What distro was the PPC test done on? (Or were these manually built gcc versions?)

These were all manually built using a script running on a Debian box. Installing precompiled distro versions of rather old gccs would have been somewhat challenging.
I've just rebuilt 4.6.4 to double check that I wasn't just seeing things, but it seems that it definitely is still putting c-common.h in the old location. -- Andrew Donnellan OzLabs, ADL Canberra andrew.donnel...@au1.ibm.com IBM Australia Limited
Re: [PATCH v2] PCI: designware: add host_init error handling
Hi Srinivas, On 12/07/2016 07:32 PM, Srinivas Kandagatla wrote: > This patch adds support for returning a value from the host_init() callback in drivers, > so that the designware library can handle it or pass it to the proper place. The issue with > a void return type is that errors or error handling within the host_init() callback > are never known to the designware code, which could go ahead and access registers > even in error cases. > > A typical case in the qcom controller driver is to turn off clks in case of errors; > if the designware code continues to read/write registers when clocks are turned off, > the board would reboot/lockup. Added a comment for a minor thing. I agree with this approach. > > Signed-off-by: Srinivas Kandagatla > --- > Currently the designware code does not have a way to return errors generated > as part of the host_init() callback in controller drivers. This is an issue > with controller drivers like qcom which turn off the clocks in the error > handling path. As the dw core is unaware of this, it would continue to > access registers, which faults, resulting in board reboots/hangs. > > There are two ways to solve this issue: > one is to remove the error handling in the qcom controller host_init() function, > the other is to handle the error and pass it back to the dw core code, which would then > pass it back to the controller driver as part of the dw_pcie_host_init() return value. > > The second option seems the more sensible and correct way to fix the issue, > and this patch does the same. > > As part of this change to the host_init() return type I had to patch the other > host controller drivers which use the dw core. Most of the changes to other > drivers > are to return proper error codes to the upper layer. > Only compile tested drivers.
> > Changes since RFC: > - Add error handling to other drivers as suggested by Joao Pinto > > drivers/pci/host/pci-dra7xx.c | 10 -- > drivers/pci/host/pci-exynos.c | 10 -- > drivers/pci/host/pci-imx6.c | 10 -- > drivers/pci/host/pci-keystone.c | 10 -- > drivers/pci/host/pci-layerscape.c | 22 +- > drivers/pci/host/pcie-armada8k.c| 4 +++- > drivers/pci/host/pcie-designware-plat.c | 10 -- > drivers/pci/host/pcie-designware.c | 4 +++- > drivers/pci/host/pcie-designware.h | 2 +- > drivers/pci/host/pcie-qcom.c| 5 +++-- > drivers/pci/host/pcie-spear13xx.c | 10 -- > 11 files changed, 71 insertions(+), 26 deletions(-) > > diff --git a/drivers/pci/host/pci-dra7xx.c b/drivers/pci/host/pci-dra7xx.c > index 9595fad..811f0f9 100644 > --- a/drivers/pci/host/pci-dra7xx.c > +++ b/drivers/pci/host/pci-dra7xx.c > @@ -127,9 +127,10 @@ static void dra7xx_pcie_enable_interrupts(struct > dra7xx_pcie *dra7xx) > LEG_EP_INTERRUPTS); > } > > -static void dra7xx_pcie_host_init(struct pcie_port *pp) > +static int dra7xx_pcie_host_init(struct pcie_port *pp) > { > struct dra7xx_pcie *dra7xx = to_dra7xx_pcie(pp); > + int ret; > > pp->io_base &= DRA7XX_CPU_TO_BUS_ADDR; > pp->mem_base &= DRA7XX_CPU_TO_BUS_ADDR; > @@ -138,10 +139,15 @@ static void dra7xx_pcie_host_init(struct pcie_port *pp) > > dw_pcie_setup_rc(pp); > > - dra7xx_pcie_establish_link(dra7xx); > + ret = dra7xx_pcie_establish_link(dra7xx); > + if (ret < 0) > + return ret; > + > if (IS_ENABLED(CONFIG_PCI_MSI)) > dw_pcie_msi_init(pp); > dra7xx_pcie_enable_interrupts(dra7xx); > + > + return 0; > } > > static struct pcie_host_ops dra7xx_pcie_host_ops = { > diff --git a/drivers/pci/host/pci-exynos.c b/drivers/pci/host/pci-exynos.c > index f1c544b..c116fd9 100644 > --- a/drivers/pci/host/pci-exynos.c > +++ b/drivers/pci/host/pci-exynos.c > @@ -458,12 +458,18 @@ static int exynos_pcie_link_up(struct pcie_port *pp) > return 0; > } > > -static void exynos_pcie_host_init(struct pcie_port *pp) > +static int exynos_pcie_host_init(struct pcie_port 
*pp) > { > struct exynos_pcie *exynos_pcie = to_exynos_pcie(pp); > + int ret; > + > + ret = exynos_pcie_establish_link(exynos_pcie); > + if (ret < 0) > + return ret; > > - exynos_pcie_establish_link(exynos_pcie); > exynos_pcie_enable_interrupts(exynos_pcie); > + > + return 0; > } > > static struct pcie_host_ops exynos_pcie_host_ops = { > diff --git a/drivers/pci/host/pci-imx6.c b/drivers/pci/host/pci-imx6.c > index c8cefb0..1251e92 100644 > --- a/drivers/pci/host/pci-imx6.c > +++ b/drivers/pci/host/pci-imx6.c > @@ -550,18 +550,24 @@ static int imx6_pcie_establish_link(struct imx6_pcie > *imx6_pcie) > return ret; > } > > -static void imx6_pcie_host_init(struct pcie_port *pp) > +static int imx6_pcie_host_init(struct pcie_port *pp) > { > struct imx6_pcie *imx6_pcie = to_imx6_pcie(pp); > + int ret; > > imx6_pcie_assert_core_reset(imx6_pcie); >
[PATCHv2 1/4] pseries: Add hypercall wrappers for hash page table resizing
This adds the hypercall numbers and wrapper functions for the hash page table resizing hypercalls. These are experimental "platform specific" values for now, until we have a formal PAPR update. It also adds a new firmware feature flag to track the presence of the HPT resizing calls. Signed-off-by: David Gibson Reviewed-by: Paul Mackerras --- arch/powerpc/include/asm/firmware.h | 5 +++-- arch/powerpc/include/asm/hvcall.h | 4 +++- arch/powerpc/include/asm/plpar_wrappers.h | 12 arch/powerpc/platforms/pseries/firmware.c | 1 + 4 files changed, 19 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/include/asm/firmware.h b/arch/powerpc/include/asm/firmware.h index 1e0b5a5..8645897 100644 --- a/arch/powerpc/include/asm/firmware.h +++ b/arch/powerpc/include/asm/firmware.h @@ -42,7 +42,7 @@ #define FW_FEATURE_SPLPAR ASM_CONST(0x0010) #define FW_FEATURE_LPAR ASM_CONST(0x0040) #define FW_FEATURE_PS3_LV1 ASM_CONST(0x0080) -/* Free ASM_CONST(0x0100) */ +#define FW_FEATURE_HPT_RESIZE ASM_CONST(0x0100) #define FW_FEATURE_CMO ASM_CONST(0x0200) #define FW_FEATURE_VPHN ASM_CONST(0x0400) #define FW_FEATURE_XCMO ASM_CONST(0x0800) @@ -66,7 +66,8 @@ enum { FW_FEATURE_MULTITCE | FW_FEATURE_SPLPAR | FW_FEATURE_LPAR | FW_FEATURE_CMO | FW_FEATURE_VPHN | FW_FEATURE_XCMO | FW_FEATURE_SET_MODE | FW_FEATURE_BEST_ENERGY | - FW_FEATURE_TYPE1_AFFINITY | FW_FEATURE_PRRN, + FW_FEATURE_TYPE1_AFFINITY | FW_FEATURE_PRRN | + FW_FEATURE_HPT_RESIZE, FW_FEATURE_PSERIES_ALWAYS = 0, FW_FEATURE_POWERNV_POSSIBLE = FW_FEATURE_OPAL, FW_FEATURE_POWERNV_ALWAYS = 0, diff --git a/arch/powerpc/include/asm/hvcall.h b/arch/powerpc/include/asm/hvcall.h index 708edeb..9b7ff7c 100644 --- a/arch/powerpc/include/asm/hvcall.h +++ b/arch/powerpc/include/asm/hvcall.h @@ -275,7 +275,9 @@ #define H_COP 0x304 #define H_GET_MPP_X 0x314 #define H_SET_MODE 0x31C -#define MAX_HCALL_OPCODE H_SET_MODE +#define H_RESIZE_HPT_PREPARE 0x36C +#define H_RESIZE_HPT_COMMIT 0x370 +#define MAX_HCALL_OPCODE H_RESIZE_HPT_COMMIT /* H_VIOCTL functions 
*/ #define H_GET_VIOA_DUMP_SIZE 0x01 diff --git a/arch/powerpc/include/asm/plpar_wrappers.h b/arch/powerpc/include/asm/plpar_wrappers.h index 1b39424..b7ee6d9 100644 --- a/arch/powerpc/include/asm/plpar_wrappers.h +++ b/arch/powerpc/include/asm/plpar_wrappers.h @@ -242,6 +242,18 @@ static inline long plpar_pte_protect(unsigned long flags, unsigned long ptex, return plpar_hcall_norets(H_PROTECT, flags, ptex, avpn); } +static inline long plpar_resize_hpt_prepare(unsigned long flags, + unsigned long shift) +{ + return plpar_hcall_norets(H_RESIZE_HPT_PREPARE, flags, shift); +} + +static inline long plpar_resize_hpt_commit(unsigned long flags, + unsigned long shift) +{ + return plpar_hcall_norets(H_RESIZE_HPT_COMMIT, flags, shift); +} + static inline long plpar_tce_get(unsigned long liobn, unsigned long ioba, unsigned long *tce_ret) { diff --git a/arch/powerpc/platforms/pseries/firmware.c b/arch/powerpc/platforms/pseries/firmware.c index ea7f09b..658c02d 100644 --- a/arch/powerpc/platforms/pseries/firmware.c +++ b/arch/powerpc/platforms/pseries/firmware.c @@ -64,6 +64,7 @@ hypertas_fw_features_table[] = { {FW_FEATURE_VPHN, "hcall-vphn"}, {FW_FEATURE_SET_MODE, "hcall-set-mode"}, {FW_FEATURE_BEST_ENERGY,"hcall-best-energy-1*"}, + {FW_FEATURE_HPT_RESIZE, "hcall-hpt-resize"}, }; /* Build up the firmware features bitmask using the contents of -- 2.9.3
[PATCHv2 2/4] pseries: Add support for hash table resizing
This adds support for using experimental hypercalls to change the size of the main hash page table while running as a PAPR guest. For now these hypercalls are only in experimental qemu versions. The interface is two-part: first H_RESIZE_HPT_PREPARE is used to allocate and prepare the new hash table. This may be slow, but can be done asynchronously. Then, H_RESIZE_HPT_COMMIT is used to switch to the new hash table. This requires that no CPUs be concurrently updating the HPT, and so must be run under stop_machine(). This also adds a debugfs file which can be used to manually trigger HPT resizing for testing purposes. Signed-off-by: David Gibson Reviewed-by: Paul Mackerras --- arch/powerpc/include/asm/book3s/64/mmu-hash.h | 1 + arch/powerpc/mm/hash_utils_64.c | 32 arch/powerpc/platforms/pseries/lpar.c | 110 ++ 3 files changed, 143 insertions(+) diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h b/arch/powerpc/include/asm/book3s/64/mmu-hash.h index e407af2..efba649 100644 --- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h +++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h @@ -153,6 +153,7 @@ struct mmu_hash_ops { unsigned long addr, unsigned char *hpte_slot_array, int psize, int ssize, int local); + int (*resize_hpt)(unsigned long shift); /* * Special for kexec. * To be called in real mode with interrupts disabled. 
No locks are diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c index 78dabf06..61ce96c 100644 --- a/arch/powerpc/mm/hash_utils_64.c +++ b/arch/powerpc/mm/hash_utils_64.c @@ -35,6 +35,7 @@ #include #include #include +#include #include #include @@ -1815,3 +1816,34 @@ void hash__setup_initial_memory_limit(phys_addr_t first_memblock_base, /* Finally limit subsequent allocations */ memblock_set_current_limit(ppc64_rma_size); } + +#ifdef CONFIG_DEBUG_FS + +static int ppc64_pft_size_get(void *data, u64 *val) +{ + *val = ppc64_pft_size; + return 0; +} + +static int ppc64_pft_size_set(void *data, u64 val) +{ + if (!mmu_hash_ops.resize_hpt) + return -ENODEV; + return mmu_hash_ops.resize_hpt(val); +} + +DEFINE_SIMPLE_ATTRIBUTE(fops_ppc64_pft_size, + ppc64_pft_size_get, ppc64_pft_size_set, "%llu\n"); + +static int __init hash64_debugfs(void) +{ + if (!debugfs_create_file("pft-size", 0600, powerpc_debugfs_root, +NULL, &fops_ppc64_pft_size)) { + pr_err("lpar: unable to create ppc64_pft_size debugfs file\n"); + } + + return 0; +} +machine_device_initcall(pseries, hash64_debugfs); + +#endif /* CONFIG_DEBUG_FS */ diff --git a/arch/powerpc/platforms/pseries/lpar.c b/arch/powerpc/platforms/pseries/lpar.c index aa35245..5f0cee3 100644 --- a/arch/powerpc/platforms/pseries/lpar.c +++ b/arch/powerpc/platforms/pseries/lpar.c @@ -27,6 +27,8 @@ #include #include #include +#include +#include #include #include #include @@ -589,6 +591,113 @@ static int __init disable_bulk_remove(char *str) __setup("bulk_remove=", disable_bulk_remove); +#define HPT_RESIZE_TIMEOUT 1 /* ms */ + +struct hpt_resize_state { + unsigned long shift; + int commit_rc; +}; + +static int pseries_lpar_resize_hpt_commit(void *data) +{ + struct hpt_resize_state *state = data; + + state->commit_rc = plpar_resize_hpt_commit(0, state->shift); + if (state->commit_rc != H_SUCCESS) + return -EIO; + + /* Hypervisor has transitioned the HTAB, update our globals */ + ppc64_pft_size = state->shift; + 
htab_size_bytes = 1UL << ppc64_pft_size; + htab_hash_mask = (htab_size_bytes >> 7) - 1; + + return 0; +} + +/* Must be called in user context */ +static int pseries_lpar_resize_hpt(unsigned long shift) +{ + struct hpt_resize_state state = { + .shift = shift, + .commit_rc = H_FUNCTION, + }; + unsigned int delay, total_delay = 0; + int rc; + ktime_t t0, t1, t2; + + might_sleep(); + + if (!firmware_has_feature(FW_FEATURE_HPT_RESIZE)) + return -ENODEV; + + printk(KERN_INFO "lpar: Attempting to resize HPT to shift %lu\n", + shift); + + t0 = ktime_get(); + + rc = plpar_resize_hpt_prepare(0, shift); + while (H_IS_LONG_BUSY(rc)) { + delay = get_longbusy_msecs(rc); + total_delay += delay; + if (total_delay > HPT_RESIZE_TIMEOUT) { + /* prepare call with shift==0 cancels an +* in-progress resize */ + rc = plpar_resize_hpt_prepare(0, 0); + if (rc !=
[PATCHv2 4/4] pseries: Automatically resize HPT for memory hot add/remove
We've now implemented code in the pseries platform to use the new PAPR interface to allow resizing the hash page table (HPT) at runtime. This patch uses that interface to automatically attempt to resize the HPT when memory is hot added or removed. This tries to always keep the HPT at a reasonable size for our current memory size. Signed-off-by: David Gibson Reviewed-by: Paul Mackerras --- arch/powerpc/include/asm/sparsemem.h | 1 + arch/powerpc/mm/hash_utils_64.c | 29 + arch/powerpc/mm/mem.c | 4 3 files changed, 34 insertions(+) diff --git a/arch/powerpc/include/asm/sparsemem.h b/arch/powerpc/include/asm/sparsemem.h index f6fc0ee..737335c 100644 --- a/arch/powerpc/include/asm/sparsemem.h +++ b/arch/powerpc/include/asm/sparsemem.h @@ -16,6 +16,7 @@ #endif /* CONFIG_SPARSEMEM */ #ifdef CONFIG_MEMORY_HOTPLUG +extern void resize_hpt_for_hotplug(unsigned long new_mem_size); extern int create_section_mapping(unsigned long start, unsigned long end); extern int remove_section_mapping(unsigned long start, unsigned long end); #ifdef CONFIG_NUMA diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c index 61ce96c..abb4301 100644 --- a/arch/powerpc/mm/hash_utils_64.c +++ b/arch/powerpc/mm/hash_utils_64.c @@ -748,6 +748,35 @@ static unsigned long __init htab_get_table_size(void) } #ifdef CONFIG_MEMORY_HOTPLUG +void resize_hpt_for_hotplug(unsigned long new_mem_size) +{ + unsigned target_hpt_shift; + + if (!mmu_hash_ops.resize_hpt) + return; + + target_hpt_shift = htab_shift_for_mem_size(new_mem_size); + + /* +* To avoid lots of HPT resizes if memory size is fluctuating +* across a boundary, we deliberately have some hysteresis +* here: we immediately increase the HPT size if the target +* shift exceeds the current shift, but we won't attempt to +* reduce unless the target shift is at least 2 below the +* current shift +*/ + if ((target_hpt_shift > ppc64_pft_size) + || (target_hpt_shift < (ppc64_pft_size - 1))) { + int rc; + + rc = 
mmu_hash_ops.resize_hpt(target_hpt_shift); + if (rc) + printk(KERN_WARNING + "Unable to resize hash page table to target order %d: %d\n", + target_hpt_shift, rc); + } +} + int create_section_mapping(unsigned long start, unsigned long end) { int rc = htab_bolt_mapping(start, end, __pa(start), diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c index 5f84433..9ee536e 100644 --- a/arch/powerpc/mm/mem.c +++ b/arch/powerpc/mm/mem.c @@ -134,6 +134,8 @@ int arch_add_memory(int nid, u64 start, u64 size, bool for_device) unsigned long nr_pages = size >> PAGE_SHIFT; int rc; + resize_hpt_for_hotplug(memblock_phys_mem_size()); + pgdata = NODE_DATA(nid); start = (unsigned long)__va(start); @@ -174,6 +176,8 @@ int arch_remove_memory(u64 start, u64 size) */ vm_unmap_aliases(); + resize_hpt_for_hotplug(memblock_phys_mem_size()); + return ret; } #endif -- 2.9.3
[PATCHv2 3/4] pseries: Advertise HPT resizing support via CAS
The hypervisor needs to know a guest is capable of using the HPT resizing PAPR extension in order to take full advantage of it for memory hotplug. If the hypervisor knows the guest is HPT resize aware, it can size the initial HPT based on the initial guest RAM size, relying on the guest to resize the HPT when more memory is hot-added. Without this, the hypervisor must size the HPT for the maximum possible guest RAM, which can lead to a huge waste of space if the guest never actually expands to that maximum size. This patch advertises the guest's support for HPT resizing via the ibm,client-architecture-support OF interface. We use bit 5 of byte 6 of option vector 5 for this purpose (tentatively assigned in an in-progress PAPR change request). Signed-off-by: David Gibson Reviewed-by: Anshuman Khandual Reviewed-by: Paul Mackerras --- arch/powerpc/include/asm/prom.h | 1 + arch/powerpc/kernel/prom_init.c | 2 +- 2 files changed, 2 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/include/asm/prom.h b/arch/powerpc/include/asm/prom.h index 7f436ba..94c92bb 100644 --- a/arch/powerpc/include/asm/prom.h +++ b/arch/powerpc/include/asm/prom.h @@ -151,6 +151,7 @@ struct of_drconf_cell { #define OV5_XCMO 0x0440 /* Page Coalescing */ #define OV5_TYPE1_AFFINITY 0x0580 /* Type 1 NUMA affinity */ #define OV5_PRRN 0x0540 /* Platform Resource Reassignment */ +#define OV5_RESIZE_HPT 0x0601 /* Hash Page Table resizing */ #define OV5_PFO_HW_RNG 0x0E80 /* PFO Random Number Generator */ #define OV5_PFO_HW_842 0x0E40 /* PFO Compression Accelerator */ #define OV5_PFO_HW_ENCR 0x0E20 /* PFO Encryption Accelerator */ diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c index 88ac964..9942d9f 100644 --- a/arch/powerpc/kernel/prom_init.c +++ b/arch/powerpc/kernel/prom_init.c @@ -713,7 +713,7 @@ unsigned char ibm_architecture_vec[] = { 0, #endif OV5_FEAT(OV5_TYPE1_AFFINITY) | OV5_FEAT(OV5_PRRN), - 0, + OV5_FEAT(OV5_RESIZE_HPT), 0, 0, /* WARNING: The offset of the "number 
of cores" field below -- 2.9.3
[PATCHv2 0/4] Hash Page Table resizing for PAPR guests
This series implements the guest side of a PAPR ACR which allows a POWER guest's Hashed Page Table (HPT) to be resized at runtime. This is useful when a guest has a very large theoretical maximum RAM, but is likely to only ever be expanded to a modest amount of RAM in practice. Without resizing, the HPT has to be sized for the maximum possible guest RAM, which can be very wasteful if that maximum is never reached. To use this requires a hypervisor/host which also supports the PAPR extension. The only implementation so far is my qemu branch at https://github.com/dgibson/qemu/tree/upstream/hpt-resize I expect to merge that code to upstream qemu for qemu-2.9. Note that HPT resizing will so far only work for TCG guests; KVM support is in the works. The guest side code here will not require changing for KVM, however. An HPT resize can be triggered in one of two ways: * /sys/kernel/debug/powerpc/pft-size This debugfs file contains the current size of the HPT (as encoded in the ibm,pft-size device tree property). Writing to it will cause the guest to attempt an HPT resize to the given value. Note that the current qemu implementation will not allow the guest to resize the HPT to more than 1/64th of guest RAM size. * Automatically on memory hotplug / unplug With these patches applied, the guest will automatically attempt to resize its HPT when its RAM size changes due to hotplug events. (When hot-adding RAM, qemu considers the new size for purposes of the limit mentioned above, so this method can grow the HPT beyond what the first method alone would allow.) Changes since v1: * Remove two patches which belong with the (upcoming) host side support rather than guest side (and therefore also should go via the kvm tree instead of the powerpc tree) * Protected the debugfs code with #ifdef CONFIG_DEBUG_FS. Couldn't actually get a compile error myself, but the KVM buildbot did get one. 
David Gibson (4): pseries: Add hypercall wrappers for hash page table resizing pseries: Add support for hash table resizing pseries: Advertise HPT resizing support via CAS pseries: Automatically resize HPT for memory hot add/remove arch/powerpc/include/asm/book3s/64/mmu-hash.h | 1 + arch/powerpc/include/asm/firmware.h | 5 +- arch/powerpc/include/asm/hvcall.h | 4 +- arch/powerpc/include/asm/plpar_wrappers.h | 12 +++ arch/powerpc/include/asm/prom.h | 1 + arch/powerpc/include/asm/sparsemem.h | 1 + arch/powerpc/kernel/prom_init.c | 2 +- arch/powerpc/mm/hash_utils_64.c | 61 ++ arch/powerpc/mm/mem.c | 4 + arch/powerpc/platforms/pseries/firmware.c | 1 + arch/powerpc/platforms/pseries/lpar.c | 110 ++ 11 files changed, 198 insertions(+), 4 deletions(-) -- 2.9.3
Re: [PATCH] ibmvscsi: add write memory barrier to CRQ processing
On 08/12/2016 00:31, Tyrel Datwyler wrote: > The first byte of each CRQ entry is used to indicate whether an entry is > a valid response or free for the VIOS to use. After processing a > response the driver sets the valid byte to zero to indicate the entry is > now free to be reused. Add a memory barrier after this write to ensure > no other stores are reordered when updating the valid byte. > > Signed-off-by: Tyrel Datwyler > --- > drivers/scsi/ibmvscsi/ibmvscsi.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/drivers/scsi/ibmvscsi/ibmvscsi.c > b/drivers/scsi/ibmvscsi/ibmvscsi.c > index d9534ee..2f5b07e 100644 > --- a/drivers/scsi/ibmvscsi/ibmvscsi.c > +++ b/drivers/scsi/ibmvscsi/ibmvscsi.c > @@ -232,6 +232,7 @@ static void ibmvscsi_task(void *data) > while ((crq = crq_queue_next_crq(&hostdata->queue)) != NULL) { > ibmvscsi_handle_crq(crq, hostdata); > crq->valid = VIOSRP_CRQ_FREE; > + wmb(); > } > > vio_enable_interrupts(vdev); > @@ -240,6 +241,7 @@ static void ibmvscsi_task(void *data) > vio_disable_interrupts(vdev); > ibmvscsi_handle_crq(crq, hostdata); > crq->valid = VIOSRP_CRQ_FREE; > + wmb(); Should this driver use virt_wmb instead? Paolo > } else { > done = 1; > } >
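The ordering requirement being discussed can be modeled with C11 atomics in userspace. This is a hedged sketch of the slot-release protocol only, not the driver code; the value of VIOSRP_CRQ_FREE is assumed to be 0 here for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define VIOSRP_CRQ_FREE 0 /* assumed value, for illustration only */

struct crq_entry {
    _Atomic uint8_t valid;   /* first byte: ownership/validity flag */
    uint8_t payload[15];
};

/* After handling a response, hand the slot back and fence so the store
 * to 'valid' cannot be reordered with later stores -- the role wmb()
 * plays in the patch above. */
static void crq_release(struct crq_entry *crq)
{
    atomic_store_explicit(&crq->valid, VIOSRP_CRQ_FREE,
                          memory_order_relaxed);
    atomic_thread_fence(memory_order_release);
}

static int crq_release_demo(void)
{
    struct crq_entry e = { .valid = 1 };

    crq_release(&e);
    return atomic_load_explicit(&e.valid, memory_order_relaxed);
}
```

Paolo's question about virt_wmb() concerns which flavor of barrier is appropriate for a paravirtual device; the structure of the release sequence is the same either way.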
Re: [PATCH 1/2] ibmvscsi: add vscsi hosts to global list_head
> "Tyrel" == Tyrel Datwyler writes: Tyrel> Add each vscsi host adapter to a new global list_head named Tyrel> ibmvscsi_head. There is no functional change. This is meant Tyrel> primarily as a convenience for locating adapters from within the Tyrel> debugger or crash utility. Applied 1+2 to 4.10/scsi-queue. -- Martin K. Petersen Oracle Linux Engineering
Re: [PATCH] powerpc: Fix LPCR_VRMASD definition
On Thu, Dec 08, 2016 at 11:29:30AM +0800, Jia He wrote: > Fixes: a4b349540a ("powerpc/mm: Cleanup LPCR defines") > Signed-off-by: Jia He

Acked-by: Paul Mackerras
Re: [PATCH] powerpc/mm: Fixup wrong LPCR_VRMASD value
On Thu, Dec 08, 2016 at 09:12:13AM +0530, Aneesh Kumar K.V wrote: > In commit a4b349540a26af ("powerpc/mm: Cleanup LPCR defines") we updated > LPCR_VRMASD wrongly as below. > > -#define LPCR_VRMASD (0x1ful << (63-16)) > +#define LPCR_VRMASD_SH 47 > +#define LPCR_VRMASD (ASM_CONST(1) << LPCR_VRMASD_SH) > > We initialize the VRMA bits in LPCR to 0x00 in kvm. Hence using a different > mask value as above while updating lpcr should not have any impact. > > This patch updates it to the correct value > Fixes: a4b349540a26af ("powerpc/mm: Cleanup LPCR defines") we updated > > Reported-by: Ram Pai > Signed-off-by: Aneesh Kumar K.V > --- > arch/powerpc/include/asm/reg.h | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h > index 9e1499f98def..1c17e208db78 100644 > --- a/arch/powerpc/include/asm/reg.h > +++ b/arch/powerpc/include/asm/reg.h > @@ -337,7 +337,7 @@ > #define LPCR_DPFD_SH 52 > #define LPCR_DPFD (ASM_CONST(7) << LPCR_DPFD_SH) > #define LPCR_VRMASD_SH 47 > -#define LPCR_VRMASD (ASM_CONST(1) << LPCR_VRMASD_SH) > +#define LPCR_VRMASD (ASM_CONST(1f) << LPCR_VRMASD_SH)

Don't you need an 0x in there? Did you compile-test this? Paul.
Re: [PATCH] powerpc/mm: Fixup wrong LPCR_VRMASD value
On Thu, Dec 08, 2016 at 09:12:13AM +0530, Aneesh Kumar K.V wrote: > In commit a4b349540a26af ("powerpc/mm: Cleanup LPCR defines") we updated > LPCR_VRMASD wrongly as below. > > -#define LPCR_VRMASD (0x1ful << (63-16)) > +#define LPCR_VRMASD_SH 47 > +#define LPCR_VRMASD (ASM_CONST(1) << LPCR_VRMASD_SH) > > We initialize the VRMA bits in LPCR to 0x00 in kvm. Hence using a different > mask value as above while updating lpcr should not have any impact. > > This patch updates it to the correct value > Fixes: a4b349540a26af ("powerpc/mm: Cleanup LPCR defines") we updated > > Reported-by: Ram Pai

actually this was reported by He Jia.

> Signed-off-by: Aneesh Kumar K.V > --- > arch/powerpc/include/asm/reg.h | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h > index 9e1499f98def..1c17e208db78 100644 > --- a/arch/powerpc/include/asm/reg.h > +++ b/arch/powerpc/include/asm/reg.h > @@ -337,7 +337,7 @@ > #define LPCR_DPFD_SH 52 > #define LPCR_DPFD (ASM_CONST(7) << LPCR_DPFD_SH) > #define LPCR_VRMASD_SH 47 > -#define LPCR_VRMASD (ASM_CONST(1) << LPCR_VRMASD_SH) > +#define LPCR_VRMASD (ASM_CONST(1f) << LPCR_VRMASD_SH)

Shouldn't this be 0x1f instead of 1f ? RP
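The point both reviewers raise can be checked directly: 63 - 16 equals 47, so once the literal carries the 0x prefix, the new shift-based definition reproduces the original mask exactly. A small userspace check (ASM_CONST is simplified here from the kernel's version, which also handles assembly contexts):

```c
#include <assert.h>

#define ASM_CONST(x) x##UL            /* simplified stand-in */
#define LPCR_VRMASD_SH  47
#define LPCR_VRMASD     (ASM_CONST(0x1f) << LPCR_VRMASD_SH)  /* corrected literal */
#define LPCR_VRMASD_OLD (0x1ful << (63 - 16))                /* pre-cleanup define */
```

With plain `1f` the token is not a valid C integer constant at all, which is why Paul asks whether the patch was compile-tested.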
Re: [PATCH 4/6] pseries: Add support for hash table resizing
Hi David, [auto build test ERROR on v4.9-rc8] [cannot apply to powerpc/next kvm/linux-next kvm-ppc/kvm-ppc-next next-20161208] [if your patch is applied to the wrong git tree, please drop us a note to help improve the system] url: https://github.com/0day-ci/linux/commits/David-Gibson/powerpc-Hash-Page-Table-resizing-for-PAPR-guests/20161208-145142 config: powerpc-ps3_defconfig (attached as .config) compiler: powerpc64-linux-gnu-gcc (Debian 6.1.1-9) 6.1.1 20160705 reproduce: wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross # save the attached .config to linux build tree make.cross ARCH=powerpc All errors (new ones prefixed by >>): arch/powerpc/mm/hash_utils_64.c: In function 'hash64_debugfs': >> arch/powerpc/mm/hash_utils_64.c:1838:45: error: 'powerpc_debugfs_root' >> undeclared (first use in this function) if (!debugfs_create_file("pft-size", 0600, powerpc_debugfs_root, ^~~~ arch/powerpc/mm/hash_utils_64.c:1838:45: note: each undeclared identifier is reported only once for each function it appears in vim +/powerpc_debugfs_root +1838 arch/powerpc/mm/hash_utils_64.c 1832 1833 DEFINE_SIMPLE_ATTRIBUTE(fops_ppc64_pft_size, 1834 ppc64_pft_size_get, ppc64_pft_size_set, "%llu\n"); 1835 1836 static int __init hash64_debugfs(void) 1837 { > 1838 if (!debugfs_create_file("pft-size", 0600, powerpc_debugfs_root, 1839 NULL, &fops_ppc64_pft_size)) { 1840 pr_err("lpar: unable to create ppc64_pft_size debugsfs file\n"); 1841 } --- 0-DAY kernel test infrastructure Open Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation
Re: [PATCH v3 04/15] livepatch/x86: add TIF_PATCH_PENDING thread flag
On Thu, Dec 8, 2016 at 10:08 AM, Josh Poimboeuf wrote: > Add the TIF_PATCH_PENDING thread flag to enable the new livepatch > per-task consistency model for x86_64. The bit getting set indicates > the thread has a pending patch which needs to be applied when the thread > exits the kernel. > > The bit is placed in the _TIF_ALLWORK_MASK macro, which results in > exit_to_usermode_loop() calling klp_update_patch_state() when it's set. > > Signed-off-by: Josh Poimboeuf

Acked-by: Andy Lutomirski
[PATCH v3 15/15] livepatch: allow removal of a disabled patch
From: Miroslav Benes

Currently we do not allow a patch module to unload since there is no method to determine if a task is still running in the patched code. The consistency model gives us a way, because when the unpatching finishes we know that all tasks were marked as safe to call an original function. Thus every new call to the function calls the original code and at the same time no task can be somewhere in the patched code, because it had to leave that code to be marked as safe. We can safely let the patch module go after that. Completion is used for synchronization between module removal and sysfs infrastructure in a similar way to commit 942e443127e9 ("module: Fix mod->mkobj.kobj potentially freed too early"). Note that we still do not allow the removal for the immediate model, that is, no consistency model. The module refcount may increase in this case if somebody disables and enables the patch several times. This should not cause any harm. With this change a call to try_module_get() is moved to __klp_enable_patch from klp_register_patch to make module reference counting symmetric (module_put() is in a patch disable path) and to allow taking a new reference to a disabled module when being enabled. Also all kobject_put(&patch->kobj) calls are moved outside of klp_mutex lock protection to prevent a deadlock situation when klp_unregister_patch is called and sysfs directories are removed. There is no need to do the same for other kobject_put() callsites as we currently do not have their sysfs counterparts.
Signed-off-by: Miroslav Benes Signed-off-by: Josh Poimboeuf --- Documentation/livepatch/livepatch.txt | 29 - include/linux/livepatch.h | 3 ++ kernel/livepatch/core.c | 80 ++- kernel/livepatch/transition.c | 12 +- samples/livepatch/livepatch-sample.c | 1 - 5 files changed, 72 insertions(+), 53 deletions(-) diff --git a/Documentation/livepatch/livepatch.txt b/Documentation/livepatch/livepatch.txt index f87e742..b0eaaf8 100644 --- a/Documentation/livepatch/livepatch.txt +++ b/Documentation/livepatch/livepatch.txt @@ -265,8 +265,15 @@ section "Livepatch life-cycle" below for more details about these two operations. Module removal is only safe when there are no users of the underlying -functions. The immediate consistency model is not able to detect this; -therefore livepatch modules cannot be removed. See "Limitations" below. +functions. The immediate consistency model is not able to detect this. The +code just redirects the functions at the very beginning and it does not +check if the functions are in use. In other words, it knows when the +functions get called but it does not know when the functions return. +Therefore it cannot be decided when the livepatch module can be safely +removed. This is solved by a hybrid consistency model. When the system is +transitioned to a new patch state (patched/unpatched) it is guaranteed that +no task sleeps or runs in the old code. + 5. Livepatch life-cycle === @@ -437,24 +444,6 @@ The current Livepatch implementation has several limitations: There is work in progress to remove this limitation. - + Livepatch modules can not be removed. - -The current implementation just redirects the functions at the very -beginning. It does not check if the functions are in use. In other -words, it knows when the functions get called but it does not -know when the functions return. Therefore it can not decide when -the livepatch module can be safely removed. - -This will get most likely solved once a more complex consistency model -is supported. 
The idea is that a safe state for patching should also -mean a safe state for removing the patch. - -Note that the patch itself might get disabled by writing zero -to /sys/kernel/livepatch//enabled. It causes that the new -code will not longer get called. But it does not guarantee -that anyone is not sleeping anywhere in the new code. - - + Livepatch works reliably only when the dynamic ftrace is located at the very beginning of the function. diff --git a/include/linux/livepatch.h b/include/linux/livepatch.h index 8e06fe5..1959e52 100644 --- a/include/linux/livepatch.h +++ b/include/linux/livepatch.h @@ -23,6 +23,7 @@ #include #include +#include #if IS_ENABLED(CONFIG_LIVEPATCH) @@ -114,6 +115,7 @@ struct klp_object { * @list: list node for global list of registered patches * @kobj: kobject for sysfs resources * @enabled: the patch is enabled (but operation may be incomplete) + * @finish:for waiting till it is safe to remove the patch module */ struct klp_patch { /* external */ @@ -125,6 +127,7 @@ struct klp_patch { struct list_head list; struct kobject kobj; bool enabled; + struct completion finish; }; #define
[PATCH v3 14/15] livepatch: add /proc//patch_state
Expose the per-task patch state value so users can determine which tasks are holding up completion of a patching operation. Signed-off-by: Josh Poimboeuf --- Documentation/filesystems/proc.txt | 18 ++ fs/proc/base.c | 15 +++ 2 files changed, 33 insertions(+) diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt index 72624a1..85c501b 100644 --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt @@ -44,6 +44,7 @@ Table of Contents 3.8 /proc//fdinfo/ - Information about opened file 3.9 /proc//map_files - Information about memory mapped files 3.10 /proc//timerslack_ns - Task timerslack value + 3.11 /proc//patch_state - Livepatch patch operation state 4Configuring procfs 4.1 Mount options @@ -1886,6 +1887,23 @@ Valid values are from 0 - ULLONG_MAX An application setting the value must have PTRACE_MODE_ATTACH_FSCREDS level permissions on the task specified to change its timerslack_ns value. +3.11 /proc//patch_state - Livepatch patch operation state +- +When CONFIG_LIVEPATCH is enabled, this file displays the value of the +patch state for the task. + +A value of '-1' indicates that no patch is in transition. + +A value of '0' indicates that a patch is in transition and the task is +unpatched. If the patch is being enabled, then the task hasn't been +patched yet. If the patch is being disabled, then the task has already +been unpatched. + +A value of '1' indicates that a patch is in transition and the task is +patched. If the patch is being enabled, then the task has already been +patched. If the patch is being disabled, then the task hasn't been +unpatched yet.
+ -- Configuring procfs diff --git a/fs/proc/base.c b/fs/proc/base.c index 5ea8363..2e1e012 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -2841,6 +2841,15 @@ static int proc_pid_personality(struct seq_file *m, struct pid_namespace *ns, return err; } +#ifdef CONFIG_LIVEPATCH +static int proc_pid_patch_state(struct seq_file *m, struct pid_namespace *ns, + struct pid *pid, struct task_struct *task) +{ + seq_printf(m, "%d\n", task->patch_state); + return 0; +} +#endif /* CONFIG_LIVEPATCH */ + /* * Thread groups */ @@ -2940,6 +2949,9 @@ static const struct pid_entry tgid_base_stuff[] = { REG("timers", S_IRUGO, proc_timers_operations), #endif REG("timerslack_ns", S_IRUGO|S_IWUGO, proc_pid_set_timerslack_ns_operations), +#ifdef CONFIG_LIVEPATCH + ONE("patch_state", S_IRUSR, proc_pid_patch_state), +#endif }; static int proc_tgid_base_readdir(struct file *file, struct dir_context *ctx) @@ -3320,6 +3332,9 @@ static const struct pid_entry tid_base_stuff[] = { REG("projid_map", S_IRUGO|S_IWUSR, proc_projid_map_operations), REG("setgroups", S_IRUGO|S_IWUSR, proc_setgroups_operations), #endif +#ifdef CONFIG_LIVEPATCH + ONE("patch_state", S_IRUSR, proc_pid_patch_state), +#endif }; static int proc_tid_base_readdir(struct file *file, struct dir_context *ctx) -- 2.7.4
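The three documented patch_state values can be summarized with a small hypothetical helper; the mapping follows the documentation text above, but the function name and strings are illustrative only, not kernel code.

```c
#include <assert.h>
#include <string.h>

/* Interpret a value read from /proc/<pid>/patch_state, per the
 * documented semantics: -1 no transition, 0 unpatched, 1 patched. */
static const char *patch_state_name(int state)
{
    switch (state) {
    case -1: return "no patch in transition";
    case 0:  return "in transition, unpatched";
    case 1:  return "in transition, patched";
    default: return "invalid";
    }
}
```

A monitoring script would read the file for each task and report the tasks still in the pre-transition state, since those are the ones holding up completion.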
[PATCH v3 13/15] livepatch: change to a per-task consistency model
Change livepatch to use a basic per-task consistency model. This is the foundation which will eventually enable us to patch those ~10% of security patches which change function or data semantics. This is the biggest remaining piece needed to make livepatch more generally useful. This code stems from the design proposal made by Vojtech [1] in November 2014. It's a hybrid of kGraft and kpatch: it uses kGraft's per-task consistency and syscall barrier switching combined with kpatch's stack trace switching. There are also a number of fallback options which make it quite flexible. Patches are applied on a per-task basis, when the task is deemed safe to switch over. When a patch is enabled, livepatch enters into a transition state where tasks are converging to the patched state. Usually this transition state can complete in a few seconds. The same sequence occurs when a patch is disabled, except the tasks converge from the patched state to the unpatched state. An interrupt handler inherits the patched state of the task it interrupts. The same is true for forked tasks: the child inherits the patched state of the parent. Livepatch uses several complementary approaches to determine when it's safe to patch tasks: 1. The first and most effective approach is stack checking of sleeping tasks. If no affected functions are on the stack of a given task, the task is patched. In most cases this will patch most or all of the tasks on the first try. Otherwise it'll keep trying periodically. This option is only available if the architecture has reliable stacks (HAVE_RELIABLE_STACKTRACE). 2. The second approach, if needed, is kernel exit switching. A task is switched when it returns to user space from a system call, a user space IRQ, or a signal. It's useful in the following cases: a) Patching I/O-bound user tasks which are sleeping on an affected function. In this case you have to send SIGSTOP and SIGCONT to force it to exit the kernel and be patched. b) Patching CPU-bound user tasks. 
If the task is highly CPU-bound then it will get patched the next time it gets interrupted by an IRQ. c) In the future it could be useful for applying patches for architectures which don't yet have HAVE_RELIABLE_STACKTRACE. In this case you would have to signal most of the tasks on the system. However this isn't supported yet because there's currently no way to patch kthreads without HAVE_RELIABLE_STACKTRACE. 3. For idle "swapper" tasks, since they don't ever exit the kernel, they instead have a klp_update_patch_state() call in the idle loop which allows them to be patched before the CPU enters the idle state. (Note there's not yet such an approach for kthreads.) All the above approaches may be skipped by setting the 'immediate' flag in the 'klp_patch' struct, which will disable per-task consistency and patch all tasks immediately. This can be useful if the patch doesn't change any function or data semantics. Note that, even with this flag set, it's possible that some tasks may still be running with an old version of the function, until that function returns. There's also an 'immediate' flag in the 'klp_func' struct which allows you to specify that certain functions in the patch can be applied without per-task consistency. This might be useful if you want to patch a common function like schedule(), and the function change doesn't need consistency but the rest of the patch does. For architectures which don't have HAVE_RELIABLE_STACKTRACE, the user must set patch->immediate which causes all tasks to be patched immediately. This option should be used with care, only when the patch doesn't change any function or data semantics. In the future, architectures which don't have HAVE_RELIABLE_STACKTRACE may be allowed to use per-task consistency if we can come up with another way to patch kthreads. The /sys/kernel/livepatch//transition file shows whether a patch is in transition. Only a single patch (the topmost patch on the stack) can be in transition at a given time. 
A patch can remain in transition indefinitely, if any of the tasks are stuck in the initial patch state. A transition can be reversed and effectively canceled by writing the opposite value to the /sys/kernel/livepatch//enabled file while the transition is in progress. Then all the tasks will attempt to converge back to the original patch state. [1] https://lkml.kernel.org/r/20141107140458.ga21...@suse.cz Signed-off-by: Josh Poimboeuf --- Documentation/ABI/testing/sysfs-kernel-livepatch | 8 + Documentation/livepatch/livepatch.txt| 127 +- include/linux/init_task.h| 9 + include/linux/livepatch.h| 40 +- include/linux/sched.h| 3 + kernel/fork.c| 3 + kernel/livepatch/Makefile|
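The per-task convergence described in this commit message can be sketched as a toy userspace model: repeated passes over the task list, flipping any task whose stack is clear of affected functions and leaving the rest pending for a later retry. This is an illustration of the idea only (the type, field, and function names are invented), not the kernel's transition code.

```c
#include <assert.h>
#include <stdbool.h>

enum { KLP_UNPATCHED, KLP_PATCHED };

struct toy_task {
    int patch_state;
    bool affected_func_on_stack; /* stand-in for the stack check */
};

/* One pass of the transition: switch every task that is safe to switch,
 * count the ones that must be retried. 0 means the transition is done. */
static int transition_pass(struct toy_task *tasks, int n, int target)
{
    int pending = 0;

    for (int i = 0; i < n; i++) {
        if (tasks[i].patch_state == target)
            continue;
        if (tasks[i].affected_func_on_stack) {
            pending++;
            continue;
        }
        tasks[i].patch_state = target;
    }
    return pending;
}

static int transition_demo(void)
{
    struct toy_task t[3] = {
        { KLP_UNPATCHED, false },
        { KLP_UNPATCHED, true  },  /* sleeping in an affected function */
        { KLP_PATCHED,   false },
    };
    int pending = transition_pass(t, 3, KLP_PATCHED);

    return pending * 10 + t[0].patch_state; /* 1 pending, task 0 patched */
}
```

Reversing a transition, in this model, is simply restarting the passes with the opposite target state, which mirrors the description of writing the opposite value to the enabled file.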
[PATCH v3 12/15] livepatch: store function sizes
For the consistency model we'll need to know the sizes of the old and new functions to determine if they're on the stacks of any tasks. Signed-off-by: Josh Poimboeuf --- include/linux/livepatch.h | 3 +++ kernel/livepatch/core.c | 16 2 files changed, 19 insertions(+) diff --git a/include/linux/livepatch.h b/include/linux/livepatch.h index 1e2eb91..1a5a93c 100644 --- a/include/linux/livepatch.h +++ b/include/linux/livepatch.h @@ -37,6 +37,8 @@ * @old_addr: the address of the function being patched * @kobj: kobject for sysfs resources * @stack_node: list node for klp_ops func_stack list + * @old_size: size of the old function + * @new_size: size of the new function * @patched: the func has been added to the klp_ops list */ struct klp_func { @@ -56,6 +58,7 @@ struct klp_func { unsigned long old_addr; struct kobject kobj; struct list_head stack_node; + unsigned long old_size, new_size; bool patched; }; diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c index 8ca8a0e..fc160c6 100644 --- a/kernel/livepatch/core.c +++ b/kernel/livepatch/core.c @@ -584,6 +584,22 @@ static int klp_init_object_loaded(struct klp_patch *patch, &func->old_addr); if (ret) return ret; + + ret = kallsyms_lookup_size_offset(func->old_addr, + &func->old_size, NULL); + if (!ret) { + pr_err("kallsyms size lookup failed for '%s'\n", + func->old_name); + return -ENOENT; + } + + ret = kallsyms_lookup_size_offset((unsigned long)func->new_func, + &func->new_size, NULL); + if (!ret) { + pr_err("kallsyms size lookup failed for '%s' replacement\n", + func->old_name); + return -ENOENT; + } } return 0; -- 2.7.4
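The stated motivation, deciding whether an old function is "on the stack" of a task, reduces to testing each saved return address against the function's address range. A minimal sketch of that range check (an illustration under the commit's stated purpose, not the kernel's stack-walking code):

```c
#include <assert.h>
#include <stdbool.h>

/* True if addr falls inside [func_addr, func_addr + func_size), i.e. the
 * body of a function whose start and size are known from kallsyms. */
static bool address_in_func(unsigned long addr,
                            unsigned long func_addr,
                            unsigned long func_size)
{
    return addr >= func_addr && addr < func_addr + func_size;
}
```

Note the half-open interval: the address one past the function's last byte belongs to the next symbol and must not match.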
[PATCH v3 11/15] livepatch: use kstrtobool() in enabled_store()
The sysfs enabled value is a boolean, so kstrtobool() is a better fit for parsing the input string since it does the range checking for us. Suggested-by: Petr Mladek Signed-off-by: Josh Poimboeuf --- kernel/livepatch/core.c | 11 --- 1 file changed, 4 insertions(+), 7 deletions(-) diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c index 6a137e1..8ca8a0e 100644 --- a/kernel/livepatch/core.c +++ b/kernel/livepatch/core.c @@ -408,26 +408,23 @@ static ssize_t enabled_store(struct kobject *kobj, struct kobj_attribute *attr, { struct klp_patch *patch; int ret; - unsigned long val; + bool enabled; - ret = kstrtoul(buf, 10, &val); + ret = kstrtobool(buf, &enabled); if (ret) return -EINVAL; - if (val > 1) - return -EINVAL; - patch = container_of(kobj, struct klp_patch, kobj); mutex_lock(&klp_mutex); - if (patch->enabled == val) { + if (patch->enabled == enabled) { /* already in requested state */ ret = -EINVAL; goto err; } - if (val) { + if (enabled) { ret = __klp_enable_patch(patch); if (ret) goto err; -- 2.7.4
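The behavior the patch relies on can be sketched in userspace. This is a simplified model of the boolean spellings kstrtobool() accepts, shown to make the point that the manual "val > 1" range check becomes unnecessary; it is not the kernel implementation (which also handles spellings like "on"/"off").

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified model: y/Y/1 parse as true, n/N/0 as false, anything else
 * is rejected -- so out-of-range numbers never reach the caller. */
static int parse_bool(const char *s, bool *res)
{
    switch (s[0]) {
    case 'y': case 'Y': case '1': *res = true;  return 0;
    case 'n': case 'N': case '0': *res = false; return 0;
    default: return -1; /* the kernel returns -EINVAL */
    }
}

static int parse_bool_demo(const char *s)
{
    bool b;

    if (parse_bool(s, &b))
        return -1;
    return b ? 1 : 0;
}
```

In the diff above this is exactly why the `if (val > 1) return -EINVAL;` block can be dropped: a string like "2" already fails the parse.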
[PATCH v3 10/15] livepatch: move patching functions into patch.c
Move functions related to the actual patching of functions and objects into a new patch.c file. Signed-off-by: Josh Poimboeuf--- kernel/livepatch/Makefile | 2 +- kernel/livepatch/core.c | 202 +-- kernel/livepatch/patch.c | 213 ++ kernel/livepatch/patch.h | 32 +++ 4 files changed, 247 insertions(+), 202 deletions(-) create mode 100644 kernel/livepatch/patch.c create mode 100644 kernel/livepatch/patch.h diff --git a/kernel/livepatch/Makefile b/kernel/livepatch/Makefile index e8780c0..e136dad 100644 --- a/kernel/livepatch/Makefile +++ b/kernel/livepatch/Makefile @@ -1,3 +1,3 @@ obj-$(CONFIG_LIVEPATCH) += livepatch.o -livepatch-objs := core.o +livepatch-objs := core.o patch.o diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c index 47ed643..6a137e1 100644 --- a/kernel/livepatch/core.c +++ b/kernel/livepatch/core.c @@ -24,32 +24,13 @@ #include #include #include -#include #include #include #include #include #include #include - -/** - * struct klp_ops - structure for tracking registered ftrace ops structs - * - * A single ftrace_ops is shared between all enabled replacement functions - * (klp_func structs) which have the same old_addr. This allows the switch - * between function versions to happen instantaneously by updating the klp_ops - * struct's func_stack list. The winner is the klp_func at the top of the - * func_stack (front of the list). 
- * - * @node: node for the global klp_ops list - * @func_stack: list head for the stack of klp_func's (active func is on top) - * @fops: registered ftrace ops struct - */ -struct klp_ops { - struct list_head node; - struct list_head func_stack; - struct ftrace_ops fops; -}; +#include "patch.h" /* * The klp_mutex protects the global lists and state transitions of any @@ -60,28 +41,12 @@ struct klp_ops { static DEFINE_MUTEX(klp_mutex); static LIST_HEAD(klp_patches); -static LIST_HEAD(klp_ops); static struct kobject *klp_root_kobj; /* TODO: temporary stub */ void klp_update_patch_state(struct task_struct *task) {} -static struct klp_ops *klp_find_ops(unsigned long old_addr) -{ - struct klp_ops *ops; - struct klp_func *func; - - list_for_each_entry(ops, &klp_ops, node) { - func = list_first_entry(&ops->func_stack, struct klp_func, - stack_node); - if (func->old_addr == old_addr) - return ops; - } - - return NULL; -} - static bool klp_is_module(struct klp_object *obj) { return obj->name; @@ -314,171 +279,6 @@ static int klp_write_object_relocations(struct module *pmod, return ret; } -static void notrace klp_ftrace_handler(unsigned long ip, - unsigned long parent_ip, - struct ftrace_ops *fops, - struct pt_regs *regs) -{ - struct klp_ops *ops; - struct klp_func *func; - - ops = container_of(fops, struct klp_ops, fops); - - rcu_read_lock(); - func = list_first_or_null_rcu(&ops->func_stack, struct klp_func, - stack_node); - if (WARN_ON_ONCE(!func)) - goto unlock; - - klp_arch_set_pc(regs, (unsigned long)func->new_func); -unlock: - rcu_read_unlock(); -} - -/* - * Convert a function address into the appropriate ftrace location. - * - * Usually this is just the address of the function, but on some architectures - * it's more complicated so allow them to provide a custom behaviour.
- */ -#ifndef klp_get_ftrace_location -static unsigned long klp_get_ftrace_location(unsigned long faddr) -{ - return faddr; -} -#endif - -static void klp_unpatch_func(struct klp_func *func) -{ - struct klp_ops *ops; - - if (WARN_ON(!func->patched)) - return; - if (WARN_ON(!func->old_addr)) - return; - - ops = klp_find_ops(func->old_addr); - if (WARN_ON(!ops)) - return; - - if (list_is_singular(&ops->func_stack)) { - unsigned long ftrace_loc; - - ftrace_loc = klp_get_ftrace_location(func->old_addr); - if (WARN_ON(!ftrace_loc)) - return; - - WARN_ON(unregister_ftrace_function(&ops->fops)); - WARN_ON(ftrace_set_filter_ip(&ops->fops, ftrace_loc, 1, 0)); - - list_del_rcu(&func->stack_node); - list_del(&ops->node); - kfree(ops); - } else { - list_del_rcu(&func->stack_node); - } - - func->patched = false; -} - -static int klp_patch_func(struct klp_func *func) -{ - struct klp_ops *ops; - int ret; - - if (WARN_ON(!func->old_addr)) - return -EINVAL; - - if (WARN_ON(func->patched)) - return -EINVAL; -
[PATCH v3 09/15] livepatch: remove unnecessary object loaded check
klp_patch_object()'s callers already ensure that the object is loaded, so its call to klp_is_object_loaded() is unnecessary. This will also make it possible to move the patching code into a separate file. Signed-off-by: Josh Poimboeuf--- kernel/livepatch/core.c | 3 --- 1 file changed, 3 deletions(-) diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c index 2dbd355..47ed643 100644 --- a/kernel/livepatch/core.c +++ b/kernel/livepatch/core.c @@ -467,9 +467,6 @@ static int klp_patch_object(struct klp_object *obj) if (WARN_ON(obj->patched)) return -EINVAL; - if (WARN_ON(!klp_is_object_loaded(obj))) - return -EINVAL; - klp_for_each_func(obj, func) { ret = klp_patch_func(func); if (ret) { -- 2.7.4
[PATCH v3 03/15] livepatch: temporary stubs for klp_patch_pending() and klp_update_patch_state()
Create temporary stubs for klp_patch_pending() and klp_update_patch_state() so we can add TIF_PATCH_PENDING to different architectures in separate patches without breaking build bisectability. Signed-off-by: Josh Poimboeuf--- include/linux/livepatch.h | 7 ++- kernel/livepatch/core.c | 3 +++ 2 files changed, 9 insertions(+), 1 deletion(-) diff --git a/include/linux/livepatch.h b/include/linux/livepatch.h index 9072f04..60558d8 100644 --- a/include/linux/livepatch.h +++ b/include/linux/livepatch.h @@ -123,10 +123,15 @@ void arch_klp_init_object_loaded(struct klp_patch *patch, int klp_module_coming(struct module *mod); void klp_module_going(struct module *mod); +static inline bool klp_patch_pending(struct task_struct *task) { return false; } +void klp_update_patch_state(struct task_struct *task); + #else /* !CONFIG_LIVEPATCH */ static inline int klp_module_coming(struct module *mod) { return 0; } -static inline void klp_module_going(struct module *mod) { } +static inline void klp_module_going(struct module *mod) {} +static inline bool klp_patch_pending(struct task_struct *task) { return false; } +static inline void klp_update_patch_state(struct task_struct *task) {} #endif /* CONFIG_LIVEPATCH */ diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c index af46438..217b39d 100644 --- a/kernel/livepatch/core.c +++ b/kernel/livepatch/core.c @@ -64,6 +64,9 @@ static LIST_HEAD(klp_ops); static struct kobject *klp_root_kobj; +/* TODO: temporary stub */ +void klp_update_patch_state(struct task_struct *task) {} + static struct klp_ops *klp_find_ops(unsigned long old_addr) { struct klp_ops *ops; -- 2.7.4
[PATCH v3 08/15] livepatch: separate enabled and patched states
Once we have a consistency model, patches and their objects will be enabled and disabled at different times. For example, when a patch is disabled, its loaded objects' funcs can remain registered with ftrace indefinitely until the unpatching operation is complete and they're no longer in use. It's less confusing if we give them different names: patches can be enabled or disabled; objects (and their funcs) can be patched or unpatched: - Enabled means that a patch is logically enabled (but not necessarily fully applied). - Patched means that an object's funcs are registered with ftrace and added to the klp_ops func stack. Also, since these states are binary, represent them with booleans instead of ints. Signed-off-by: Josh Poimboeuf--- include/linux/livepatch.h | 17 --- kernel/livepatch/core.c | 72 +++ 2 files changed, 42 insertions(+), 47 deletions(-) diff --git a/include/linux/livepatch.h b/include/linux/livepatch.h index 60558d8..1e2eb91 100644 --- a/include/linux/livepatch.h +++ b/include/linux/livepatch.h @@ -28,11 +28,6 @@ #include -enum klp_state { - KLP_DISABLED, - KLP_ENABLED -}; - /** * struct klp_func - function structure for live patching * @old_name: name of the function to be patched @@ -41,8 +36,8 @@ enum klp_state { * can be found (optional) * @old_addr: the address of the function being patched * @kobj: kobject for sysfs resources - * @state: tracks function-level patch application state * @stack_node:list node for klp_ops func_stack list + * @patched: the func has been added to the klp_ops list */ struct klp_func { /* external */ @@ -60,8 +55,8 @@ struct klp_func { /* internal */ unsigned long old_addr; struct kobject kobj; - enum klp_state state; struct list_head stack_node; + bool patched; }; /** @@ -71,7 +66,7 @@ struct klp_func { * @kobj: kobject for sysfs resources * @mod: kernel module associated with the patched object * (NULL for vmlinux) - * @state: tracks object-level patch application state + * @patched: the object's funcs have been added 
 to the klp_ops list */
 struct klp_object {
 	/* external */
@@ -81,7 +76,7 @@ struct klp_object {
 	/* internal */
 	struct kobject kobj;
 	struct module *mod;
-	enum klp_state state;
+	bool patched;
 };

 /**
@@ -90,7 +85,7 @@ struct klp_object {
  * @objs:	object entries for kernel objects to be patched
  * @list:	list node for global list of registered patches
  * @kobj:	kobject for sysfs resources
- * @state:	tracks patch-level application state
+ * @enabled:	the patch is enabled (but operation may be incomplete)
  */
 struct klp_patch {
 	/* external */
@@ -100,7 +95,7 @@ struct klp_patch {
 	/* internal */
 	struct list_head list;
 	struct kobject kobj;
-	enum klp_state state;
+	bool enabled;
 };

 #define klp_for_each_object(patch, obj) \
diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c
index 217b39d..2dbd355 100644
--- a/kernel/livepatch/core.c
+++ b/kernel/livepatch/core.c
@@ -348,11 +348,11 @@ static unsigned long klp_get_ftrace_location(unsigned long faddr)
 }
 #endif

-static void klp_disable_func(struct klp_func *func)
+static void klp_unpatch_func(struct klp_func *func)
 {
 	struct klp_ops *ops;

-	if (WARN_ON(func->state != KLP_ENABLED))
+	if (WARN_ON(!func->patched))
 		return;
 	if (WARN_ON(!func->old_addr))
 		return;
@@ -378,10 +378,10 @@ static void klp_disable_func(struct klp_func *func)
 		list_del_rcu(&func->stack_node);
 	}

-	func->state = KLP_DISABLED;
+	func->patched = false;
 }

-static int klp_enable_func(struct klp_func *func)
+static int klp_patch_func(struct klp_func *func)
 {
 	struct klp_ops *ops;
 	int ret;
@@ -389,7 +389,7 @@ static int klp_enable_func(struct klp_func *func)
 	if (WARN_ON(!func->old_addr))
 		return -EINVAL;

-	if (WARN_ON(func->state != KLP_DISABLED))
+	if (WARN_ON(func->patched))
 		return -EINVAL;

 	ops = klp_find_ops(func->old_addr);
@@ -437,7 +437,7 @@ static int klp_enable_func(struct klp_func *func)
 		list_add_rcu(&func->stack_node, &ops->func_stack);
 	}

-	func->state = KLP_ENABLED;
+	func->patched = true;

 	return 0;

@@ -448,36 +448,36 @@ static int klp_enable_func(struct klp_func *func)
 	return ret;
 }

-static void klp_disable_object(struct klp_object *obj)
+static void klp_unpatch_object(struct klp_object *obj)
 {
 	struct klp_func *func;

 	klp_for_each_func(obj, func)
-		if (func->state == KLP_ENABLED)
-			klp_disable_func(func);
+
[PATCH v3 07/15] livepatch/s390: add TIF_PATCH_PENDING thread flag
From: Miroslav Benes

Update a task's patch state when returning from a system call or user
space interrupt, or after handling a signal.

This greatly increases the chances of a patch operation succeeding.  If
a task is I/O bound, it can be patched when returning from a system
call.  If a task is CPU bound, it can be patched when returning from an
interrupt.  If a task is sleeping on a to-be-patched function, the user
can send SIGSTOP and SIGCONT to force it to switch.

Since there are two ways the syscall can be restarted on return from a
signal handling process, it is important to clear the flag before
do_signal() is called.  Otherwise we could miss the migration if we used
SIGSTOP/SIGCONT procedure or fake signal to migrate patching blocking
tasks.  If we place our hook to sysc_work label in entry before
TIF_SIGPENDING is evaluated we kill two birds with one stone.  The task
is correctly migrated in all return paths from a syscall.

Signed-off-by: Miroslav Benes
Signed-off-by: Josh Poimboeuf
---
 arch/s390/include/asm/thread_info.h |  2 ++
 arch/s390/kernel/entry.S            | 31 ++++++++++++++++++++++++++++++-
 2 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/arch/s390/include/asm/thread_info.h b/arch/s390/include/asm/thread_info.h
index 4977668..646845e 100644
--- a/arch/s390/include/asm/thread_info.h
+++ b/arch/s390/include/asm/thread_info.h
@@ -56,6 +56,7 @@ int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src);
 #define TIF_SIGPENDING		1	/* signal pending */
 #define TIF_NEED_RESCHED	2	/* rescheduling necessary */
 #define TIF_UPROBE		3	/* breakpointed or single-stepping */
+#define TIF_PATCH_PENDING	4	/* pending live patching update */

 #define TIF_31BIT		16	/* 32bit process */
 #define TIF_MEMDIE		17	/* is terminating due to OOM killer */
@@ -74,6 +75,7 @@ int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src);
 #define _TIF_SIGPENDING		_BITUL(TIF_SIGPENDING)
 #define _TIF_NEED_RESCHED	_BITUL(TIF_NEED_RESCHED)
 #define _TIF_UPROBE		_BITUL(TIF_UPROBE)
+#define _TIF_PATCH_PENDING	_BITUL(TIF_PATCH_PENDING)
 #define _TIF_31BIT		_BITUL(TIF_31BIT)
 #define _TIF_SINGLE_STEP	_BITUL(TIF_SINGLE_STEP)

diff --git a/arch/s390/kernel/entry.S b/arch/s390/kernel/entry.S
index 161f4e6..33848a8 100644
--- a/arch/s390/kernel/entry.S
+++ b/arch/s390/kernel/entry.S
@@ -47,7 +47,7 @@ STACK_SIZE  = 1 << STACK_SHIFT
 STACK_INIT = STACK_SIZE - STACK_FRAME_OVERHEAD - __PT_SIZE

 _TIF_WORK	= (_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_NEED_RESCHED | \
-		   _TIF_UPROBE)
+		   _TIF_UPROBE | _TIF_PATCH_PENDING)
 _TIF_TRACE	= (_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | _TIF_SECCOMP | \
 		   _TIF_SYSCALL_TRACEPOINT)
 _CIF_WORK	= (_CIF_MCCK_PENDING | _CIF_ASCE | _CIF_FPU)
@@ -352,6 +352,11 @@ ENTRY(system_call)
 #endif
 	TSTMSK	__PT_FLAGS(%r11),_PIF_PER_TRAP
 	jo	.Lsysc_singlestep
+#ifdef CONFIG_LIVEPATCH
+	TSTMSK	__TI_flags(%r12),_TIF_PATCH_PENDING
+	jo	.Lsysc_patch_pending	# handle live patching just before
+					# signals and possible syscall restart
+#endif
 	TSTMSK	__TI_flags(%r12),_TIF_SIGPENDING
 	jo	.Lsysc_sigpending
 	TSTMSK	__TI_flags(%r12),_TIF_NOTIFY_RESUME
@@ -426,6 +431,16 @@ ENTRY(system_call)
 #endif

 #
+# _TIF_PATCH_PENDING is set, call klp_update_patch_state
+#
+#ifdef CONFIG_LIVEPATCH
+.Lsysc_patch_pending:
+	lg	%r2,__LC_CURRENT	# pass pointer to task struct
+	larl	%r14,.Lsysc_return
+	jg	klp_update_patch_state
+#endif
+
+#
 # _PIF_PER_TRAP is set, call do_per_trap
 #
 .Lsysc_singlestep:
@@ -674,6 +689,10 @@ ENTRY(io_int_handler)
 	jo	.Lio_mcck_pending
 	TSTMSK	__TI_flags(%r12),_TIF_NEED_RESCHED
 	jo	.Lio_reschedule
+#ifdef CONFIG_LIVEPATCH
+	TSTMSK	__TI_flags(%r12),_TIF_PATCH_PENDING
+	jo	.Lio_patch_pending
+#endif
 	TSTMSK	__TI_flags(%r12),_TIF_SIGPENDING
 	jo	.Lio_sigpending
 	TSTMSK	__TI_flags(%r12),_TIF_NOTIFY_RESUME
@@ -720,6 +739,16 @@ ENTRY(io_int_handler)
 	j	.Lio_return

 #
+# _TIF_PATCH_PENDING is set, call klp_update_patch_state
+#
+#ifdef CONFIG_LIVEPATCH
+.Lio_patch_pending:
+	lg	%r2,__LC_CURRENT	# pass pointer to task struct
+	larl	%r14,.Lio_return
+	jg	klp_update_patch_state
+#endif
+
+#
 # _TIF_SIGPENDING or is set, call do_signal
 #
 .Lio_sigpending:
-- 
2.7.4
[PATCH v3 06/15] livepatch/s390: reorganize TIF thread flag bits
From: Jiri Slaby

Group the TIF thread flag bits by their inclusion in the _TIF_WORK and
_TIF_TRACE macros.

Signed-off-by: Jiri Slaby
Signed-off-by: Josh Poimboeuf
---
 arch/s390/include/asm/thread_info.h | 22 ++++++++++++++--------
 1 file changed, 14 insertions(+), 8 deletions(-)

diff --git a/arch/s390/include/asm/thread_info.h b/arch/s390/include/asm/thread_info.h
index a5b54a4..4977668 100644
--- a/arch/s390/include/asm/thread_info.h
+++ b/arch/s390/include/asm/thread_info.h
@@ -51,14 +51,12 @@ int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src);
 /*
  * thread information flags bit numbers
  */
+/* _TIF_WORK bits */
 #define TIF_NOTIFY_RESUME	0	/* callback before returning to user */
 #define TIF_SIGPENDING		1	/* signal pending */
 #define TIF_NEED_RESCHED	2	/* rescheduling necessary */
-#define TIF_SYSCALL_TRACE	3	/* syscall trace active */
-#define TIF_SYSCALL_AUDIT	4	/* syscall auditing active */
-#define TIF_SECCOMP		5	/* secure computing */
-#define TIF_SYSCALL_TRACEPOINT	6	/* syscall tracepoint instrumentation */
-#define TIF_UPROBE		7	/* breakpointed or single-stepping */
+#define TIF_UPROBE		3	/* breakpointed or single-stepping */
+
 #define TIF_31BIT		16	/* 32bit process */
 #define TIF_MEMDIE		17	/* is terminating due to OOM killer */
 #define TIF_RESTORE_SIGMASK	18	/* restore signal mask in do_signal() */
@@ -66,15 +64,23 @@ int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src);
 #define TIF_BLOCK_STEP		20	/* This task is block stepped */
 #define TIF_UPROBE_SINGLESTEP	21	/* This task is uprobe single stepped */

+/* _TIF_TRACE bits */
+#define TIF_SYSCALL_TRACE	24	/* syscall trace active */
+#define TIF_SYSCALL_AUDIT	25	/* syscall auditing active */
+#define TIF_SECCOMP		26	/* secure computing */
+#define TIF_SYSCALL_TRACEPOINT	27	/* syscall tracepoint instrumentation */
+
 #define _TIF_NOTIFY_RESUME	_BITUL(TIF_NOTIFY_RESUME)
 #define _TIF_SIGPENDING		_BITUL(TIF_SIGPENDING)
 #define _TIF_NEED_RESCHED	_BITUL(TIF_NEED_RESCHED)
+#define _TIF_UPROBE		_BITUL(TIF_UPROBE)
+
+#define _TIF_31BIT		_BITUL(TIF_31BIT)
+#define _TIF_SINGLE_STEP	_BITUL(TIF_SINGLE_STEP)
+
 #define _TIF_SYSCALL_TRACE	_BITUL(TIF_SYSCALL_TRACE)
 #define _TIF_SYSCALL_AUDIT	_BITUL(TIF_SYSCALL_AUDIT)
 #define _TIF_SECCOMP		_BITUL(TIF_SECCOMP)
 #define _TIF_SYSCALL_TRACEPOINT	_BITUL(TIF_SYSCALL_TRACEPOINT)
-#define _TIF_UPROBE		_BITUL(TIF_UPROBE)
-#define _TIF_31BIT		_BITUL(TIF_31BIT)
-#define _TIF_SINGLE_STEP	_BITUL(TIF_SINGLE_STEP)

 #endif /* _ASM_THREAD_INFO_H */
-- 
2.7.4
[PATCH v3 05/15] livepatch/powerpc: add TIF_PATCH_PENDING thread flag
Add the TIF_PATCH_PENDING thread flag to enable the new livepatch
per-task consistency model for powerpc.  The bit getting set indicates
the thread has a pending patch which needs to be applied when the
thread exits the kernel.

The bit is included in the _TIF_USER_WORK_MASK macro so that
do_notify_resume() and klp_update_patch_state() get called when the bit
is set.

Signed-off-by: Josh Poimboeuf
---
 arch/powerpc/include/asm/thread_info.h | 4 +++-
 arch/powerpc/kernel/signal.c           | 4 ++++
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/thread_info.h b/arch/powerpc/include/asm/thread_info.h
index 87e4b2d..6fc6464 100644
--- a/arch/powerpc/include/asm/thread_info.h
+++ b/arch/powerpc/include/asm/thread_info.h
@@ -92,6 +92,7 @@ static inline struct thread_info *current_thread_info(void)
 					   TIF_NEED_RESCHED */
 #define TIF_32BIT		4	/* 32 bit binary */
 #define TIF_RESTORE_TM		5	/* need to restore TM FP/VEC/VSX */
+#define TIF_PATCH_PENDING	6	/* pending live patching update */
 #define TIF_SYSCALL_AUDIT	7	/* syscall auditing active */
 #define TIF_SINGLESTEP		8	/* singlestepping active */
 #define TIF_NOHZ		9	/* in adaptive nohz mode */
@@ -115,6 +116,7 @@ static inline struct thread_info *current_thread_info(void)
 #define _TIF_POLLING_NRFLAG	(1<
[PATCH v3 04/15] livepatch/x86: add TIF_PATCH_PENDING thread flag
Add the TIF_PATCH_PENDING thread flag to enable the new livepatch
per-task consistency model for x86_64.  The bit getting set indicates
the thread has a pending patch which needs to be applied when the
thread exits the kernel.

The bit is placed in the _TIF_ALLWORK_MASK macro, which results in
exit_to_usermode_loop() calling klp_update_patch_state() when it's set.

Signed-off-by: Josh Poimboeuf
---
 arch/x86/entry/common.c            | 9 ++++++---
 arch/x86/include/asm/thread_info.h | 4 +++-
 2 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index bdd9cc5..16a51a5 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -21,6 +21,7 @@
 #include
 #include
 #include
+#include <linux/livepatch.h>

 #include
 #include

@@ -129,14 +130,13 @@ static long syscall_trace_enter(struct pt_regs *regs)

 #define EXIT_TO_USERMODE_LOOP_FLAGS				\
 	(_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE |	\
-	 _TIF_NEED_RESCHED | _TIF_USER_RETURN_NOTIFY)
+	 _TIF_NEED_RESCHED | _TIF_USER_RETURN_NOTIFY | _TIF_PATCH_PENDING)

 static void exit_to_usermode_loop(struct pt_regs *regs, u32 cached_flags)
 {
 	/*
 	 * In order to return to user mode, we need to have IRQs off with
-	 * none of _TIF_SIGPENDING, _TIF_NOTIFY_RESUME, _TIF_USER_RETURN_NOTIFY,
-	 * _TIF_UPROBE, or _TIF_NEED_RESCHED set.  Several of these flags
+	 * none of EXIT_TO_USERMODE_LOOP_FLAGS set.  Several of these flags
 	 * can be set at any time on preemptable kernels if we have IRQs on,
 	 * so we need to loop.  Disabling preemption wouldn't help: doing the
 	 * work to clear some of the flags can sleep.
 	 */
@@ -163,6 +163,9 @@ static void exit_to_usermode_loop(struct pt_regs *regs, u32 cached_flags)
 		if (cached_flags & _TIF_USER_RETURN_NOTIFY)
 			fire_user_return_notifiers();

+		if (cached_flags & _TIF_PATCH_PENDING)
+			klp_update_patch_state(current);
+
 		/* Disable IRQs and retry */
 		local_irq_disable();

diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 1fe6043..79f4d6a 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -84,6 +84,7 @@ struct thread_info {
 #define TIF_SECCOMP		8	/* secure computing */
 #define TIF_USER_RETURN_NOTIFY	11	/* notify kernel of userspace return */
 #define TIF_UPROBE		12	/* breakpointed or singlestepping */
+#define TIF_PATCH_PENDING	13	/* pending live patching update */
 #define TIF_NOTSC		16	/* TSC is not accessible in userland */
 #define TIF_IA32		17	/* IA32 compatibility process */
 #define TIF_NOHZ		19	/* in adaptive nohz mode */
@@ -107,6 +108,7 @@ struct thread_info {
 #define _TIF_SECCOMP		(1 << TIF_SECCOMP)
 #define _TIF_USER_RETURN_NOTIFY	(1 << TIF_USER_RETURN_NOTIFY)
 #define _TIF_UPROBE		(1 << TIF_UPROBE)
+#define _TIF_PATCH_PENDING	(1 << TIF_PATCH_PENDING)
 #define _TIF_NOTSC		(1 << TIF_NOTSC)
 #define _TIF_IA32		(1 << TIF_IA32)
 #define _TIF_NOHZ		(1 << TIF_NOHZ)
@@ -133,7 +135,7 @@ struct thread_info {
 	(_TIF_SYSCALL_TRACE | _TIF_NOTIFY_RESUME | _TIF_SIGPENDING |	\
 	 _TIF_SINGLESTEP | _TIF_NEED_RESCHED | _TIF_SYSCALL_EMU |	\
 	 _TIF_SYSCALL_AUDIT | _TIF_USER_RETURN_NOTIFY | _TIF_UPROBE |	\
-	 _TIF_SYSCALL_TRACEPOINT | _TIF_NOHZ)
+	 _TIF_SYSCALL_TRACEPOINT | _TIF_NOHZ | _TIF_PATCH_PENDING)

 /* flags to check in __switch_to() */
 #define _TIF_WORK_CTXSW \
-- 
2.7.4
[PATCH v3 02/15] x86/entry: define _TIF_ALLWORK_MASK flags explicitly
The _TIF_ALLWORK_MASK macro automatically includes the least-significant
16 bits of the thread_info flags, which is less than obvious and tends
to create confusion and surprises when reading or modifying the code.

Define the flags explicitly.

Signed-off-by: Josh Poimboeuf
---
 arch/x86/include/asm/thread_info.h | 9 +++++----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index ad6f5eb0..1fe6043 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -73,9 +73,6 @@ struct thread_info {
  * thread information flags
  * - these are process state flags that various assembly files
  *   may need to access
- * - pending work-to-be-done flags are in LSW
- * - other flags in MSW
- * Warning: layout of LSW is hardcoded in entry.S
  */
 #define TIF_SYSCALL_TRACE	0	/* syscall trace active */
 #define TIF_NOTIFY_RESUME	1	/* callback before returning to user */
@@ -133,8 +130,10 @@ struct thread_info {

 /* work to do on any return to user space */
 #define _TIF_ALLWORK_MASK						\
-	((0x0000FFFF & ~_TIF_SECCOMP) | _TIF_SYSCALL_TRACEPOINT |	\
-	 _TIF_NOHZ)
+	(_TIF_SYSCALL_TRACE | _TIF_NOTIFY_RESUME | _TIF_SIGPENDING |	\
+	 _TIF_SINGLESTEP | _TIF_NEED_RESCHED | _TIF_SYSCALL_EMU |	\
+	 _TIF_SYSCALL_AUDIT | _TIF_USER_RETURN_NOTIFY | _TIF_UPROBE |	\
+	 _TIF_SYSCALL_TRACEPOINT | _TIF_NOHZ)

 /* flags to check in __switch_to() */
 #define _TIF_WORK_CTXSW \
-- 
2.7.4
[PATCH v3 01/15] stacktrace/x86: add function for detecting reliable stack traces
For live patching and possibly other use cases, a stack trace is only
useful if it can be assured that it's completely reliable.  Add a new
save_stack_trace_tsk_reliable() function to achieve that.

Scenarios which indicate that a stack trace may be unreliable:

- running task
- interrupt stack
- preemption
- corrupted stack data
- stack grows the wrong way
- stack walk doesn't reach the bottom
- user didn't provide a large enough entries array

Also add CONFIG_HAVE_RELIABLE_STACKTRACE so arch-independent code can
determine at build time whether the function is implemented.

Signed-off-by: Josh Poimboeuf
---
 arch/Kconfig                   |  6 +++++
 arch/x86/Kconfig               |  1 +
 arch/x86/include/asm/unwind.h  |  6 +++++
 arch/x86/kernel/stacktrace.c   | 59 +++++++++++++++++++++++++++++++++++-
 arch/x86/kernel/unwind_frame.c |  1 +
 include/linux/stacktrace.h     |  8 +++---
 kernel/stacktrace.c            | 12 +++++---
 7 files changed, 87 insertions(+), 6 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 13f27c1..d61a133 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -678,6 +678,12 @@ config HAVE_STACK_VALIDATION
 	  Architecture supports the 'objtool check' host tool command, which
 	  performs compile-time stack metadata validation.

+config HAVE_RELIABLE_STACKTRACE
+	bool
+	help
+	  Architecture has a save_stack_trace_tsk_reliable() function which
+	  only returns a stack trace if it can guarantee the trace is reliable.
+
 config HAVE_ARCH_HASH
 	bool
 	default n
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 215612c..b4a6663 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -155,6 +155,7 @@ config X86
 	select HAVE_PERF_REGS
 	select HAVE_PERF_USER_STACK_DUMP
 	select HAVE_REGS_AND_STACK_ACCESS_API
+	select HAVE_RELIABLE_STACKTRACE		if X86_64 && FRAME_POINTER && STACK_VALIDATION
 	select HAVE_STACK_VALIDATION		if X86_64
 	select HAVE_SYSCALL_TRACEPOINTS
 	select HAVE_UNSTABLE_SCHED_CLOCK
diff --git a/arch/x86/include/asm/unwind.h b/arch/x86/include/asm/unwind.h
index c5a7f3a..44f86dc 100644
--- a/arch/x86/include/asm/unwind.h
+++ b/arch/x86/include/asm/unwind.h
@@ -11,6 +11,7 @@ struct unwind_state {
 	unsigned long stack_mask;
 	struct task_struct *task;
 	int graph_idx;
+	bool error;
 #ifdef CONFIG_FRAME_POINTER
 	unsigned long *bp;
 	struct pt_regs *regs;
@@ -40,6 +41,11 @@ void unwind_start(struct unwind_state *state, struct task_struct *task,
 	__unwind_start(state, task, regs, first_frame);
 }

+static inline bool unwind_error(struct unwind_state *state)
+{
+	return state->error;
+}
+
 #ifdef CONFIG_FRAME_POINTER

 static inline
diff --git a/arch/x86/kernel/stacktrace.c b/arch/x86/kernel/stacktrace.c
index 0653788..3e0cf5e 100644
--- a/arch/x86/kernel/stacktrace.c
+++ b/arch/x86/kernel/stacktrace.c
@@ -74,6 +74,64 @@ void save_stack_trace_tsk(struct task_struct *tsk, struct stack_trace *trace)
 }
 EXPORT_SYMBOL_GPL(save_stack_trace_tsk);

+#ifdef CONFIG_HAVE_RELIABLE_STACKTRACE
+static int __save_stack_trace_reliable(struct stack_trace *trace,
+				       struct task_struct *task)
+{
+	struct unwind_state state;
+	struct pt_regs *regs;
+	unsigned long addr;
+
+	for (unwind_start(&state, task, NULL, NULL); !unwind_done(&state);
+	     unwind_next_frame(&state)) {
+
+		regs = unwind_get_entry_regs(&state);
+		if (regs) {
+			/*
+			 * Preemption and page faults on the stack can make
+			 * frame pointers unreliable.
+			 */
+			if (!user_mode(regs))
+				return -1;
+
+			/*
+			 * This frame contains the (user mode) pt_regs at the
+			 * end of the stack.  Finish the unwind.
+			 */
+			unwind_next_frame(&state);
+			break;
+		}
+
+		addr = unwind_get_return_address(&state);
+		if (!addr || save_stack_address(trace, addr, false))
+			return -1;
+	}
+
+	if (!unwind_done(&state) || unwind_error(&state))
+		return -1;
+
+	if (trace->nr_entries < trace->max_entries)
+		trace->entries[trace->nr_entries++] = ULONG_MAX;
+
+	return 0;
+}
+
+int save_stack_trace_tsk_reliable(struct task_struct *tsk,
+				  struct stack_trace *trace)
+{
+	int ret;
+
+	if (!try_get_task_stack(tsk))
+		return -EINVAL;
+
+	ret = __save_stack_trace_reliable(trace, tsk);
+
+	put_task_stack(tsk);
+
+	return ret;
+}
+#endif /* CONFIG_HAVE_RELIABLE_STACKTRACE */
+
 /* Userspace stacktrace - based on kernel/trace/trace_sysprof.c */
 struct
[PATCH v3 00/15] livepatch: hybrid consistency model
Dusting the cobwebs off the consistency model again.  This is based on
linux-next/master.

v1 was posted on 2015-02-09:
  https://lkml.kernel.org/r/cover.1423499826.git.jpoim...@redhat.com

v2 was posted on 2016-04-28:
  https://lkml.kernel.org/r/cover.1461875890.git.jpoim...@redhat.com

The biggest issue from v2 was finding a decent way to detect preemption
and page faults on the stack of a sleeping task.  That problem was
solved by rewriting the x86 stack unwinder.  The new unwinder helps
detect such cases by finding all pt_regs on the stack.  When
preemption/page faults are detected, the stack is considered unreliable
and the patching of the task is deferred.

For more details about the consistency model, see patch 13/15.

---

v3:
- rebase on new x86 unwinder
- force !HAVE_RELIABLE_STACKTRACE arches to use patch->immediate for now,
  because we don't have a way to transition kthreads otherwise
- rebase s390 TIF_PATCH_PENDING patch onto latest entry code
- update barrier comments and move barrier from the end of
  klp_init_transition() to its callers
- "klp_work" -> "klp_transition_work"
- "klp_patch_task()" -> "klp_update_patch_state()"
- explicit _TIF_ALLWORK_MASK
- change klp_reverse_transition() to not try to complete transition.
  instead modify the work queue delay to zero.
- get rid of klp_schedule_work() in favor of calling
  schedule_delayed_work() directly with a KLP_TRANSITION_DELAY
- initialize klp_target_state to KLP_UNDEFINED
- move klp_target_state assignment to before patch->immediate check in
  klp_init_transition()
- rcu_read_lock() in klp_update_patch_state(), test the thread flag in
  patch task, synchronize_rcu() in klp_complete_transition()
- use kstrtobool() in enabled_store()
- change task_rq_lock() argument type to struct rq_flags
- add several WARN_ON_ONCE assertions for klp_target_state and
  task->patch_state

v2:
- "universe" -> "patch state"
- rename klp_update_task_universe() -> klp_patch_task()
- add preempt IRQ tracking (TF_PREEMPT_IRQ)
- fix print_context_stack_reliable() bug
- improve print_context_stack_reliable() comments
- klp_ftrace_handler comment fixes
- add "patch_state" proc file to tid_base_stuff
- schedule work even for !RELIABLE_STACKTRACE
- forked child inherits patch state from parent
- add detailed comment to livepatch.h klp_func definition about the
  klp_func patched/transition state transitions
- update exit_to_usermode_loop() comment
- clear all TIF_KLP_NEED_UPDATE flags in klp_complete_transition()
- remove unnecessary function externs
- add livepatch documentation, sysfs documentation, /proc documentation
- /proc/pid/patch_state: -1 means no patch is currently being
  applied/reverted
- "TIF_KLP_NEED_UPDATE" -> "TIF_PATCH_PENDING"
- support for s390 and powerpc-le
- don't assume stacks with dynamic ftrace trampolines are reliable
- add _TIF_ALLWORK_MASK info to commit log

v1.9:
- revive from the dead and rebased
- reliable stacks!
- add support for immediate consistency model
- add a ton of comments
- fix up memory barriers
- remove "allow patch modules to be removed" patch for now, it still
  needs more discussion and thought - it can be done with something
- "proc/pid/universe" -> "proc/pid/patch_status"
- remove WARN_ON_ONCE from !func condition in ftrace handler -- can
  happen because of RCU
- keep klp_mutex private by putting the work_fn in core.c
- convert states from int to boolean
- remove obsolete '@state' comments
- several header file and include improvements suggested by Jiri S
- change kallsyms_lookup_size_offset() errors from EINVAL -> ENOENT
- change proc file permissions S_IRUGO -> USR
- use klp_for_each_object/func helpers

---

Jiri Slaby (1):
      livepatch/s390: reorganize TIF thread flag bits

Josh Poimboeuf (12):
      stacktrace/x86: add function for detecting reliable stack traces
      x86/entry: define _TIF_ALLWORK_MASK flags explicitly
      livepatch: temporary stubs for klp_patch_pending() and
        klp_update_patch_state()
      livepatch/x86: add TIF_PATCH_PENDING thread flag
      livepatch/powerpc: add TIF_PATCH_PENDING thread flag
      livepatch: separate enabled and patched states
      livepatch: remove unnecessary object loaded check
      livepatch: move patching functions into patch.c
      livepatch: use kstrtobool() in enabled_store()
      livepatch: store function sizes
      livepatch: change to a per-task consistency model
      livepatch: add /proc/<pid>/patch_state

Miroslav Benes (2):
      livepatch/s390: add TIF_PATCH_PENDING thread flag
      livepatch: allow removal of a disabled patch

 Documentation/ABI/testing/sysfs-kernel-livepatch |   8 +
 Documentation/filesystems/proc.txt               |  18 +
 Documentation/livepatch/livepatch.txt            | 156 ++--
 arch/Kconfig                                     |   6 +
 arch/powerpc/include/asm/thread_info.h           |   4 +-
 arch/powerpc/kernel/signal.c                     |   4 +
 arch/s390/include/asm/thread_info.h              |  24 +-
 arch/s390/kernel/entry.S                         |  31 +-
Re: [PATCH 3/3] powerpc: enable support for GCC plugins
On Thu, Dec 8, 2016 at 6:42 AM, PaX Team wrote:
> On 6 Dec 2016 at 17:28, Andrew Donnellan wrote:
>
>> Enable support for GCC plugins on powerpc.
>>
>> Add an additional version check in gcc-plugins-check to advise users to
>> upgrade to gcc 5.2+ on powerpc to avoid issues with header files (gcc <=
>> 4.6) or missing copies of rs6000-cpus.def (4.8 to 5.1 on 64-bit targets).
>
> i don't think that this is the right approach. there's a general and a special
> issue here, both of which need different handling.
>
> the general problem is to detect problems related to gcc plugin headers and
> notify the users about solutions. emitting various messages from a Makefile
> is certainly not a scalable approach, just imagine how it will look when the
> other 30+ archs begin to add their own special cases... if anything, they
> should be documented in Documentation/gcc-plugins.txt (or a new doc if it
> grows too big) and the Makefile message should just point at it.
>
> as for the solutions, the general advice should enable the use of otherwise
> failing gcc versions instead of forcing updating to new ones (though the
> latter is advisable for other reasons but not everyone's in the position to
> do so easily). in my experience all one needs to do is manually install the
> missing files from the gcc sources (ideally distros would take care of it).
>
> the specific problem addressed here can (and IMHO should) be solved in
> another way: remove the inclusion of the offending headers in gcc-common.h
> as neither tm.h nor c-common.h are needed by existing plugins. for background,
> i created gcc-common.h to simplify plugin development across all supportable
> gcc versions i came across over the years, so it follows the 'everything but
> the kitchen sink' approach. that isn't necessarily what the kernel and other
> projects need so they should just use my version as a basis and fork/simplify
> it (even i maintain private forks of the public version).

If removing those will lower the requirement for PPC, that would be
ideal.  Otherwise, I'd like to take the practical approach of making the
plugins available on PPC right now, with an eye towards relaxing the
version requirement as people need it.

> as for the location of c-common.h, upstream gcc moved it under c-family in
> 2010 after the release of 4.5, so it should be where gcc-common.h expects
> it and i'm not sure how it ended up at its old location for you.

That is rather odd.  What distro was the PPC test done on?  (Or were
these manually built gcc versions?)

-Kees

-- 
Kees Cook
Nexus Security
Re: [PATCH kernel 9/9] KVM: PPC: Add in-kernel acceleration for VFIO
On Thu, 8 Dec 2016 19:19:56 +1100
Alexey Kardashevskiy wrote:

> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
> and H_STUFF_TCE requests targeted at an IOMMU TCE table used for VFIO
> without passing them to user space which saves time on switching
> to user space and back.
>
> This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
> KVM tries to handle a TCE request in the real mode, if failed
> it passes the request to the virtual mode to complete the operation.
> If a virtual mode handler fails, the request is passed to
> the user space; this is not expected to happen though.
>
> To avoid dealing with page use counters (which is tricky in real mode),
> this only accelerates SPAPR TCE IOMMU v2 clients which are required
> to pre-register the userspace memory. The very first TCE request will
> be handled in the VFIO SPAPR TCE driver anyway as the userspace view
> of the TCE table (iommu_table::it_userspace) is not allocated till
> the very first mapping happens and we cannot call vmalloc in real mode.
>
> This adds a new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to
> the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd
> and associates a physical IOMMU table with the SPAPR TCE table (which
> is a guest view of the hardware IOMMU table). The iommu_table object
> is referenced so we do not have to retrieve it in real mode when a
> hypercall happens.
>
> This does not implement the UNSET counterpart as there is no use for it -
> once the acceleration is enabled, the existing userspace won't
> disable it unless a VFIO container is destroyed so this adds necessary
> cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.
>
> This uses the kvm->lock mutex to protect against a race between
> the VFIO KVM device's kvm_vfio_destroy() and SPAPR TCE table fd's
> release() callback.
>
> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
> space.
>
> This finally makes use of vfio_external_user_iommu_id() which was
> introduced quite some time ago and was considered for removal.
>
> Tests show that this patch increases transmission speed from 220MB/s
> to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card).
>
> Signed-off-by: Alexey Kardashevskiy
> ---
>  Documentation/virtual/kvm/devices/vfio.txt |  21 +-
>  arch/powerpc/include/asm/kvm_host.h        |   8 +
>  arch/powerpc/include/asm/kvm_ppc.h         |   5 +
>  include/uapi/linux/kvm.h                   |   8 +
>  arch/powerpc/kvm/book3s_64_vio.c           | 302 ++++++++++++++++++++++
>  arch/powerpc/kvm/book3s_64_vio_hv.c        | 178 ++++++++++++
>  arch/powerpc/kvm/powerpc.c                 |   2 +
>  virt/kvm/vfio.c                            | 108 ++++++++
>  8 files changed, 630 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/virtual/kvm/devices/vfio.txt b/Documentation/virtual/kvm/devices/vfio.txt
> index ef51740c67ca..ddb5a6512ab3 100644
> --- a/Documentation/virtual/kvm/devices/vfio.txt
> +++ b/Documentation/virtual/kvm/devices/vfio.txt
> @@ -16,7 +16,24 @@ Groups:
>
>  KVM_DEV_VFIO_GROUP attributes:
>    KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
> +	kvm_device_attr.addr points to an int32_t file descriptor
> +	for the VFIO group.
>    KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device tracking
> +	kvm_device_attr.addr points to an int32_t file descriptor
> +	for the VFIO group.
> +  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: attaches a guest visible TCE table
> +	allocated by sPAPR KVM.
> +	kvm_device_attr.addr points to a struct:
>
> -For each, kvm_device_attr.addr points to an int32_t file descriptor
> -for the VFIO group.
> +	struct kvm_vfio_spapr_tce {
> +		__u32	argsz;
> +		__s32	groupfd;
> +		__s32	tablefd;
> +		__u8	pad[4];
> +	};
> +
> +	where
> +	@argsz is the size of kvm_vfio_spapr_tce_liobn;
> +	@groupfd is a file descriptor for a VFIO group;
> +	@tablefd is a file descriptor for a TCE table allocated via
> +		KVM_CREATE_SPAPR_TCE.
>
> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
> index 28350a294b1e..94774503c70d 100644
> --- a/arch/powerpc/include/asm/kvm_host.h
> +++ b/arch/powerpc/include/asm/kvm_host.h
> @@ -191,6 +191,13 @@ struct kvmppc_pginfo {
>  	atomic_t refcnt;
>  };
>
> +struct kvmppc_spapr_tce_iommu_table {
> +	struct rcu_head rcu;
> +	struct list_head next;
> +	struct iommu_table *tbl;
> +	atomic_t refs;
> +};
> +
>  struct kvmppc_spapr_tce_table {
>  	struct list_head list;
>  	struct kvm *kvm;
> @@ -199,6 +206,7 @@ struct kvmppc_spapr_tce_table {
>  	u32 page_shift;
>  	u64 offset;		/* in pages */
>  	u64 size;		/* window size in pages */
> +	struct list_head iommu_tables;
>  	struct page *pages[0];
>  };
>
> diff
Re: [PATCH 3/3] powerpc: enable support for GCC plugins
On 6 Dec 2016 at 17:28, Andrew Donnellan wrote:

> Enable support for GCC plugins on powerpc.
>
> Add an additional version check in gcc-plugins-check to advise users to
> upgrade to gcc 5.2+ on powerpc to avoid issues with header files (gcc <=
> 4.6) or missing copies of rs6000-cpus.def (4.8 to 5.1 on 64-bit targets).

i don't think that this is the right approach. there's a general and a special
issue here, both of which need different handling.

the general problem is to detect problems related to gcc plugin headers and
notify the users about solutions. emitting various messages from a Makefile
is certainly not a scalable approach, just imagine how it will look when the
other 30+ archs begin to add their own special cases... if anything, they
should be documented in Documentation/gcc-plugins.txt (or a new doc if it
grows too big) and the Makefile message should just point at it.

as for the solutions, the general advice should enable the use of otherwise
failing gcc versions instead of forcing updating to new ones (though the
latter is advisable for other reasons but not everyone's in the position to
do so easily). in my experience all one needs to do is manually install the
missing files from the gcc sources (ideally distros would take care of it).

the specific problem addressed here can (and IMHO should) be solved in
another way: remove the inclusion of the offending headers in gcc-common.h
as neither tm.h nor c-common.h are needed by existing plugins. for background,
i created gcc-common.h to simplify plugin development across all supportable
gcc versions i came across over the years, so it follows the 'everything but
the kitchen sink' approach. that isn't necessarily what the kernel and other
projects need so they should just use my version as a basis and fork/simplify
it (even i maintain private forks of the public version).

as for the location of c-common.h, upstream gcc moved it under c-family in
2010 after the release of 4.5, so it should be where gcc-common.h expects
it and i'm not sure how it ended up at its old location for you.

cheers,
 PaX Team
Re: [PATCH] ibmvscsi: add write memory barrier to CRQ processing
Reviewed-by: Brian King

-- 
Brian King
Power Linux I/O
IBM Linux Technology Center
Re: linux-next: build failure in the powerpc allyesconfig build
On Monday, December 5, 2016 4:22:04 PM CET Stephen Rothwell wrote:
> Hi all,
>
> After merging everything but Andrew's tree, today's linux-next build
> (powerpc allyesconfig) failed like this:
>
>   kallsyms failure: relative symbol value 0xc000 out of range in
>   relative mode
>
> I have no idea what caused this, so I have left the powerpc allyesconfig
> build broken for now.

I get this on an x86-64 randconfig build:

  kallsyms failure: relative symbol value 0x8100 out of range in relative mode

This is probably related, so it's not something powerpc specific.

	Arnd
Re: 4.9.0-rc8 - rcutorture test failure
On Thu, Dec 08, 2016 at 11:54:15AM +0530, Sachin Sant wrote:
> RCU Torture test on powerpc fails during its run against latest mainline
> (4.9.0-rc8) tree.
>
> 07:58:25 BUG: rcutorture tests failed !
> 07:58:25 21:31:00 ERROR| child process failed
> 07:58:25 21:31:00 INFO | ERROR rcutorture rcutorture timestamp=1481164260 localtime=Dec 07 21:31:00
> 07:58:25 BUG: rcutorture tests failed !
> 07:58:25 21:31:00 INFO | END ERROR rcutorture rcutorture timestamp=1481164260 localtime=Dec 07 21:31:00
>
> I have attached complete rcutorture run log.

Thank you for running this, Sachin! But I am not seeing this as a failure. The last status print from the log you attached is as follows:

07:58:25 [ 2778.876118] rcu-torture: rtc: (null) ver: 24968 tfle: 0 rta: 24968 rtaf: 0 rtf: 24959 rtmbe: 0 rtbe: 0 rtbke: 0 rtbre: 0 rtbf: 0 rtb: 0 nt: 10218404 onoff: 0/0:0/0 -1,0:-1,0 0:0 (HZ=250) barrier: 0/0:0 cbflood: 22703
07:58:25 [ 2778.876251] rcu-torture: Reader Pipe: 161849976604 399197 0 0 0 0 0 0 0 0 0
07:58:25 [ 2778.876438] rcu-torture: Reader Batch: 145090807711 16759538163 0 0 0 0 0 0 0 0 0
07:58:25 [ 2778.876625] rcu-torture: Free-Block Circulation: 24967 24967 24966 24965 24964 24963 24962 24961 24960 24959 0
07:58:25 [ 2778.876829] rcu-torture: --- End of test: SUCCESS: nreaders=79 nfakewriters=4 stat_interval=60 verbose=1 test_no_idle_hz=1 shuffle_interval=3 stutter=5 irqreader=1 fqs_duration=0 fqs_holdoff=0 fqs_stutter=3 test_boost=1/0 test_boost_interval=7 test_boost_duration=4 shutdown_secs=0 stall_cpu=0 stall_cpu_holdoff=10 n_barrier_cbs=0 onoff_interval=0 onoff_holdoff=0

The "SUCCESS" indicates that rcutorture thought that it succeeded. Also, in the "Reader Pipe" and "Reader Batch" lines, only the first two numbers in the series at the end of each line are non-zero, which also indicates a non-broken RCU. So could you please let me know what your scripting didn't like about this log?
Thanx, Paul Full log: 07:19:04 20:51:39 INFO | Test: running rcutorture tests 07:19:04 20:51:39 INFO | rcutorture 07:19:05 20:51:40 INFO |START rcutorture rcutorture timestamp=1481161900localtime=Dec 07 20:51:40 07:19:05 20:51:40 INFO | Check if CONFIG_RCU_TORTURE_TEST is enabled 07:19:05 07:19:05 [ 418.897476] rcu-torture:--- Start of test: nreaders=79 nfakewriters=4 stat_interval=60 verbose=1 test_no_idle_hz=1 shuffle_interval=3 stutter=5 irqreader=1 fqs_duration=0 fqs_holdoff=0 fqs_stutter=3 test_boost=1/0 test_boost_interval=7 test_boost_duration=4 shutdown_secs=0 stall_cpu=0 stall_cpu_holdoff=10 n_barrier_cbs=0 onoff_interval=0 onoff_holdoff=0 07:19:05 [ 418.897843] rcu-torture: Creating rcu_torture_writer task 07:19:05 [ 418.897935] rcu-torture: Creating rcu_torture_fakewriter task 07:19:05 [ 418.897941] rcu-torture: rcu_torture_writer task started 07:19:05 [ 418.898074] rcu-torture: Creating rcu_torture_fakewriter task 07:19:05 [ 418.898079] rcu-torture: rcu_torture_fakewriter task started 07:19:05 [ 418.898238] rcu-torture: Creating rcu_torture_fakewriter task 07:19:05 [ 418.898242] rcu-torture: rcu_torture_fakewriter task started 07:19:05 [ 418.898412] rcu-torture: Creating rcu_torture_fakewriter task 07:19:05 [ 418.898414] rcu-torture: rcu_torture_fakewriter task started 07:19:05 [ 418.898566] rcu-torture: Creating rcu_torture_reader task 07:19:05 [ 418.898569] rcu-torture: rcu_torture_fakewriter task started 07:19:05 [ 418.898711] rcu-torture: Creating rcu_torture_reader task 07:19:05 [ 418.898714] rcu-torture: rcu_torture_reader task started 07:19:05 [ 418.898840] rcu-torture: Creating rcu_torture_reader task 07:19:05 [ 418.898843] rcu-torture: rcu_torture_reader task started 07:19:05 [ 418.898970] rcu-torture: Creating rcu_torture_reader task 07:19:05 [ 418.898973] rcu-torture: rcu_torture_reader task started 07:19:05 [ 418.899099] rcu-torture: Creating rcu_torture_reader task 07:19:05 [ 418.899101] rcu-torture: rcu_torture_reader task started 07:19:05 
[ 418.899227] rcu-torture: Creating rcu_torture_reader task 07:19:05 [ 418.899230] rcu-torture: rcu_torture_reader task started 07:19:05 [ 418.899357] rcu-torture: Creating rcu_torture_reader task 07:19:05 [ 418.899360] rcu-torture: rcu_torture_reader task started 07:19:05 [ 418.899485] rcu-torture: Creating rcu_torture_reader task 07:19:05 [ 418.899488] rcu-torture: rcu_torture_reader task started 07:19:05 [ 418.899630] rcu-torture: Creating rcu_torture_reader task 07:19:05 [ 418.899633] rcu-torture: rcu_torture_reader task started 07:19:05 [ 418.899789] rcu-torture: Creating rcu_torture_reader task 07:19:05 [ 418.899790] rcu-torture: rcu_torture_reader task started 07:19:05 [ 418.899937] rcu-torture: Creating rcu_torture_reader task 07:19:05 [ 418.899940] rcu-torture:
Re: [PATCH] ibmvscsi: add write memory barrier to CRQ processing
On Wed, Dec 07, 2016 at 05:31:26PM -0600, Tyrel Datwyler wrote:
> The first byte of each CRQ entry is used to indicate whether an entry is
> a valid response or free for the VIOS to use. After processing a
> response the driver sets the valid byte to zero to indicate the entry is
> now free to be reused. Add a memory barrier after this write to ensure
> no other stores are reordered when updating the valid byte.
>
> Signed-off-by: Tyrel Datwyler
> ---
> drivers/scsi/ibmvscsi/ibmvscsi.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/drivers/scsi/ibmvscsi/ibmvscsi.c b/drivers/scsi/ibmvscsi/ibmvscsi.c
> index d9534ee..2f5b07e 100644
> --- a/drivers/scsi/ibmvscsi/ibmvscsi.c
> +++ b/drivers/scsi/ibmvscsi/ibmvscsi.c
> @@ -232,6 +232,7 @@ static void ibmvscsi_task(void *data)
> 	while ((crq = crq_queue_next_crq(&hostdata->queue)) != NULL) {
> 		ibmvscsi_handle_crq(crq, hostdata);
> 		crq->valid = VIOSRP_CRQ_FREE;
> +		wmb();
> 	}
>
> 	vio_enable_interrupts(vdev);
> @@ -240,6 +241,7 @@
> 		vio_disable_interrupts(vdev);
> 		ibmvscsi_handle_crq(crq, hostdata);
> 		crq->valid = VIOSRP_CRQ_FREE;
> +		wmb();
> 	} else {
> 		done = 1;
> 	}

Is this something you have seen in the wild or just a "better safe than sorry" barrier?

Thanks,
	Johannes

-- 
Johannes Thumshirn
Storage
jthumsh...@suse.de
+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
Re: [PATCH 2/2] ibmvscsi: log bad SRP response opcode in hex format
On Wed, Dec 07, 2016 at 04:04:36PM -0600, Tyrel Datwyler wrote:
> An unrecognized or unsupported SRP response has its opcode currently
> logged in decimal format. Log it in hex format instead so it can easily
> be validated against the SRP specs values which are in hex.
>
> Signed-off-by: Tyrel Datwyler
> ---

Looks good,

Reviewed-by: Johannes Thumshirn

-- 
Johannes Thumshirn
Storage
jthumsh...@suse.de
+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
Re: [PATCH 1/2] ibmvscsi: add vscsi hosts to global list_head
On Wed, Dec 07, 2016 at 04:04:35PM -0600, Tyrel Datwyler wrote:
> Add each vscsi host adapter to a new global list_head named
> ibmvscsi_head. There is no functional change. This is meant primarily
> as a convenience for locating adapters from within the debugger or crash
> utility.
>
> Signed-off-by: Tyrel Datwyler
> ---

Looks good,

Reviewed-by: Johannes Thumshirn

-- 
Johannes Thumshirn
Storage
jthumsh...@suse.de
+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
[PATCH kernel 7/9] KVM: PPC: Enable IOMMU_API for KVM_BOOK3S_64 permanently
It does not make much sense to have KVM in book3s-64 and not to have IOMMU bits for PCI pass through support as it costs little and allows VFIO to function on book3s KVM.

Having IOMMU_API always enabled makes it unnecessary to have a lot of "#ifdef IOMMU_API" in arch/powerpc/kvm/book3s_64_vio*. With those ifdef's we could have only user space emulated devices accelerated (but not VFIO), which does not seem very useful.

Signed-off-by: Alexey Kardashevskiy
---
 arch/powerpc/kvm/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
index 029be26b5a17..65a471de96de 100644
--- a/arch/powerpc/kvm/Kconfig
+++ b/arch/powerpc/kvm/Kconfig
@@ -67,6 +67,7 @@ config KVM_BOOK3S_64
 	select KVM_BOOK3S_64_HANDLER
 	select KVM
 	select KVM_BOOK3S_PR_POSSIBLE if !KVM_BOOK3S_HV_POSSIBLE
+	select SPAPR_TCE_IOMMU if IOMMU_SUPPORT
 	---help---
 	  Support running unmodified book3s_64 and book3s_32 guest kernels
 	  in virtual machines on book3s_64 host processors.
-- 
2.11.0
[PATCH kernel 6/9] powerpc/powernv/iommu: Add real mode version of iommu_table_ops::exchange()
In real mode, TCE tables are invalidated using special cache-inhibited store instructions which are not available in virtual mode. This defines and implements the exchange_rm() callback.

This does not define set_rm/clear_rm/flush_rm callbacks as there is no user for those - exchange/exchange_rm are only to be used by KVM for VFIO.

The exchange_rm callback is defined for the IODA1/IODA2 powernv platforms.

This replaces list_for_each_entry_rcu with its lockless version as from now on pnv_pci_ioda2_tce_invalidate() can be called in real mode too.

Signed-off-by: Alexey Kardashevskiy
---
 arch/powerpc/include/asm/iommu.h          |  7 +++
 arch/powerpc/kernel/iommu.c               | 23 +++
 arch/powerpc/platforms/powernv/pci-ioda.c | 26 +-
 3 files changed, 55 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 9de8bad1fdf9..82e77ebf85f4 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -64,6 +64,11 @@ struct iommu_table_ops {
 			long index,
 			unsigned long *hpa,
 			enum dma_data_direction *direction);
+	/* Real mode */
+	int (*exchange_rm)(struct iommu_table *tbl,
+			long index,
+			unsigned long *hpa,
+			enum dma_data_direction *direction);
 #endif
 	void (*clear)(struct iommu_table *tbl,
 			long index, long npages);
@@ -209,6 +214,8 @@ extern void iommu_del_device(struct device *dev);
 extern int __init tce_iommu_bus_notifier_init(void);
 extern long iommu_tce_xchg(struct iommu_table *tbl, unsigned long entry,
 		unsigned long *hpa, enum dma_data_direction *direction);
+extern long iommu_tce_xchg_rm(struct iommu_table *tbl, unsigned long entry,
+		unsigned long *hpa, enum dma_data_direction *direction);
 #else
 static inline void iommu_register_group(struct iommu_table_group *table_group,
 		int pci_domain_number,
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index d12496889ce9..d02b8d22fb50 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1022,6 +1022,29 @@ long
iommu_tce_xchg(struct iommu_table *tbl, unsigned long entry, } EXPORT_SYMBOL_GPL(iommu_tce_xchg); +long iommu_tce_xchg_rm(struct iommu_table *tbl, unsigned long entry, + unsigned long *hpa, enum dma_data_direction *direction) +{ + long ret; + + ret = tbl->it_ops->exchange_rm(tbl, entry, hpa, direction); + + if (!ret && ((*direction == DMA_FROM_DEVICE) || + (*direction == DMA_BIDIRECTIONAL))) { + struct page *pg = realmode_pfn_to_page(*hpa >> PAGE_SHIFT); + + if (likely(pg)) { + SetPageDirty(pg); + } else { + tbl->it_ops->exchange_rm(tbl, entry, hpa, direction); + ret = -EFAULT; + } + } + + return ret; +} +EXPORT_SYMBOL_GPL(iommu_tce_xchg_rm); + int iommu_take_ownership(struct iommu_table *tbl) { unsigned long flags, i, sz = (tbl->it_size + 7) >> 3; diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index ea181f02bebd..f2c2ab8fbb3e 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -1855,6 +1855,17 @@ static int pnv_ioda1_tce_xchg(struct iommu_table *tbl, long index, return ret; } + +static int pnv_ioda1_tce_xchg_rm(struct iommu_table *tbl, long index, + unsigned long *hpa, enum dma_data_direction *direction) +{ + long ret = pnv_tce_xchg(tbl, index, hpa, direction); + + if (!ret) + pnv_pci_p7ioc_tce_invalidate(tbl, index, 1, true); + + return ret; +} #endif static void pnv_ioda1_tce_free(struct iommu_table *tbl, long index, @@ -1869,6 +1880,7 @@ static struct iommu_table_ops pnv_ioda1_iommu_ops = { .set = pnv_ioda1_tce_build, #ifdef CONFIG_IOMMU_API .exchange = pnv_ioda1_tce_xchg, + .exchange_rm = pnv_ioda1_tce_xchg_rm, #endif .clear = pnv_ioda1_tce_free, .get = pnv_tce_get, @@ -1943,7 +1955,7 @@ static void pnv_pci_ioda2_tce_invalidate(struct iommu_table *tbl, { struct iommu_table_group_link *tgl; - list_for_each_entry_rcu(tgl, >it_group_list, next) { + list_for_each_entry_lockless(tgl, >it_group_list, next) { struct pnv_ioda_pe *pe = 
container_of(tgl->table_group, struct pnv_ioda_pe, table_group); struct pnv_phb *phb = pe->phb; @@ -1999,6 +2011,17 @@ static int pnv_ioda2_tce_xchg(struct iommu_table *tbl, long index, return ret; } + +static int pnv_ioda2_tce_xchg_rm(struct iommu_table *tbl, long index, +
[PATCH kernel 2/9] powerpc/iommu: Cleanup iommu_table disposal
At the moment an iommu_table can be disposed of by either calling iommu_free_table() directly or via it_ops::free(); the only implementation of free() is in IODA2 - pnv_ioda2_table_free() - and it calls iommu_free_table() anyway.

As we are going to have reference counting on tables, we need a unified way of disposing of tables.

This moves the it_ops::free() call into iommu_free_table() and makes use of the latter. The free() callback now handles only platform-specific data.

This should cause no behavioral change.

Signed-off-by: Alexey Kardashevskiy
---
 arch/powerpc/kernel/iommu.c               | 4
 arch/powerpc/platforms/powernv/pci-ioda.c | 6 ++
 drivers/vfio/vfio_iommu_spapr_tce.c       | 2 +-
 3 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 5f202a566ec5..6744a2771769 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -719,6 +719,9 @@ void iommu_free_table(struct iommu_table *tbl, const char *node_name)
 	if (!tbl)
 		return;
 
+	if (tbl->it_ops->free)
+		tbl->it_ops->free(tbl);
+
 	if (!tbl->it_map) {
 		kfree(tbl);
 		return;
@@ -745,6 +748,7 @@ void iommu_free_table(struct iommu_table *tbl, const char *node_name)
 	/* free table */
 	kfree(tbl);
 }
+EXPORT_SYMBOL_GPL(iommu_free_table);
 
 /* Creates TCEs for a user provided buffer. The user buffer must be
  * contiguous real kernel storage (not vmalloc).
The address passed here diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index 5fcae29107e1..c4f9e812ca6c 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -1422,7 +1422,6 @@ static void pnv_pci_ioda2_release_dma_pe(struct pci_dev *dev, struct pnv_ioda_pe iommu_group_put(pe->table_group.group); BUG_ON(pe->table_group.group); } - pnv_pci_ioda2_table_free_pages(tbl); iommu_free_table(tbl, of_node_full_name(dev->dev.of_node)); } @@ -2013,7 +2012,6 @@ static void pnv_ioda2_tce_free(struct iommu_table *tbl, long index, static void pnv_ioda2_table_free(struct iommu_table *tbl) { pnv_pci_ioda2_table_free_pages(tbl); - iommu_free_table(tbl, "pnv"); } static struct iommu_table_ops pnv_ioda2_iommu_ops = { @@ -2339,7 +2337,7 @@ static long pnv_pci_ioda2_setup_default_config(struct pnv_ioda_pe *pe) if (rc) { pe_err(pe, "Failed to configure 32-bit TCE table, err %ld\n", rc); - pnv_ioda2_table_free(tbl); + iommu_free_table(tbl, ""); return rc; } @@ -2425,7 +2423,7 @@ static void pnv_ioda2_take_ownership(struct iommu_table_group *table_group) pnv_pci_ioda2_set_bypass(pe, false); pnv_pci_ioda2_unset_window(>table_group, 0); - pnv_ioda2_table_free(tbl); + iommu_free_table(tbl, "pnv"); } static void pnv_ioda2_release_ownership(struct iommu_table_group *table_group) diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c index c8823578a1b2..cbac08af400e 100644 --- a/drivers/vfio/vfio_iommu_spapr_tce.c +++ b/drivers/vfio/vfio_iommu_spapr_tce.c @@ -677,7 +677,7 @@ static void tce_iommu_free_table(struct tce_container *container, unsigned long pages = tbl->it_allocated_size >> PAGE_SHIFT; tce_iommu_userspace_view_free(tbl, container->mm); - tbl->it_ops->free(tbl); + iommu_free_table(tbl, ""); decrement_locked_vm(container->mm, pages); } -- 2.11.0
[PATCH kernel 9/9] KVM: PPC: Add in-kernel acceleration for VFIO
This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT and H_STUFF_TCE requests targeted at an IOMMU TCE table used for VFIO without passing them to user space, which saves time on switching to user space and back.

This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM. KVM tries to handle a TCE request in real mode; if that fails, it passes the request to virtual mode to complete the operation. If the virtual mode handler fails as well, the request is passed to user space; this is not expected to happen though.

To avoid dealing with page use counters (which is tricky in real mode), this only accelerates SPAPR TCE IOMMU v2 clients, which are required to pre-register the userspace memory. The very first TCE request will be handled in the VFIO SPAPR TCE driver anyway as the userspace view of the TCE table (iommu_table::it_userspace) is not allocated till the very first mapping happens and we cannot call vmalloc in real mode.

This adds a new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to the VFIO KVM device. It takes a VFIO group fd and an SPAPR TCE table fd and associates a physical IOMMU table with the SPAPR TCE table (which is a guest view of the hardware IOMMU table). The iommu_table object is referenced so we do not have to retrieve it in real mode when the hypercall happens.

This does not implement the UNSET counterpart as there is no use for it - once the acceleration is enabled, the existing userspace won't disable it unless a VFIO container is destroyed - so this adds the necessary cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.

This uses the kvm->lock mutex to protect against a race between the VFIO KVM device's kvm_vfio_destroy() and the SPAPR TCE table fd's release() callback.

This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to user space.

This finally makes use of vfio_external_user_iommu_id() which was introduced quite some time ago and was considered for removal.
Tests show that this patch increases transmission speed from 220MB/s to 750..1020MB/s on a 10Gb network (Chelsio CXGB3 10Gb ethernet card).

Signed-off-by: Alexey Kardashevskiy
---
 Documentation/virtual/kvm/devices/vfio.txt |  21 +-
 arch/powerpc/include/asm/kvm_host.h        |   8 +
 arch/powerpc/include/asm/kvm_ppc.h         |   5 +
 include/uapi/linux/kvm.h                   |   8 +
 arch/powerpc/kvm/book3s_64_vio.c           | 302 +
 arch/powerpc/kvm/book3s_64_vio_hv.c        | 178 +
 arch/powerpc/kvm/powerpc.c                 |   2 +
 virt/kvm/vfio.c                            | 108 +++
 8 files changed, 630 insertions(+), 2 deletions(-)

diff --git a/Documentation/virtual/kvm/devices/vfio.txt b/Documentation/virtual/kvm/devices/vfio.txt
index ef51740c67ca..ddb5a6512ab3 100644
--- a/Documentation/virtual/kvm/devices/vfio.txt
+++ b/Documentation/virtual/kvm/devices/vfio.txt
@@ -16,7 +16,24 @@ Groups:
 
 KVM_DEV_VFIO_GROUP attributes:
   KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
+	kvm_device_attr.addr points to an int32_t file descriptor
+	for the VFIO group.
   KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device tracking
+	kvm_device_attr.addr points to an int32_t file descriptor
+	for the VFIO group.
+  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: attaches a guest visible TCE table
+	allocated by sPAPR KVM.
+	kvm_device_attr.addr points to a struct:
 
-For each, kvm_device_attr.addr points to an int32_t file descriptor
-for the VFIO group.
+	struct kvm_vfio_spapr_tce {
+		__u32	argsz;
+		__s32	groupfd;
+		__s32	tablefd;
+		__u8	pad[4];
+	};
+
+	where
+	@argsz is the size of kvm_vfio_spapr_tce_liobn;
+	@groupfd is a file descriptor for a VFIO group;
+	@tablefd is a file descriptor for a TCE table allocated via
+		KVM_CREATE_SPAPR_TCE.
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h index 28350a294b1e..94774503c70d 100644 --- a/arch/powerpc/include/asm/kvm_host.h +++ b/arch/powerpc/include/asm/kvm_host.h @@ -191,6 +191,13 @@ struct kvmppc_pginfo { atomic_t refcnt; }; +struct kvmppc_spapr_tce_iommu_table { + struct rcu_head rcu; + struct list_head next; + struct iommu_table *tbl; + atomic_t refs; +}; + struct kvmppc_spapr_tce_table { struct list_head list; struct kvm *kvm; @@ -199,6 +206,7 @@ struct kvmppc_spapr_tce_table { u32 page_shift; u64 offset; /* in pages */ u64 size; /* window size in pages */ + struct list_head iommu_tables; struct page *pages[0]; }; diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h index 0a21c8503974..17b947a0060d 100644 --- a/arch/powerpc/include/asm/kvm_ppc.h +++ b/arch/powerpc/include/asm/kvm_ppc.h @@ -163,6 +163,11 @@ extern long
[PATCH kernel 8/9] KVM: PPC: Pass kvm* to kvmppc_find_table()
The guest view TCE tables are per KVM anyway (not per VCPU) so pass kvm* there. This will be used in the following patches where we will be attaching VFIO containers to LIOBNs via ioctl() to KVM (rather than to VCPU). Signed-off-by: Alexey Kardashevskiy--- arch/powerpc/include/asm/kvm_ppc.h | 2 +- arch/powerpc/kvm/book3s_64_vio.c| 7 --- arch/powerpc/kvm/book3s_64_vio_hv.c | 13 +++-- 3 files changed, 12 insertions(+), 10 deletions(-) diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h index f6e49640dbe1..0a21c8503974 100644 --- a/arch/powerpc/include/asm/kvm_ppc.h +++ b/arch/powerpc/include/asm/kvm_ppc.h @@ -167,7 +167,7 @@ extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu); extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm, struct kvm_create_spapr_tce_64 *args); extern struct kvmppc_spapr_tce_table *kvmppc_find_table( - struct kvm_vcpu *vcpu, unsigned long liobn); + struct kvm *kvm, unsigned long liobn); extern long kvmppc_ioba_validate(struct kvmppc_spapr_tce_table *stt, unsigned long ioba, unsigned long npages); extern long kvmppc_tce_validate(struct kvmppc_spapr_tce_table *tt, diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c index c379ff5a4438..15df8ae627d9 100644 --- a/arch/powerpc/kvm/book3s_64_vio.c +++ b/arch/powerpc/kvm/book3s_64_vio.c @@ -212,12 +212,13 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm, long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn, unsigned long ioba, unsigned long tce) { - struct kvmppc_spapr_tce_table *stt = kvmppc_find_table(vcpu, liobn); + struct kvmppc_spapr_tce_table *stt; long ret; /* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */ /* liobn, ioba, tce); */ + stt = kvmppc_find_table(vcpu->kvm, liobn); if (!stt) return H_TOO_HARD; @@ -245,7 +246,7 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu, u64 __user *tces; u64 tce; - stt = kvmppc_find_table(vcpu, liobn); + stt = kvmppc_find_table(vcpu->kvm, liobn); 
if (!stt) return H_TOO_HARD; @@ -299,7 +300,7 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu, struct kvmppc_spapr_tce_table *stt; long i, ret; - stt = kvmppc_find_table(vcpu, liobn); + stt = kvmppc_find_table(vcpu->kvm, liobn); if (!stt) return H_TOO_HARD; diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c index a3be4bd6188f..8a6834e6e1c8 100644 --- a/arch/powerpc/kvm/book3s_64_vio_hv.c +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c @@ -49,10 +49,9 @@ * WARNING: This will be called in real or virtual mode on HV KVM and virtual * mode on PR KVM */ -struct kvmppc_spapr_tce_table *kvmppc_find_table(struct kvm_vcpu *vcpu, +struct kvmppc_spapr_tce_table *kvmppc_find_table(struct kvm *kvm, unsigned long liobn) { - struct kvm *kvm = vcpu->kvm; struct kvmppc_spapr_tce_table *stt; list_for_each_entry_lockless(stt, >arch.spapr_tce_tables, list) @@ -194,12 +193,13 @@ static struct mm_iommu_table_group_mem_t *kvmppc_rm_iommu_lookup( long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn, unsigned long ioba, unsigned long tce) { - struct kvmppc_spapr_tce_table *stt = kvmppc_find_table(vcpu, liobn); + struct kvmppc_spapr_tce_table *stt; long ret; /* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */ /* liobn, ioba, tce); */ + stt = kvmppc_find_table(vcpu->kvm, liobn); if (!stt) return H_TOO_HARD; @@ -252,7 +252,7 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu, unsigned long tces, entry, ua = 0; unsigned long *rmap = NULL; - stt = kvmppc_find_table(vcpu, liobn); + stt = kvmppc_find_table(vcpu->kvm, liobn); if (!stt) return H_TOO_HARD; @@ -335,7 +335,7 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu, struct kvmppc_spapr_tce_table *stt; long i, ret; - stt = kvmppc_find_table(vcpu, liobn); + stt = kvmppc_find_table(vcpu->kvm, liobn); if (!stt) return H_TOO_HARD; @@ -356,12 +356,13 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu, long kvmppc_h_get_tce(struct kvm_vcpu *vcpu, unsigned long liobn, unsigned 
long ioba) { - struct kvmppc_spapr_tce_table *stt = kvmppc_find_table(vcpu, liobn); + struct kvmppc_spapr_tce_table *stt; long ret; unsigned long idx; struct page *page; u64 *tbl; + stt = kvmppc_find_table(vcpu->kvm, liobn); if (!stt) return H_TOO_HARD; -- 2.11.0
[PATCH kernel 5/9] KVM: PPC: Use preregistered memory API to access TCE list
VFIO on sPAPR already implements guest memory pre-registration when the entire guest RAM gets pinned. This can be used to translate the physical address of a guest page containing the TCE list from H_PUT_TCE_INDIRECT.

This makes use of the pre-registered memory API to access TCE list pages in order to avoid unnecessary locking on the KVM memory reverse map as we know that all of guest memory is pinned and we have a flat array mapping GPA to HPA which makes it simpler and quicker to index into that array (even with looking up the kernel page tables in vmalloc_to_phys) than it is to find the memslot, lock the rmap entry, look up the user page tables, and unlock the rmap entry.

Note that the rmap pointer is initialized to NULL where declared (not in this patch).

Signed-off-by: Alexey Kardashevskiy
---
Changes:
v2:
* updated the commit log with Paul's comment
---
 arch/powerpc/kvm/book3s_64_vio_hv.c | 65 -
 1 file changed, 49 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index d461c440889a..a3be4bd6188f 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -180,6 +180,17 @@ long kvmppc_gpa_to_ua(struct kvm *kvm, unsigned long gpa,
 EXPORT_SYMBOL_GPL(kvmppc_gpa_to_ua);
 
 #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
+static inline bool kvmppc_preregistered(struct kvm_vcpu *vcpu)
+{
+	return mm_iommu_preregistered(vcpu->kvm->mm);
+}
+
+static struct mm_iommu_table_group_mem_t *kvmppc_rm_iommu_lookup(
+		struct kvm_vcpu *vcpu, unsigned long ua, unsigned long size)
+{
+	return mm_iommu_lookup_rm(vcpu->kvm->mm, ua, size);
+}
+
 long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 		unsigned long ioba, unsigned long tce)
 {
@@ -260,23 +271,44 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	if (ret != H_SUCCESS)
 		return ret;
 
-	if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, , ))
-		return H_TOO_HARD;
+	if (kvmppc_preregistered(vcpu)) {
+		/*
+		 * We get here if guest
memory was pre-registered which +* is normally VFIO case and gpa->hpa translation does not +* depend on hpt. +*/ + struct mm_iommu_table_group_mem_t *mem; - rmap = (void *) vmalloc_to_phys(rmap); + if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, , NULL)) + return H_TOO_HARD; - /* -* Synchronize with the MMU notifier callbacks in -* book3s_64_mmu_hv.c (kvm_unmap_hva_hv etc.). -* While we have the rmap lock, code running on other CPUs -* cannot finish unmapping the host real page that backs -* this guest real page, so we are OK to access the host -* real page. -*/ - lock_rmap(rmap); - if (kvmppc_rm_ua_to_hpa(vcpu, ua, )) { - ret = H_TOO_HARD; - goto unlock_exit; + mem = kvmppc_rm_iommu_lookup(vcpu, ua, IOMMU_PAGE_SIZE_4K); + if (!mem || mm_iommu_ua_to_hpa_rm(mem, ua, )) + return H_TOO_HARD; + } else { + /* +* This is emulated devices case. +* We do not require memory to be preregistered in this case +* so lock rmap and do __find_linux_pte_or_hugepte(). +*/ + if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, , )) + return H_TOO_HARD; + + rmap = (void *) vmalloc_to_phys(rmap); + + /* +* Synchronize with the MMU notifier callbacks in +* book3s_64_mmu_hv.c (kvm_unmap_hva_hv etc.). +* While we have the rmap lock, code running on other CPUs +* cannot finish unmapping the host real page that backs +* this guest real page, so we are OK to access the host +* real page. +*/ + lock_rmap(rmap); + if (kvmppc_rm_ua_to_hpa(vcpu, ua, )) { + ret = H_TOO_HARD; + goto unlock_exit; + } } for (i = 0; i < npages; ++i) { @@ -290,7 +322,8 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu, } unlock_exit: - unlock_rmap(rmap); + if (rmap) + unlock_rmap(rmap); return ret; } -- 2.11.0
[PATCH kernel 4/9] powerpc/mmu: Add real mode support for IOMMU preregistered memory
This makes mm_iommu_lookup() able to work in realmode by replacing list_for_each_entry_rcu() (which can do debug stuff which can fail in real mode) with list_for_each_entry_lockless(). This adds realmode version of mm_iommu_ua_to_hpa() which adds explicit vmalloc'd-to-linear address conversion. Unlike mm_iommu_ua_to_hpa(), mm_iommu_ua_to_hpa_rm() can fail. This changes mm_iommu_preregistered() to receive @mm as in real mode @current does not always have a correct pointer. This adds realmode version of mm_iommu_lookup() which receives @mm (for the same reason as for mm_iommu_preregistered()) and uses lockless version of list_for_each_entry_rcu(). Signed-off-by: Alexey Kardashevskiy--- arch/powerpc/include/asm/mmu_context.h | 4 arch/powerpc/mm/mmu_context_iommu.c| 39 ++ 2 files changed, 43 insertions(+) diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h index b9e3f0aca261..c70c8272523d 100644 --- a/arch/powerpc/include/asm/mmu_context.h +++ b/arch/powerpc/include/asm/mmu_context.h @@ -29,10 +29,14 @@ extern void mm_iommu_init(struct mm_struct *mm); extern void mm_iommu_cleanup(struct mm_struct *mm); extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup(struct mm_struct *mm, unsigned long ua, unsigned long size); +extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup_rm( + struct mm_struct *mm, unsigned long ua, unsigned long size); extern struct mm_iommu_table_group_mem_t *mm_iommu_find(struct mm_struct *mm, unsigned long ua, unsigned long entries); extern long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem, unsigned long ua, unsigned long *hpa); +extern long mm_iommu_ua_to_hpa_rm(struct mm_iommu_table_group_mem_t *mem, + unsigned long ua, unsigned long *hpa); extern long mm_iommu_mapped_inc(struct mm_iommu_table_group_mem_t *mem); extern void mm_iommu_mapped_dec(struct mm_iommu_table_group_mem_t *mem); #endif diff --git a/arch/powerpc/mm/mmu_context_iommu.c b/arch/powerpc/mm/mmu_context_iommu.c index 
104bad029ce9..631d32f5937b 100644 --- a/arch/powerpc/mm/mmu_context_iommu.c +++ b/arch/powerpc/mm/mmu_context_iommu.c @@ -314,6 +314,25 @@ struct mm_iommu_table_group_mem_t *mm_iommu_lookup(struct mm_struct *mm, } EXPORT_SYMBOL_GPL(mm_iommu_lookup); +struct mm_iommu_table_group_mem_t *mm_iommu_lookup_rm(struct mm_struct *mm, + unsigned long ua, unsigned long size) +{ + struct mm_iommu_table_group_mem_t *mem, *ret = NULL; + + list_for_each_entry_lockless(mem, >context.iommu_group_mem_list, + next) { + if ((mem->ua <= ua) && + (ua + size <= mem->ua + +(mem->entries << PAGE_SHIFT))) { + ret = mem; + break; + } + } + + return ret; +} +EXPORT_SYMBOL_GPL(mm_iommu_lookup_rm); + struct mm_iommu_table_group_mem_t *mm_iommu_find(struct mm_struct *mm, unsigned long ua, unsigned long entries) { @@ -345,6 +364,26 @@ long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem, } EXPORT_SYMBOL_GPL(mm_iommu_ua_to_hpa); +long mm_iommu_ua_to_hpa_rm(struct mm_iommu_table_group_mem_t *mem, + unsigned long ua, unsigned long *hpa) +{ + const long entry = (ua - mem->ua) >> PAGE_SHIFT; + void *va = >hpas[entry]; + unsigned long *pa; + + if (entry >= mem->entries) + return -EFAULT; + + pa = (void *) vmalloc_to_phys(va); + if (!pa) + return -EFAULT; + + *hpa = *pa | (ua & ~PAGE_MASK); + + return 0; +} +EXPORT_SYMBOL_GPL(mm_iommu_ua_to_hpa_rm); + long mm_iommu_mapped_inc(struct mm_iommu_table_group_mem_t *mem) { if (atomic64_inc_not_zero(>mapped)) -- 2.11.0
[PATCH kernel 3/9] powerpc/vfio_spapr_tce: Add reference counting to iommu_table
So far iommu_table objects were only used in virtual mode and had
a single owner. We are going to change this by implementing in-kernel
acceleration of DMA mapping requests. The proposed acceleration will
handle requests in real mode and KVM will keep references to tables.

This adds a kref to iommu_table and defines new helpers to update it.
This replaces iommu_free_table() with iommu_table_put() and makes
iommu_free_table() static. iommu_table_get() is not used in this patch
but it will be in the following patch.

Since this touches prototypes, this also removes the @node_name
parameter as it has never been really useful on powernv and carrying
it for the pseries platform code to iommu_free_table() seems to be
quite useless as well.

This should cause no behavioral change.

Signed-off-by: Alexey Kardashevskiy
---
 arch/powerpc/include/asm/iommu.h          |  5 +++--
 arch/powerpc/kernel/iommu.c               | 24 +++++++++++++++++-------
 arch/powerpc/platforms/powernv/pci-ioda.c | 14 +++++++-------
 arch/powerpc/platforms/powernv/pci.c      |  1 +
 arch/powerpc/platforms/pseries/iommu.c    |  3 ++-
 arch/powerpc/platforms/pseries/vio.c      |  2 +-
 drivers/vfio/vfio_iommu_spapr_tce.c       |  2 +-
 7 files changed, 34 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 2c1d50792944..9de8bad1fdf9 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -114,6 +114,7 @@ struct iommu_table {
 	struct list_head it_group_list;/* List of iommu_table_group_link */
 	unsigned long *it_userspace; /* userspace view of the table */
 	struct iommu_table_ops *it_ops;
+	struct kref	it_kref;
 };
 
 #define IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry) \
@@ -146,8 +147,8 @@ static inline void *get_iommu_table_base(struct device *dev)
 
 extern int dma_iommu_dma_supported(struct device *dev, u64 mask);
 
-/* Frees table for an individual device node */
-extern void iommu_free_table(struct iommu_table *tbl, const char *node_name);
+extern void iommu_table_get(struct iommu_table *tbl);
+extern void iommu_table_put(struct iommu_table *tbl);
 
 /* Initializes an iommu_table based in values set in the passed-in
  * structure
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 6744a2771769..d12496889ce9 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -711,13 +711,13 @@ struct iommu_table *iommu_init_table(struct iommu_table *tbl, int nid)
 	return tbl;
 }
 
-void iommu_free_table(struct iommu_table *tbl, const char *node_name)
+static void iommu_table_free(struct kref *kref)
 {
 	unsigned long bitmap_sz;
 	unsigned int order;
+	struct iommu_table *tbl;
 
-	if (!tbl)
-		return;
+	tbl = container_of(kref, struct iommu_table, it_kref);
 
 	if (tbl->it_ops->free)
 		tbl->it_ops->free(tbl);
@@ -736,7 +736,7 @@ void iommu_free_table(struct iommu_table *tbl, const char *node_name)
 
 	/* verify that table contains no entries */
 	if (!bitmap_empty(tbl->it_map, tbl->it_size))
-		pr_warn("%s: Unexpected TCEs for %s\n", __func__, node_name);
+		pr_warn("%s: Unexpected TCEs\n", __func__);
 
 	/* calculate bitmap size in bytes */
 	bitmap_sz = BITS_TO_LONGS(tbl->it_size) * sizeof(unsigned long);
@@ -748,7 +748,21 @@ void iommu_free_table(struct iommu_table *tbl, const char *node_name)
 	/* free table */
 	kfree(tbl);
 }
-EXPORT_SYMBOL_GPL(iommu_free_table);
+
+void iommu_table_get(struct iommu_table *tbl)
+{
+	kref_get(&tbl->it_kref);
+}
+EXPORT_SYMBOL_GPL(iommu_table_get);
+
+void iommu_table_put(struct iommu_table *tbl)
+{
+	if (!tbl)
+		return;
+
+	kref_put(&tbl->it_kref, iommu_table_free);
+}
+EXPORT_SYMBOL_GPL(iommu_table_put);
 
 /* Creates TCEs for a user provided buffer.  The user buffer must be
  * contiguous real kernel storage (not vmalloc).  The address passed here
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index c4f9e812ca6c..ea181f02bebd 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1422,7 +1422,7 @@ static void pnv_pci_ioda2_release_dma_pe(struct pci_dev *dev, struct pnv_ioda_pe
 		iommu_group_put(pe->table_group.group);
 		BUG_ON(pe->table_group.group);
 	}
-	iommu_free_table(tbl, of_node_full_name(dev->dev.of_node));
+	iommu_table_put(tbl);
 }
 
 static void pnv_ioda_release_vf_PE(struct pci_dev *pdev)
@@ -2197,7 +2197,7 @@ static void pnv_pci_ioda1_setup_dma_pe(struct pnv_phb *phb,
 		__free_pages(tce_mem, get_order(tce32_segsz * segs));
 	if (tbl) {
 		pnv_pci_unlink_table_and_group(tbl, &pe->table_group);
-		iommu_free_table(tbl, "pnv");
+		iommu_table_put(tbl);
[PATCH kernel 1/9] KVM: PPC: Reserve KVM_CAP_SPAPR_TCE_VFIO capability number
This adds a capability number for in-kernel support for VFIO on
the SPAPR platform.

The capability will tell the user space whether the in-kernel handlers
of H_PUT_TCE can handle VFIO-targeted requests or not. If not, the user
space must not attempt allocating a TCE table in the host kernel via
the KVM_CREATE_SPAPR_TCE KVM ioctl because in that case TCE requests
will not be passed to the user space, which is the desired action in
such a situation.

Signed-off-by: Alexey Kardashevskiy
Reviewed-by: David Gibson
---
 include/uapi/linux/kvm.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 300ef255d1e0..810f74317987 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -870,6 +870,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_S390_USER_INSTR0 130
 #define KVM_CAP_MSI_DEVID 131
 #define KVM_CAP_PPC_HTM 132
+#define KVM_CAP_SPAPR_TCE_VFIO 133
 
 #ifdef KVM_CAP_IRQ_ROUTING
-- 
2.11.0
[PATCH kernel 0/9] powerpc/kvm/vfio: Enable in-kernel acceleration
This is my current queue of patches to add acceleration of TCE
updates in KVM.

This is based on the "next" branch of
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git

I am not doing a changelog here as it is 4 months since the last respin
and I am sure everybody lost the context anyway. I tried to be as
detailed as I could in the very last patch; the others are pretty
trivial anyway.

Please comment. Thanks.

Alexey Kardashevskiy (9):
  KVM: PPC: Reserve KVM_CAP_SPAPR_TCE_VFIO capability number
  powerpc/iommu: Cleanup iommu_table disposal
  powerpc/vfio_spapr_tce: Add reference counting to iommu_table
  powerpc/mmu: Add real mode support for IOMMU preregistered memory
  KVM: PPC: Use preregistered memory API to access TCE list
  powerpc/powernv/iommu: Add real mode version of iommu_table_ops::exchange()
  KVM: PPC: Enable IOMMU_API for KVM_BOOK3S_64 permanently
  KVM: PPC: Pass kvm* to kvmppc_find_table()
  KVM: PPC: Add in-kernel acceleration for VFIO

 Documentation/virtual/kvm/devices/vfio.txt |  21 +-
 arch/powerpc/include/asm/iommu.h           |  12 +-
 arch/powerpc/include/asm/kvm_host.h        |   8 +
 arch/powerpc/include/asm/kvm_ppc.h         |   7 +-
 arch/powerpc/include/asm/mmu_context.h     |   4 +
 include/uapi/linux/kvm.h                   |   9 +
 arch/powerpc/kernel/iommu.c                |  49 +++-
 arch/powerpc/kvm/book3s_64_vio.c           | 309 ++++++++++++++++++++-
 arch/powerpc/kvm/book3s_64_vio_hv.c        | 256 ++++++++++++++---
 arch/powerpc/kvm/powerpc.c                 |   2 +
 arch/powerpc/mm/mmu_context_iommu.c        |  39 +++
 arch/powerpc/platforms/powernv/pci-ioda.c  |  42 ++-
 arch/powerpc/platforms/powernv/pci.c       |   1 +
 arch/powerpc/platforms/pseries/iommu.c     |   3 +-
 arch/powerpc/platforms/pseries/vio.c       |   2 +-
 drivers/vfio/vfio_iommu_spapr_tce.c        |   2 +-
 virt/kvm/vfio.c                            | 108 ++++++++
 arch/powerpc/kvm/Kconfig                   |   1 +

 18 files changed, 828 insertions(+), 47 deletions(-)

-- 
2.11.0