Re: [PATCH kernel 9/9] KVM: PPC: Add in-kernel acceleration for VFIO

2016-12-08 Thread Alexey Kardashevskiy
On 09/12/16 04:55, Alex Williamson wrote:
> On Thu,  8 Dec 2016 19:19:56 +1100
> Alexey Kardashevskiy  wrote:
> 
>> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
>> and H_STUFF_TCE requests targeted at an IOMMU TCE table used for VFIO
>> without passing them to user space, which saves time on switching
>> to user space and back.
>>
>> This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
>> KVM tries to handle a TCE request in real mode; if that fails,
>> it passes the request to virtual mode to complete the operation.
>> If the virtual mode handler also fails, the request is passed to
>> user space; this is not expected to happen though.
>>
>> To avoid dealing with page use counters (which is tricky in real mode),
>> this only accelerates SPAPR TCE IOMMU v2 clients which are required
>> to pre-register the userspace memory. The very first TCE request will
>> be handled in the VFIO SPAPR TCE driver anyway as the userspace view
>> of the TCE table (iommu_table::it_userspace) is not allocated till
>> the very first mapping happens and we cannot call vmalloc in real mode.
>>
>> This adds a new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to
>> the VFIO KVM device. It takes a VFIO group fd and a SPAPR TCE table fd
>> and associates a physical IOMMU table with the SPAPR TCE table (which
>> is a guest view of the hardware IOMMU table). The iommu_table object
>> is referenced so we do not have to retrieve it in real mode when a
>> hypercall happens.
>>
>> This does not implement the UNSET counterpart as there is no use for it -
>> once the acceleration is enabled, the existing userspace won't
>> disable it unless a VFIO container is destroyed - so this adds the
>> necessary cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.
>>
>> This uses the kvm->lock mutex to protect against a race between
>> the VFIO KVM device's kvm_vfio_destroy() and SPAPR TCE table fd's
>> release() callback.
>>
>> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to user
>> space.
>>
>> This finally makes use of vfio_external_user_iommu_id() which was
>> introduced quite some time ago and was considered for removal.
>>
>> Tests show that this patch increases transmission speed from 220MB/s
>> to 750..1020MB/s on a 10Gb network (Chelsio CXGB3 10Gb ethernet card).
>>
>> Signed-off-by: Alexey Kardashevskiy 
>> ---
>>  Documentation/virtual/kvm/devices/vfio.txt |  21 +-
>>  arch/powerpc/include/asm/kvm_host.h|   8 +
>>  arch/powerpc/include/asm/kvm_ppc.h |   5 +
>>  include/uapi/linux/kvm.h   |   8 +
>>  arch/powerpc/kvm/book3s_64_vio.c   | 302 +
>>  arch/powerpc/kvm/book3s_64_vio_hv.c| 178 +
>>  arch/powerpc/kvm/powerpc.c |   2 +
>>  virt/kvm/vfio.c| 108 +++
>>  8 files changed, 630 insertions(+), 2 deletions(-)
>>
>> diff --git a/Documentation/virtual/kvm/devices/vfio.txt 
>> b/Documentation/virtual/kvm/devices/vfio.txt
>> index ef51740c67ca..ddb5a6512ab3 100644
>> --- a/Documentation/virtual/kvm/devices/vfio.txt
>> +++ b/Documentation/virtual/kvm/devices/vfio.txt
>> @@ -16,7 +16,24 @@ Groups:
>>  
>>  KVM_DEV_VFIO_GROUP attributes:
>>KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
>> +kvm_device_attr.addr points to an int32_t file descriptor
>> +for the VFIO group.
>>KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device tracking
>> +kvm_device_attr.addr points to an int32_t file descriptor
>> +for the VFIO group.
>> +  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: attaches a guest visible TCE table
>> +allocated by sPAPR KVM.
>> +kvm_device_attr.addr points to a struct:
>>  
>> -For each, kvm_device_attr.addr points to an int32_t file descriptor
>> -for the VFIO group.
>> +struct kvm_vfio_spapr_tce {
>> +__u32   argsz;
>> +__s32   groupfd;
>> +__s32   tablefd;
>> +__u8    pad[4];
>> +};
>> +
>> +where
>> +@argsz is the size of struct kvm_vfio_spapr_tce;
>> +@groupfd is a file descriptor for a VFIO group;
>> +@tablefd is a file descriptor for a TCE table allocated via
>> +KVM_CREATE_SPAPR_TCE.
>> diff --git a/arch/powerpc/include/asm/kvm_host.h 
>> b/arch/powerpc/include/asm/kvm_host.h
>> index 28350a294b1e..94774503c70d 100644
>> --- a/arch/powerpc/include/asm/kvm_host.h
>> +++ b/arch/powerpc/include/asm/kvm_host.h
>> @@ -191,6 +191,13 @@ struct kvmppc_pginfo {
>>  atomic_t refcnt;
>>  };
>>  
>> +struct kvmppc_spapr_tce_iommu_table {
>> +struct rcu_head rcu;
>> +struct list_head next;
>> +struct iommu_table *tbl;
>> +atomic_t refs;
>> +};
>> +
>>  struct kvmppc_spapr_tce_table {
>>  struct list_head list;
>>  struct kvm *kvm;
>> @@ -199,6 +206,7 @@ struct kvmppc_spapr_tce_table {
>>  u32 page_shift;
>>  u64 offset; /* in pages */
>>  u64 size; 

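For context, here is a minimal userspace sketch of how a VMM might wire a
VFIO group to a TCE table using the attribute described above. The struct
and attribute names follow the patch; the surrounding file-descriptor
plumbing is illustrative only:

#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Hypothetical fds: kvm_vfio_dev_fd comes from KVM_CREATE_DEVICE
 * (KVM_DEV_TYPE_VFIO), group_fd is a VFIO group fd, and table_fd was
 * returned by KVM_CREATE_SPAPR_TCE. */
static int set_spapr_tce(int kvm_vfio_dev_fd, int group_fd, int table_fd)
{
	struct kvm_vfio_spapr_tce param;
	struct kvm_device_attr attr;

	memset(&param, 0, sizeof(param));
	param.argsz = sizeof(param);
	param.groupfd = group_fd;
	param.tablefd = table_fd;

	memset(&attr, 0, sizeof(attr));
	attr.group = KVM_DEV_VFIO_GROUP;
	attr.attr = KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE;
	attr.addr = (uint64_t)(unsigned long)&param;

	return ioctl(kvm_vfio_dev_fd, KVM_SET_DEVICE_ATTR, &attr);
}
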
[PATCH v2] cxl: prevent read/write to AFU config space while AFU not configured

2016-12-08 Thread Andrew Donnellan
During EEH recovery, we deconfigure all AFUs whilst leaving the
corresponding vPHB and virtual PCI device in place.

If something attempts to interact with the AFU's PCI config space (e.g.
running lspci) after the AFU has been deconfigured and before it's
reconfigured, cxl_pcie_{read,write}_config() will read invalid values from
the deconfigured struct cxl_afu and proceed to Oops when they try to
dereference pointers that have been set to NULL during deconfiguration.

Add a rwsem to struct cxl_afu so we can prevent interaction with config
space while the AFU is deconfigured.

Reported-by: Pradipta Ghosh 
Suggested-by: Frederic Barrat 
Cc: sta...@vger.kernel.org # v4.9+
Signed-off-by: Andrew Donnellan 
Signed-off-by: Vaibhav Jain 

---

v1 -> v2:

* Refactored to avoid locking over function boundaries - we now both lock
and unlock in cxl_pcie_{read,write}_config(), rather than locking in
cxl_pcie_config_info() and unlocking from the caller. Thanks Vaibhav.

* Changed the stable tag to 4.9 rather than 4.4 - by the time this is
merged, 4.9 will have landed, and I'll need to manually backport this for
4.4.
---
 drivers/misc/cxl/cxl.h  |  2 ++
 drivers/misc/cxl/main.c |  3 ++-
 drivers/misc/cxl/pci.c  |  2 ++
 drivers/misc/cxl/vphb.c | 51 -
 4 files changed, 35 insertions(+), 23 deletions(-)

diff --git a/drivers/misc/cxl/cxl.h b/drivers/misc/cxl/cxl.h
index a144073..379c463 100644
--- a/drivers/misc/cxl/cxl.h
+++ b/drivers/misc/cxl/cxl.h
@@ -418,6 +418,8 @@ struct cxl_afu {
struct dentry *debugfs;
struct mutex contexts_lock;
spinlock_t afu_cntl_lock;
+   /* Used to block access to AFU config space while deconfigured */
+   struct rw_semaphore configured_rwsem;
 
/* AFU error buffer fields and bin attribute for sysfs */
u64 eb_len, eb_offset;
diff --git a/drivers/misc/cxl/main.c b/drivers/misc/cxl/main.c
index 62e0dfb..2a6bf1d 100644
--- a/drivers/misc/cxl/main.c
+++ b/drivers/misc/cxl/main.c
@@ -268,7 +268,8 @@ struct cxl_afu *cxl_alloc_afu(struct cxl *adapter, int 
slice)
idr_init(&afu->contexts_idr);
mutex_init(&afu->contexts_lock);
spin_lock_init(&afu->afu_cntl_lock);
-
+   init_rwsem(&afu->configured_rwsem);
+   down_write(&afu->configured_rwsem);
afu->prefault_mode = CXL_PREFAULT_NONE;
afu->irqs_max = afu->adapter->user_irqs;
 
diff --git a/drivers/misc/cxl/pci.c b/drivers/misc/cxl/pci.c
index c4d79b5d..c7b2121 100644
--- a/drivers/misc/cxl/pci.c
+++ b/drivers/misc/cxl/pci.c
@@ -1129,6 +1129,7 @@ static int pci_configure_afu(struct cxl_afu *afu, struct 
cxl *adapter, struct pc
if ((rc = cxl_native_register_psl_irq(afu)))
goto err2;
 
+   up_write(&afu->configured_rwsem);
return 0;
 
 err2:
@@ -1141,6 +1142,7 @@ static int pci_configure_afu(struct cxl_afu *afu, struct 
cxl *adapter, struct pc
 
 static void pci_deconfigure_afu(struct cxl_afu *afu)
 {
+   down_write(&afu->configured_rwsem);
cxl_native_release_psl_irq(afu);
if (afu->adapter->native->sl_ops->release_serr_irq)
afu->adapter->native->sl_ops->release_serr_irq(afu);
diff --git a/drivers/misc/cxl/vphb.c b/drivers/misc/cxl/vphb.c
index 3519ace..639a343 100644
--- a/drivers/misc/cxl/vphb.c
+++ b/drivers/misc/cxl/vphb.c
@@ -76,23 +76,22 @@ static int cxl_pcie_cfg_record(u8 bus, u8 devfn)
return (bus << 8) + devfn;
 }
 
-static int cxl_pcie_config_info(struct pci_bus *bus, unsigned int devfn,
-   struct cxl_afu **_afu, int *_record)
+static inline struct cxl_afu *pci_bus_to_afu(struct pci_bus *bus)
 {
-   struct pci_controller *phb;
-   struct cxl_afu *afu;
-   int record;
+   struct pci_controller *phb = bus ? pci_bus_to_host(bus) : NULL;
 
-   phb = pci_bus_to_host(bus);
-   if (phb == NULL)
-   return PCIBIOS_DEVICE_NOT_FOUND;
+   return phb ? phb->private_data : NULL;
+}
+
+static inline int cxl_pcie_config_info(struct pci_bus *bus, unsigned int devfn,
+  struct cxl_afu *afu, int *_record)
+{
+   int record;
 
-   afu = (struct cxl_afu *)phb->private_data;
record = cxl_pcie_cfg_record(bus->number, devfn);
if (record > afu->crs_num)
return PCIBIOS_DEVICE_NOT_FOUND;
 
-   *_afu = afu;
*_record = record;
return 0;
 }
@@ -106,9 +105,14 @@ static int cxl_pcie_read_config(struct pci_bus *bus, 
unsigned int devfn,
u16 val16;
u32 val32;
 
-   rc = cxl_pcie_config_info(bus, devfn, &afu, &record);
+   afu = pci_bus_to_afu(bus);
+   /* Grab a reader lock on afu. */
+   if (afu == NULL || !down_read_trylock(&afu->configured_rwsem))
+   return PCIBIOS_DEVICE_NOT_FOUND;
+
+   rc = cxl_pcie_config_info(bus, devfn, afu, &record);
if (rc)
-   return rc;
+   

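The quoted diff is cut off in the archive before the error path, but the
read-side shape being introduced is worth spelling out: every exit from
cxl_pcie_read_config() taken after a successful down_read_trylock() must
drop the semaphore again. A sketch of that shape (illustrative only, not
the literal patch body):

/* Sketch: the config-space access itself is elided. */
static int cxl_pcie_read_config_sketch(struct pci_bus *bus, unsigned int devfn,
				       int offset, int len, u32 *val)
{
	struct cxl_afu *afu = pci_bus_to_afu(bus);
	int record, rc;

	if (afu == NULL || !down_read_trylock(&afu->configured_rwsem))
		return PCIBIOS_DEVICE_NOT_FOUND;

	rc = cxl_pcie_config_info(bus, devfn, afu, &record);
	if (rc == 0) {
		/* ... perform the actual read for this record ... */
	}

	up_read(&afu->configured_rwsem);	/* pairs with the trylock above */
	return rc;
}
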
Re: [PATCH] ibmvscsi: add write memory barrier to CRQ processing

2016-12-08 Thread Tyrel Datwyler
On 12/08/2016 01:06 AM, Johannes Thumshirn wrote:
> On Wed, Dec 07, 2016 at 05:31:26PM -0600, Tyrel Datwyler wrote:
>> The first byte of each CRQ entry is used to indicate whether an entry is
>> a valid response or free for the VIOS to use. After processing a
>> response the driver sets the valid byte to zero to indicate the entry is
>> now free to be reused. Add a memory barrier after this write to ensure
>> no other stores are reordered when updating the valid byte.
>>
>> Signed-off-by: Tyrel Datwyler 
>> ---
>>  drivers/scsi/ibmvscsi/ibmvscsi.c | 2 ++
>>  1 file changed, 2 insertions(+)
>>
>> diff --git a/drivers/scsi/ibmvscsi/ibmvscsi.c 
>> b/drivers/scsi/ibmvscsi/ibmvscsi.c
>> index d9534ee..2f5b07e 100644
>> --- a/drivers/scsi/ibmvscsi/ibmvscsi.c
>> +++ b/drivers/scsi/ibmvscsi/ibmvscsi.c
>> @@ -232,6 +232,7 @@ static void ibmvscsi_task(void *data)
>>  while ((crq = crq_queue_next_crq(&hostdata->queue)) != NULL) {
>>  ibmvscsi_handle_crq(crq, hostdata);
>>  crq->valid = VIOSRP_CRQ_FREE;
>> +wmb();
>>  }
>>  
>>  vio_enable_interrupts(vdev);
>> @@ -240,6 +241,7 @@ static void ibmvscsi_task(void *data)
>>  vio_disable_interrupts(vdev);
>>  ibmvscsi_handle_crq(crq, hostdata);
>>  crq->valid = VIOSRP_CRQ_FREE;
>> +wmb();
>>  } else {
>>  done = 1;
>>  }
> 
> Is this something you have seen in the wild or just a "better safe than sorry"
> barrier?

I myself have not observed or heard of anybody hitting an issue here.
However, based on conversation with the VIOS developers, who have
indicated it is required, this is a "better safe than sorry" scenario.
Further, it matches what we already do in the ibmvfc driver for the CRQ
processing logic.

-Tyrel

> 
> Thanks,
>   Johannes
> 



Re: [PATCH] ibmvscsi: add write memory barrier to CRQ processing

2016-12-08 Thread Tyrel Datwyler
On 12/08/2016 03:29 PM, Paolo Bonzini wrote:
> 
> 
> On 08/12/2016 00:31, Tyrel Datwyler wrote:
>> The first byte of each CRQ entry is used to indicate whether an entry is
>> a valid response or free for the VIOS to use. After processing a
>> response the driver sets the valid byte to zero to indicate the entry is
>> now free to be reused. Add a memory barrier after this write to ensure
>> no other stores are reordered when updating the valid byte.
>>
>> Signed-off-by: Tyrel Datwyler 
>> ---
>>  drivers/scsi/ibmvscsi/ibmvscsi.c | 2 ++
>>  1 file changed, 2 insertions(+)
>>
>> diff --git a/drivers/scsi/ibmvscsi/ibmvscsi.c 
>> b/drivers/scsi/ibmvscsi/ibmvscsi.c
>> index d9534ee..2f5b07e 100644
>> --- a/drivers/scsi/ibmvscsi/ibmvscsi.c
>> +++ b/drivers/scsi/ibmvscsi/ibmvscsi.c
>> @@ -232,6 +232,7 @@ static void ibmvscsi_task(void *data)
>>  while ((crq = crq_queue_next_crq(&hostdata->queue)) != NULL) {
>>  ibmvscsi_handle_crq(crq, hostdata);
>>  crq->valid = VIOSRP_CRQ_FREE;
>> +wmb();
>>  }
>>  
>>  vio_enable_interrupts(vdev);
>> @@ -240,6 +241,7 @@ static void ibmvscsi_task(void *data)
>>  vio_disable_interrupts(vdev);
>>  ibmvscsi_handle_crq(crq, hostdata);
>>  crq->valid = VIOSRP_CRQ_FREE;
>> +wmb();
> 
> Should this driver use virt_wmb instead?

Both virt_wmb and wmb reduce to a lwsync instruction under PowerPC.

-Tyrel

> 
> Paolo
> 
>>  } else {
>>  done = 1;
>>  }
>>



Re: [PATCH 3/3] powerpc: enable support for GCC plugins

2016-12-08 Thread Andrew Donnellan

On 09/12/16 05:06, Kees Cook wrote:

i don't think that this is the right approach. there's a general and a special
issue here, both of which need different handling.

the general problem is to detect problems related to gcc plugin headers and
notify the users about solutions. emitting various messages from a Makefile
is certainly not a scalable approach, just imagine how it will look when the
other 30+ archs begin to add their own special cases... if anything, they
should be documented in Documentation/gcc-plugins.txt (or a new doc if it
grows too big) and the Makefile message should just point at it.


I think I agree in principle - Makefiles are already unreadable enough 
without a million special cases.



as for the solutions, the general advice should enable the use of otherwise
failing gcc versions instead of forcing updating to new ones (though the
latter is advisable for other reasons but not everyone's in the position to
do so easily). in my experience all one needs to do is manually install the
missing files from the gcc sources (ideally distros would take care of it).


If someone else is willing to write up that advice, then great.


the specific problem addressed here can (and IMHO should) be solved in
another way: remove the inclusion of the offending headers in gcc-common.h
as neither tm.h nor c-common.h are needed by existing plugins. for background,


We can't build without tm.h: http://pastebin.com/W0azfCr0

And we get warnings without c-common.h: http://pastebin.com/Aw8CAj10


as for the location of c-common.h, upstream gcc moved it under c-family in
2010 after the release of 4.5, so it should be where gcc-common.h expects
it and i'm not sure how it ended up at its old location for you.


That is rather odd. What distro was the PPC test done on? (Or were
these manually built gcc versions?)


These were all manually built using a script running on a Debian box. 
Installing precompiled distro versions of rather old gccs would have 
been somewhat challenging. I've just rebuilt 4.6.4 to double check that 
I wasn't just seeing things, but it seems that it definitely is still 
putting c-common.h in the old location.


--
Andrew Donnellan  OzLabs, ADL Canberra
andrew.donnel...@au1.ibm.com  IBM Australia Limited



Re: [PATCH v2] PCI: designware: add host_init error handling

2016-12-08 Thread Jaehoon Chung
Hi Srinivas,


On 12/07/2016 07:32 PM, Srinivas Kandagatla wrote:
> This patch add support to return value from host_init() callback from drivers,
> so that the designware libary can handle or pass it to proper place. Issue 
> with
> void return type is that errors or error handling within host_init() callback
> are never know to designware code, which could go ahead and access registers
> even in error cases.
> 
> A typical case in the qcom controller driver is turning off clocks on
> error; if the designware code continues to read/write registers when the
> clocks are turned off, the board reboots/locks up.

I've added a comment on a minor thing below.
I agree with this approach.

> 
> Signed-off-by: Srinivas Kandagatla 
> ---
> Currently the designware code does not have a way to return errors generated
> as part of the host_init() callback in controller drivers. This is an issue
> with controller drivers like qcom which turn off the clocks in the error
> handling path. As the dw core is unaware of this, it would continue to
> access registers, which faults, resulting in board reboots/hangs.
> 
> There are two ways to solve this issue:
> one is to remove the error handling in the qcom controller host_init()
> function, the other is to handle the error and pass it back to the dw core
> code, which would then pass it back to the controller driver as part of the
> dw_pcie_host_init() return value.
> 
> The second option seems the more sensible and correct way to fix the issue,
> so this patch does that.
> 
> As part of this change to the host_init() return type I had to patch other
> host controller drivers which use the dw core. Most of the changes to other
> drivers are to return proper error codes to the upper layer.
> The other drivers are only compile tested.
> 
> Changes since RFC:
>   - Add error handling to other drivers as suggested by Joao Pinto
> 
>  drivers/pci/host/pci-dra7xx.c   | 10 --
>  drivers/pci/host/pci-exynos.c   | 10 --
>  drivers/pci/host/pci-imx6.c | 10 --
>  drivers/pci/host/pci-keystone.c | 10 --
>  drivers/pci/host/pci-layerscape.c   | 22 +-
>  drivers/pci/host/pcie-armada8k.c|  4 +++-
>  drivers/pci/host/pcie-designware-plat.c | 10 --
>  drivers/pci/host/pcie-designware.c  |  4 +++-
>  drivers/pci/host/pcie-designware.h  |  2 +-
>  drivers/pci/host/pcie-qcom.c|  5 +++--
>  drivers/pci/host/pcie-spear13xx.c   | 10 --
>  11 files changed, 71 insertions(+), 26 deletions(-)
> 
> diff --git a/drivers/pci/host/pci-dra7xx.c b/drivers/pci/host/pci-dra7xx.c
> index 9595fad..811f0f9 100644
> --- a/drivers/pci/host/pci-dra7xx.c
> +++ b/drivers/pci/host/pci-dra7xx.c
> @@ -127,9 +127,10 @@ static void dra7xx_pcie_enable_interrupts(struct 
> dra7xx_pcie *dra7xx)
>  LEG_EP_INTERRUPTS);
>  }
>  
> -static void dra7xx_pcie_host_init(struct pcie_port *pp)
> +static int dra7xx_pcie_host_init(struct pcie_port *pp)
>  {
>   struct dra7xx_pcie *dra7xx = to_dra7xx_pcie(pp);
> + int ret;
>  
>   pp->io_base &= DRA7XX_CPU_TO_BUS_ADDR;
>   pp->mem_base &= DRA7XX_CPU_TO_BUS_ADDR;
> @@ -138,10 +139,15 @@ static void dra7xx_pcie_host_init(struct pcie_port *pp)
>  
>   dw_pcie_setup_rc(pp);
>  
> - dra7xx_pcie_establish_link(dra7xx);
> + ret = dra7xx_pcie_establish_link(dra7xx);
> + if (ret < 0)
> + return ret;
> +
>   if (IS_ENABLED(CONFIG_PCI_MSI))
>   dw_pcie_msi_init(pp);
>   dra7xx_pcie_enable_interrupts(dra7xx);
> +
> + return 0;
>  }
>  
>  static struct pcie_host_ops dra7xx_pcie_host_ops = {
> diff --git a/drivers/pci/host/pci-exynos.c b/drivers/pci/host/pci-exynos.c
> index f1c544b..c116fd9 100644
> --- a/drivers/pci/host/pci-exynos.c
> +++ b/drivers/pci/host/pci-exynos.c
> @@ -458,12 +458,18 @@ static int exynos_pcie_link_up(struct pcie_port *pp)
>   return 0;
>  }
>  
> -static void exynos_pcie_host_init(struct pcie_port *pp)
> +static int exynos_pcie_host_init(struct pcie_port *pp)
>  {
>   struct exynos_pcie *exynos_pcie = to_exynos_pcie(pp);
> + int ret;
> +
> + ret = exynos_pcie_establish_link(exynos_pcie);
> + if (ret < 0)
> + return ret;
>  
> - exynos_pcie_establish_link(exynos_pcie);
>   exynos_pcie_enable_interrupts(exynos_pcie);
> +
> + return 0;
>  }
>  
>  static struct pcie_host_ops exynos_pcie_host_ops = {
> diff --git a/drivers/pci/host/pci-imx6.c b/drivers/pci/host/pci-imx6.c
> index c8cefb0..1251e92 100644
> --- a/drivers/pci/host/pci-imx6.c
> +++ b/drivers/pci/host/pci-imx6.c
> @@ -550,18 +550,24 @@ static int imx6_pcie_establish_link(struct imx6_pcie 
> *imx6_pcie)
>   return ret;
>  }
>  
> -static void imx6_pcie_host_init(struct pcie_port *pp)
> +static int imx6_pcie_host_init(struct pcie_port *pp)
>  {
>   struct imx6_pcie *imx6_pcie = to_imx6_pcie(pp);
> + int ret;
>  
>   imx6_pcie_assert_core_reset(imx6_pcie);
>   

[PATCHv2 1/4] pseries: Add hypercall wrappers for hash page table resizing

2016-12-08 Thread David Gibson
This adds the hypercall numbers and wrapper functions for the hash page
table resizing hypercalls.

These are experimental "platform specific" values for now, until we have a
formal PAPR update.

It also adds a new firmware feature flag to track the presence of the
HPT resizing calls.

Signed-off-by: David Gibson 
Reviewed-by: Paul Mackerras 
---
 arch/powerpc/include/asm/firmware.h   |  5 +++--
 arch/powerpc/include/asm/hvcall.h |  4 +++-
 arch/powerpc/include/asm/plpar_wrappers.h | 12 
 arch/powerpc/platforms/pseries/firmware.c |  1 +
 4 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/firmware.h 
b/arch/powerpc/include/asm/firmware.h
index 1e0b5a5..8645897 100644
--- a/arch/powerpc/include/asm/firmware.h
+++ b/arch/powerpc/include/asm/firmware.h
@@ -42,7 +42,7 @@
 #define FW_FEATURE_SPLPAR  ASM_CONST(0x0010)
#define FW_FEATURE_LPAR    ASM_CONST(0x0040)
#define FW_FEATURE_PS3_LV1 ASM_CONST(0x0080)
-/* Free    ASM_CONST(0x0100) */
+#define FW_FEATURE_HPT_RESIZE  ASM_CONST(0x0100)
#define FW_FEATURE_CMO ASM_CONST(0x0200)
#define FW_FEATURE_VPHN    ASM_CONST(0x0400)
#define FW_FEATURE_XCMO    ASM_CONST(0x0800)
@@ -66,7 +66,8 @@ enum {
FW_FEATURE_MULTITCE | FW_FEATURE_SPLPAR | FW_FEATURE_LPAR |
FW_FEATURE_CMO | FW_FEATURE_VPHN | FW_FEATURE_XCMO |
FW_FEATURE_SET_MODE | FW_FEATURE_BEST_ENERGY |
-   FW_FEATURE_TYPE1_AFFINITY | FW_FEATURE_PRRN,
+   FW_FEATURE_TYPE1_AFFINITY | FW_FEATURE_PRRN |
+   FW_FEATURE_HPT_RESIZE,
FW_FEATURE_PSERIES_ALWAYS = 0,
FW_FEATURE_POWERNV_POSSIBLE = FW_FEATURE_OPAL,
FW_FEATURE_POWERNV_ALWAYS = 0,
diff --git a/arch/powerpc/include/asm/hvcall.h 
b/arch/powerpc/include/asm/hvcall.h
index 708edeb..9b7ff7c 100644
--- a/arch/powerpc/include/asm/hvcall.h
+++ b/arch/powerpc/include/asm/hvcall.h
@@ -275,7 +275,9 @@
 #define H_COP  0x304
#define H_GET_MPP_X    0x314
 #define H_SET_MODE 0x31C
-#define MAX_HCALL_OPCODE   H_SET_MODE
+#define H_RESIZE_HPT_PREPARE   0x36C
+#define H_RESIZE_HPT_COMMIT0x370
+#define MAX_HCALL_OPCODE   H_RESIZE_HPT_COMMIT
 
 /* H_VIOCTL functions */
 #define H_GET_VIOA_DUMP_SIZE   0x01
diff --git a/arch/powerpc/include/asm/plpar_wrappers.h 
b/arch/powerpc/include/asm/plpar_wrappers.h
index 1b39424..b7ee6d9 100644
--- a/arch/powerpc/include/asm/plpar_wrappers.h
+++ b/arch/powerpc/include/asm/plpar_wrappers.h
@@ -242,6 +242,18 @@ static inline long plpar_pte_protect(unsigned long flags, 
unsigned long ptex,
return plpar_hcall_norets(H_PROTECT, flags, ptex, avpn);
 }
 
+static inline long plpar_resize_hpt_prepare(unsigned long flags,
+   unsigned long shift)
+{
+   return plpar_hcall_norets(H_RESIZE_HPT_PREPARE, flags, shift);
+}
+
+static inline long plpar_resize_hpt_commit(unsigned long flags,
+  unsigned long shift)
+{
+   return plpar_hcall_norets(H_RESIZE_HPT_COMMIT, flags, shift);
+}
+
 static inline long plpar_tce_get(unsigned long liobn, unsigned long ioba,
unsigned long *tce_ret)
 {
diff --git a/arch/powerpc/platforms/pseries/firmware.c 
b/arch/powerpc/platforms/pseries/firmware.c
index ea7f09b..658c02d 100644
--- a/arch/powerpc/platforms/pseries/firmware.c
+++ b/arch/powerpc/platforms/pseries/firmware.c
@@ -64,6 +64,7 @@ hypertas_fw_features_table[] = {
{FW_FEATURE_VPHN,   "hcall-vphn"},
{FW_FEATURE_SET_MODE,   "hcall-set-mode"},
{FW_FEATURE_BEST_ENERGY,"hcall-best-energy-1*"},
+   {FW_FEATURE_HPT_RESIZE, "hcall-hpt-resize"},
 };
 
 /* Build up the firmware features bitmask using the contents of
-- 
2.9.3



[PATCHv2 2/4] pseries: Add support for hash table resizing

2016-12-08 Thread David Gibson
This adds support for using experimental hypercalls to change the size
of the main hash page table while running as a PAPR guest.  For now these
hypercalls are only in experimental qemu versions.

The interface is two-part: first H_RESIZE_HPT_PREPARE is used to allocate
and prepare the new hash table.  This may be slow, but can be done
asynchronously.  Then, H_RESIZE_HPT_COMMIT is used to switch to the new
hash table.  This requires that no CPUs be concurrently updating the HPT,
and so must be run under stop_machine().
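For reference, the usual shape of that step is a stop_machine() call with
the commit routine as the callback; the names below follow the patch, but
since the quoted diff is truncated the exact call site is an assumption:

#include <linux/stop_machine.h>

/* Run the H_RESIZE_HPT_COMMIT step with every other CPU quiesced, so no
 * CPU can be updating the HPT while the hypervisor switches tables. */
static int do_hpt_commit(struct hpt_resize_state *state)
{
	return stop_machine(pseries_lpar_resize_hpt_commit, state, NULL);
}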

This also adds a debugfs file which can be used to manually control
HPT resizing, for testing purposes.

Signed-off-by: David Gibson 
Reviewed-by: Paul Mackerras 
---
 arch/powerpc/include/asm/book3s/64/mmu-hash.h |   1 +
 arch/powerpc/mm/hash_utils_64.c   |  32 
 arch/powerpc/platforms/pseries/lpar.c | 110 ++
 3 files changed, 143 insertions(+)

diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h 
b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
index e407af2..efba649 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
@@ -153,6 +153,7 @@ struct mmu_hash_ops {
   unsigned long addr,
   unsigned char *hpte_slot_array,
   int psize, int ssize, int local);
+   int (*resize_hpt)(unsigned long shift);
/*
 * Special for kexec.
 * To be called in real mode with interrupts disabled. No locks are
diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index 78dabf06..61ce96c 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -35,6 +35,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -1815,3 +1816,34 @@ void hash__setup_initial_memory_limit(phys_addr_t 
first_memblock_base,
/* Finally limit subsequent allocations */
memblock_set_current_limit(ppc64_rma_size);
 }
+
+#ifdef CONFIG_DEBUG_FS
+
+static int ppc64_pft_size_get(void *data, u64 *val)
+{
+   *val = ppc64_pft_size;
+   return 0;
+}
+
+static int ppc64_pft_size_set(void *data, u64 val)
+{
+   if (!mmu_hash_ops.resize_hpt)
+   return -ENODEV;
+   return mmu_hash_ops.resize_hpt(val);
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(fops_ppc64_pft_size,
+   ppc64_pft_size_get, ppc64_pft_size_set, "%llu\n");
+
+static int __init hash64_debugfs(void)
+{
+   if (!debugfs_create_file("pft-size", 0600, powerpc_debugfs_root,
+NULL, &fops_ppc64_pft_size)) {
+   pr_err("lpar: unable to create ppc64_pft_size debugsfs file\n");
+   }
+
+   return 0;
+}
+machine_device_initcall(pseries, hash64_debugfs);
+
+#endif /* CONFIG_DEBUG_FS */
diff --git a/arch/powerpc/platforms/pseries/lpar.c 
b/arch/powerpc/platforms/pseries/lpar.c
index aa35245..5f0cee3 100644
--- a/arch/powerpc/platforms/pseries/lpar.c
+++ b/arch/powerpc/platforms/pseries/lpar.c
@@ -27,6 +27,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include 
 #include 
 #include 
@@ -589,6 +591,113 @@ static int __init disable_bulk_remove(char *str)
 
 __setup("bulk_remove=", disable_bulk_remove);
 
+#define HPT_RESIZE_TIMEOUT 1 /* ms */
+
+struct hpt_resize_state {
+   unsigned long shift;
+   int commit_rc;
+};
+
+static int pseries_lpar_resize_hpt_commit(void *data)
+{
+   struct hpt_resize_state *state = data;
+
+   state->commit_rc = plpar_resize_hpt_commit(0, state->shift);
+   if (state->commit_rc != H_SUCCESS)
+   return -EIO;
+
+   /* Hypervisor has transitioned the HTAB, update our globals */
+   ppc64_pft_size = state->shift;
+   htab_size_bytes = 1UL << ppc64_pft_size;
+   htab_hash_mask = (htab_size_bytes >> 7) - 1;
+
+   return 0;
+}
+
+/* Must be called in user context */
+static int pseries_lpar_resize_hpt(unsigned long shift)
+{
+   struct hpt_resize_state state = {
+   .shift = shift,
+   .commit_rc = H_FUNCTION,
+   };
+   unsigned int delay, total_delay = 0;
+   int rc;
+   ktime_t t0, t1, t2;
+
+   might_sleep();
+
+   if (!firmware_has_feature(FW_FEATURE_HPT_RESIZE))
+   return -ENODEV;
+
+   printk(KERN_INFO "lpar: Attempting to resize HPT to shift %lu\n",
+  shift);
+
+   t0 = ktime_get();
+
+   rc = plpar_resize_hpt_prepare(0, shift);
+   while (H_IS_LONG_BUSY(rc)) {
+   delay = get_longbusy_msecs(rc);
+   total_delay += delay;
+   if (total_delay > HPT_RESIZE_TIMEOUT) {
+   /* prepare call with shift==0 cancels an
+* in-progress resize */
+   rc = plpar_resize_hpt_prepare(0, 0);
+   if (rc != 

[PATCHv2 4/4] pseries: Automatically resize HPT for memory hot add/remove

2016-12-08 Thread David Gibson
We've now implemented code in the pseries platform to use the new PAPR
interface to allow resizing the hash page table (HPT) at runtime.

This patch uses that interface to automatically attempt to resize the HPT
when memory is hot added or removed.  This tries to always keep the HPT at
a reasonable size for our current memory size.

Signed-off-by: David Gibson 
Reviewed-by: Paul Mackerras 
---
 arch/powerpc/include/asm/sparsemem.h |  1 +
 arch/powerpc/mm/hash_utils_64.c  | 29 +
 arch/powerpc/mm/mem.c|  4 
 3 files changed, 34 insertions(+)

diff --git a/arch/powerpc/include/asm/sparsemem.h 
b/arch/powerpc/include/asm/sparsemem.h
index f6fc0ee..737335c 100644
--- a/arch/powerpc/include/asm/sparsemem.h
+++ b/arch/powerpc/include/asm/sparsemem.h
@@ -16,6 +16,7 @@
 #endif /* CONFIG_SPARSEMEM */
 
 #ifdef CONFIG_MEMORY_HOTPLUG
+extern void resize_hpt_for_hotplug(unsigned long new_mem_size);
 extern int create_section_mapping(unsigned long start, unsigned long end);
 extern int remove_section_mapping(unsigned long start, unsigned long end);
 #ifdef CONFIG_NUMA
diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index 61ce96c..abb4301 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -748,6 +748,35 @@ static unsigned long __init htab_get_table_size(void)
 }
 
 #ifdef CONFIG_MEMORY_HOTPLUG
+void resize_hpt_for_hotplug(unsigned long new_mem_size)
+{
+   unsigned target_hpt_shift;
+
+   if (!mmu_hash_ops.resize_hpt)
+   return;
+
+   target_hpt_shift = htab_shift_for_mem_size(new_mem_size);
+
+   /*
+* To avoid lots of HPT resizes if memory size is fluctuating
+* across a boundary, we deliberately have some hysteresis
+* here: we immediately increase the HPT size if the target
+* shift exceeds the current shift, but we won't attempt to
+* reduce unless the target shift is at least 2 below the
+* current shift
+*/
+   if ((target_hpt_shift > ppc64_pft_size)
+   || (target_hpt_shift < (ppc64_pft_size - 1))) {
+   int rc;
+
+   rc = mmu_hash_ops.resize_hpt(target_hpt_shift);
+   if (rc)
+   printk(KERN_WARNING
+  "Unable to resize hash page table to target 
order %d: %d\n",
+  target_hpt_shift, rc);
+   }
+}
+
 int create_section_mapping(unsigned long start, unsigned long end)
 {
int rc = htab_bolt_mapping(start, end, __pa(start),
diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index 5f84433..9ee536e 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -134,6 +134,8 @@ int arch_add_memory(int nid, u64 start, u64 size, bool 
for_device)
unsigned long nr_pages = size >> PAGE_SHIFT;
int rc;
 
+   resize_hpt_for_hotplug(memblock_phys_mem_size());
+
pgdata = NODE_DATA(nid);
 
start = (unsigned long)__va(start);
@@ -174,6 +176,8 @@ int arch_remove_memory(u64 start, u64 size)
 */
vm_unmap_aliases();
 
+   resize_hpt_for_hotplug(memblock_phys_mem_size());
+
return ret;
 }
 #endif
-- 
2.9.3
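To make the hysteresis in resize_hpt_for_hotplug() above concrete, the
decision rule can be pulled out into a standalone predicate with a worked
example (this mirrors the condition in the patch; the helper itself is
illustrative):

#include <stdbool.h>

/* Grow as soon as the target shift exceeds the current one, but only
 * shrink once the target is at least 2 below it, so a memory size that
 * fluctuates across a boundary does not cause resize flapping. */
static bool hpt_should_resize(unsigned int cur_shift, unsigned int target_shift)
{
	return target_shift > cur_shift || target_shift < cur_shift - 1;
}

/* With cur_shift == 30:
 *   target 31 -> resize (grow immediately)
 *   target 30 -> keep
 *   target 29 -> keep (inside the hysteresis band)
 *   target 28 -> resize (shrink)
 */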



[PATCHv2 3/4] pseries: Advertise HPT resizing support via CAS

2016-12-08 Thread David Gibson
The hypervisor needs to know a guest is capable of using the HPT resizing
PAPR extension in order to take full advantage of it for memory hotplug.

If the hypervisor knows the guest is HPT resize aware, it can size the
initial HPT based on the initial guest RAM size, relying on the guest to
resize the HPT when more memory is hot-added.  Without this, the hypervisor
must size the HPT for the maximum possible guest RAM, which can lead to
a huge waste of space if the guest never actually expands to that maximum
size.

This patch advertises the guest's support for HPT resizing via the
ibm,client-architecture-support OF interface.  We use bit 5 of byte 6 of
option vector 5 for this purpose (tentatively assigned in an in-progress PAPR
change request).

Signed-off-by: David Gibson 
Reviewed-by: Anshuman Khandual 
Reviewed-by: Paul Mackerras 
---
 arch/powerpc/include/asm/prom.h | 1 +
 arch/powerpc/kernel/prom_init.c | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/prom.h b/arch/powerpc/include/asm/prom.h
index 7f436ba..94c92bb 100644
--- a/arch/powerpc/include/asm/prom.h
+++ b/arch/powerpc/include/asm/prom.h
@@ -151,6 +151,7 @@ struct of_drconf_cell {
 #define OV5_XCMO   0x0440  /* Page Coalescing */
 #define OV5_TYPE1_AFFINITY 0x0580  /* Type 1 NUMA affinity */
 #define OV5_PRRN   0x0540  /* Platform Resource Reassignment */
+#define OV5_RESIZE_HPT 0x0601  /* Hash Page Table resizing */
 #define OV5_PFO_HW_RNG 0x0E80  /* PFO Random Number Generator */
 #define OV5_PFO_HW_842 0x0E40  /* PFO Compression Accelerator */
 #define OV5_PFO_HW_ENCR0x0E20  /* PFO Encryption Accelerator */
diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
index 88ac964..9942d9f 100644
--- a/arch/powerpc/kernel/prom_init.c
+++ b/arch/powerpc/kernel/prom_init.c
@@ -713,7 +713,7 @@ unsigned char ibm_architecture_vec[] = {
0,
 #endif
OV5_FEAT(OV5_TYPE1_AFFINITY) | OV5_FEAT(OV5_PRRN),
-   0,
+   OV5_FEAT(OV5_RESIZE_HPT),
0,
0,
/* WARNING: The offset of the "number of cores" field below
-- 
2.9.3
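A note on the OV5_* encoding used above, inferred from the existing
defines rather than stated in the patch: the high byte of each constant is
the byte index into option vector 5 and the low byte is the bit mask
within that byte, with OV5_FEAT() extracting the mask when the vector is
assembled:

/* Inferred convention (assumption, matching the defines quoted above): */
#define OV5_FEAT(x)	((x) & 0xff)	/* low byte = bit mask */

/* OV5_RESIZE_HPT is 0x0601, i.e. byte 6 of option vector 5, mask 0x01,
 * so the patch's change to ibm_architecture_vec[] sets exactly that bit:
 *   OV5_FEAT(OV5_RESIZE_HPT) == 0x01
 */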



[PATCHv2 0/4] Hash Page Table resizing for PAPR guests

2016-12-08 Thread David Gibson
This series implements the guest side of a PAPR ACR which allows a
POWER guest's Hashed Page Table (HPT) to be resized at runtime.  This
is useful when a guest has a very large theoretical maximum RAM, but
is likely to only ever be expanded to a modest amount of RAM in
practice.  Without resizing the HPT has to be sized for the maximum
possible guest RAM, which can be very wasteful if that maximum is
never reached.

To use this requires a hypervisor/host which also supports the PAPR
extension.  The only implementation so far is my qemu branch at
https://github.com/dgibson/qemu/tree/upstream/hpt-resize

I expect to merge that code to upstream qemu for qemu-2.9.  Note that
HPT resizing will so far only work for TCG guests; KVM support is in
the works.  The guest side code here will not require changing for
KVM, however.

An HPT resize can be triggered in one of two ways:
* /sys/kernel/debug/powerpc/pft-size

This debugfs file contains the current size of the HPT (as
encoded in the ibm,pft-size device tree property).  Writing to
it will cause the guest to attempt an HPT resize to the given
value.  Note that the current qemu implementation will not
allow the guest to resize the HPT to more than 1/64th of guest
RAM size.

* Automatically on memory hotplug / unplug

With these patches applied, the guest will automatically
attempt to resize its HPT when its RAM size changes due to
hotplug events.  (When hot-adding RAM, qemu considers the new
size for purposes of the limit mentioned above, so this method
can grow the HPT larger than the first one can.)
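
As a concrete illustration of the debugfs method, a small C program could
read the current shift and request a resize (a sketch; error handling is
trimmed, and root plus a mounted debugfs are assumed):

#include <stdio.h>

static int request_hpt_resize(unsigned long shift)
{
	const char *path = "/sys/kernel/debug/powerpc/pft-size";
	unsigned long cur;
	FILE *f = fopen(path, "r+");

	if (!f)
		return -1;
	if (fscanf(f, "%lu", &cur) == 1)
		printf("current HPT shift: %lu\n", cur);
	rewind(f);
	fprintf(f, "%lu\n", shift);	/* triggers the resize attempt */
	fclose(f);
	return 0;
}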

Changes since v1:
  * Remove two patches which belong with the (upcoming) host side
support rather than guest side (and therefore also should go via
the kvm tree instead of the powerpc tree)
  * Protected the debugfs code with #ifdef CONFIG_DEBUG_FS.  Couldn't
actually get a compile error myself, but the KVM buildbot did get
one.

David Gibson (4):
  pseries: Add hypercall wrappers for hash page table resizing
  pseries: Add support for hash table resizing
  pseries: Advertise HPT resizing support via CAS
  pseries: Automatically resize HPT for memory hot add/remove

 arch/powerpc/include/asm/book3s/64/mmu-hash.h |   1 +
 arch/powerpc/include/asm/firmware.h   |   5 +-
 arch/powerpc/include/asm/hvcall.h |   4 +-
 arch/powerpc/include/asm/plpar_wrappers.h |  12 +++
 arch/powerpc/include/asm/prom.h   |   1 +
 arch/powerpc/include/asm/sparsemem.h  |   1 +
 arch/powerpc/kernel/prom_init.c   |   2 +-
 arch/powerpc/mm/hash_utils_64.c   |  61 ++
 arch/powerpc/mm/mem.c |   4 +
 arch/powerpc/platforms/pseries/firmware.c |   1 +
 arch/powerpc/platforms/pseries/lpar.c | 110 ++
 11 files changed, 198 insertions(+), 4 deletions(-)

-- 
2.9.3



Re: [PATCH] ibmvscsi: add write memory barrier to CRQ processing

2016-12-08 Thread Paolo Bonzini


On 08/12/2016 00:31, Tyrel Datwyler wrote:
> The first byte of each CRQ entry is used to indicate whether an entry is
> a valid response or free for the VIOS to use. After processing a
> response the driver sets the valid byte to zero to indicate the entry is
> now free to be reused. Add a memory barrier after this write to ensure
> no other stores are reordered when updating the valid byte.
> 
> Signed-off-by: Tyrel Datwyler 
> ---
>  drivers/scsi/ibmvscsi/ibmvscsi.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/scsi/ibmvscsi/ibmvscsi.c 
> b/drivers/scsi/ibmvscsi/ibmvscsi.c
> index d9534ee..2f5b07e 100644
> --- a/drivers/scsi/ibmvscsi/ibmvscsi.c
> +++ b/drivers/scsi/ibmvscsi/ibmvscsi.c
> @@ -232,6 +232,7 @@ static void ibmvscsi_task(void *data)
>   while ((crq = crq_queue_next_crq(&hostdata->queue)) != NULL) {
>   ibmvscsi_handle_crq(crq, hostdata);
>   crq->valid = VIOSRP_CRQ_FREE;
> + wmb();
>   }
>  
>   vio_enable_interrupts(vdev);
> @@ -240,6 +241,7 @@ static void ibmvscsi_task(void *data)
>   vio_disable_interrupts(vdev);
>   ibmvscsi_handle_crq(crq, hostdata);
>   crq->valid = VIOSRP_CRQ_FREE;
> + wmb();

Should this driver use virt_wmb instead?

Paolo

>   } else {
>   done = 1;
>   }
> 


Re: [PATCH 1/2] ibmvscsi: add vscsi hosts to global list_head

2016-12-08 Thread Martin K. Petersen
> "Tyrel" == Tyrel Datwyler  writes:

Tyrel> Add each vscsi host adapter to a new global list_head named
Tyrel> ibmvscsi_head. There is no functional change. This is meant
Tyrel> primarily as a convenience for locating adapters from within the
Tyrel> debugger or crash utility.

Applied 1+2 to 4.10/scsi-queue.

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH] powerpc: Fix LPCR_VRMASD definition

2016-12-08 Thread Paul Mackerras
On Thu, Dec 08, 2016 at 11:29:30AM +0800, Jia He wrote:
> Fixes: a4b349540a ("powerpc/mm: Cleanup LPCR defines")
> Signed-off-by: Jia He 

Acked-by: Paul Mackerras 


Re: [PATCH] powerpc/mm: Fixup wrong LPCR_VRMASD value

2016-12-08 Thread Paul Mackerras
On Thu, Dec 08, 2016 at 09:12:13AM +0530, Aneesh Kumar K.V wrote:
> In commit a4b349540a26af ("powerpc/mm: Cleanup LPCR defines") we updated
> LPCR_VRMASD wrongly as below.
> 
> -#define   LPCR_VRMASD  (0x1ful << (63-16))
> +#define   LPCR_VRMASD_SH   47
> +#define   LPCR_VRMASD  (ASM_CONST(1) << LPCR_VRMASD_SH)
> 
> We initialize the VRMA bits in LPCR to 0x00 in kvm. Hence using a different
> mask value as above while updating lpcr should not have any impact.
> 
> This patch updates it to the correct value.
> Fixes: a4b349540a26af ("powerpc/mm: Cleanup LPCR defines")
> 
> Reported-by: Ram Pai 
> Signed-off-by: Aneesh Kumar K.V 
> ---
>  arch/powerpc/include/asm/reg.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
> index 9e1499f98def..1c17e208db78 100644
> --- a/arch/powerpc/include/asm/reg.h
> +++ b/arch/powerpc/include/asm/reg.h
> @@ -337,7 +337,7 @@
>  #define   LPCR_DPFD_SH   52
>  #define   LPCR_DPFD  (ASM_CONST(7) << LPCR_DPFD_SH)
>  #define   LPCR_VRMASD_SH 47
> -#define   LPCR_VRMASD  (ASM_CONST(1) << LPCR_VRMASD_SH)
> +#define   LPCR_VRMASD  (ASM_CONST(1f) << LPCR_VRMASD_SH)

Don't you need an 0x in there?  Did you compile-test this?

Paul.


Re: [PATCH] powerpc/mm: Fixup wrong LPCR_VRMASD value

2016-12-08 Thread Ram Pai
On Thu, Dec 08, 2016 at 09:12:13AM +0530, Aneesh Kumar K.V wrote:
> In commit a4b349540a26af ("powerpc/mm: Cleanup LPCR defines") we updated
> LPCR_VRMASD wrongly as below.
> 
> -#define   LPCR_VRMASD  (0x1ful << (63-16))
> +#define   LPCR_VRMASD_SH   47
> +#define   LPCR_VRMASD  (ASM_CONST(1) << LPCR_VRMASD_SH)
> 
> We initialize the VRMA bits in LPCR to 0x00 in kvm. Hence using a different
> mask value as above while updating lpcr should not have any impact.
> 
> This patch updates it to the correct value.
> Fixes: a4b349540a26af ("powerpc/mm: Cleanup LPCR defines")
> 
> Reported-by: Ram Pai 

  actually this was reported by He Jia.

> Signed-off-by: Aneesh Kumar K.V 
> ---
>  arch/powerpc/include/asm/reg.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
> index 9e1499f98def..1c17e208db78 100644
> --- a/arch/powerpc/include/asm/reg.h
> +++ b/arch/powerpc/include/asm/reg.h
> @@ -337,7 +337,7 @@
>  #define   LPCR_DPFD_SH   52
>  #define   LPCR_DPFD  (ASM_CONST(7) << LPCR_DPFD_SH)
>  #define   LPCR_VRMASD_SH 47
> -#define   LPCR_VRMASD  (ASM_CONST(1) << LPCR_VRMASD_SH)
> +#define   LPCR_VRMASD  (ASM_CONST(1f) << LPCR_VRMASD_SH)
  
Shouldn't this be 0x1f instead of 1f  ?

RP



Re: [PATCH 4/6] pseries: Add support for hash table resizing

2016-12-08 Thread kbuild test robot
Hi David,

[auto build test ERROR on v4.9-rc8]
[cannot apply to powerpc/next kvm/linux-next kvm-ppc/kvm-ppc-next next-20161208]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/David-Gibson/powerpc-Hash-Page-Table-resizing-for-PAPR-guests/20161208-145142
config: powerpc-ps3_defconfig (attached as .config)
compiler: powerpc64-linux-gnu-gcc (Debian 6.1.1-9) 6.1.1 20160705
reproduce:
wget 
https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross
 -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=powerpc 

All errors (new ones prefixed by >>):

   arch/powerpc/mm/hash_utils_64.c: In function 'hash64_debugfs':
>> arch/powerpc/mm/hash_utils_64.c:1838:45: error: 'powerpc_debugfs_root' 
>> undeclared (first use in this function)
 if (!debugfs_create_file("pft-size", 0600, powerpc_debugfs_root,
^~~~
   arch/powerpc/mm/hash_utils_64.c:1838:45: note: each undeclared identifier is 
reported only once for each function it appears in

vim +/powerpc_debugfs_root +1838 arch/powerpc/mm/hash_utils_64.c

  1832  
  1833  DEFINE_SIMPLE_ATTRIBUTE(fops_ppc64_pft_size,
  1834  ppc64_pft_size_get, ppc64_pft_size_set, 
"%llu\n");
  1835  
  1836  static int __init hash64_debugfs(void)
  1837  {
> 1838  if (!debugfs_create_file("pft-size", 0600, powerpc_debugfs_root,
  1839   NULL, &fops_ppc64_pft_size)) {
  1840  pr_err("lpar: unable to create ppc64_pft_size debugsfs 
file\n");
  1841  }

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: application/gzip


Re: [PATCH v3 04/15] livepatch/x86: add TIF_PATCH_PENDING thread flag

2016-12-08 Thread Andy Lutomirski
On Thu, Dec 8, 2016 at 10:08 AM, Josh Poimboeuf  wrote:
> Add the TIF_PATCH_PENDING thread flag to enable the new livepatch
> per-task consistency model for x86_64.  The bit getting set indicates
> the thread has a pending patch which needs to be applied when the thread
> exits the kernel.
>
> The bit is placed in the _TIF_ALLWORK_MASK macro, which results in
> exit_to_usermode_loop() calling klp_update_patch_state() when it's set.
>
> Signed-off-by: Josh Poimboeuf 

Acked-by: Andy Lutomirski 


[PATCH v3 15/15] livepatch: allow removal of a disabled patch

2016-12-08 Thread Josh Poimboeuf
From: Miroslav Benes 

Currently we do not allow patch module to unload since there is no
method to determine if a task is still running in the patched code.

The consistency model gives us a way: when the unpatching
finishes we know that all tasks were marked as safe to call the original
function. Thus every new call to the function calls the original code
and at the same time no task can be somewhere in the patched code,
because it had to leave that code to be marked as safe.

We can safely let the patch module go after that.

Completion is used for synchronization between module removal and sysfs
infrastructure in a similar way to commit 942e443127e9 ("module: Fix
mod->mkobj.kobj potentially freed too early").
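
The completion-based synchronization referred to here follows the standard
kernel pattern: signal in the kobject release callback, wait after the
final kobject_put(). A minimal sketch (the finish field is the one this
patch adds; the rest is the usual idiom, not the literal patch body):

#include <linux/completion.h>
#include <linux/kobject.h>

/* Release callback: runs once the last sysfs reference is dropped. */
static void klp_kobj_release_patch(struct kobject *kobj)
{
	struct klp_patch *patch = container_of(kobj, struct klp_patch, kobj);

	complete(&patch->finish);
}

static void klp_unregister_sketch(struct klp_patch *patch)
{
	kobject_put(&patch->kobj);
	/* Block until the release callback has run; only then is it safe
	 * for the patch module that owns this kobject to be unloaded. */
	wait_for_completion(&patch->finish);
}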

Note that we still do not allow removal for the immediate model, that
is, with no consistency model. The module refcount may increase in this case if
somebody disables and enables the patch several times. This should not
cause any harm.

With this change a call to try_module_get() is moved to
__klp_enable_patch from klp_register_patch to make module reference
counting symmetric (module_put() is in a patch disable path) and to
allow to take a new reference to a disabled module when being enabled.

Also all kobject_put(&patch->kobj) calls are moved outside of klp_mutex
lock protection to prevent a deadlock situation when
klp_unregister_patch is called and sysfs directories are removed. There
is no need to do the same for other kobject_put() callsites as we
currently do not have their sysfs counterparts.

Signed-off-by: Miroslav Benes 
Signed-off-by: Josh Poimboeuf 
---
 Documentation/livepatch/livepatch.txt | 29 -
 include/linux/livepatch.h |  3 ++
 kernel/livepatch/core.c   | 80 ++-
 kernel/livepatch/transition.c | 12 +-
 samples/livepatch/livepatch-sample.c  |  1 -
 5 files changed, 72 insertions(+), 53 deletions(-)

diff --git a/Documentation/livepatch/livepatch.txt 
b/Documentation/livepatch/livepatch.txt
index f87e742..b0eaaf8 100644
--- a/Documentation/livepatch/livepatch.txt
+++ b/Documentation/livepatch/livepatch.txt
@@ -265,8 +265,15 @@ section "Livepatch life-cycle" below for more details 
about these
 two operations.
 
 Module removal is only safe when there are no users of the underlying
-functions.  The immediate consistency model is not able to detect this;
-therefore livepatch modules cannot be removed. See "Limitations" below.
+functions. The immediate consistency model is not able to detect this. The
+code just redirects the functions at the very beginning and it does not
+check if the functions are in use. In other words, it knows when the
+functions get called but it does not know when the functions return.
+Therefore it cannot be decided when the livepatch module can be safely
+removed. This is solved by a hybrid consistency model. When the system is
+transitioned to a new patch state (patched/unpatched) it is guaranteed that
+no task sleeps or runs in the old code.
+
 
 5. Livepatch life-cycle
 ===
@@ -437,24 +444,6 @@ The current Livepatch implementation has several 
limitations:
 There is work in progress to remove this limitation.
 
 
-  + Livepatch modules can not be removed.
-
-The current implementation just redirects the functions at the very
-beginning. It does not check if the functions are in use. In other
-words, it knows when the functions get called but it does not
-know when the functions return. Therefore it can not decide when
-the livepatch module can be safely removed.
-
-This will get most likely solved once a more complex consistency model
-is supported. The idea is that a safe state for patching should also
-mean a safe state for removing the patch.
-
-Note that the patch itself might get disabled by writing zero
-to /sys/kernel/livepatch/<patch>/enabled. It causes that the new
-code will not longer get called. But it does not guarantee
-that anyone is not sleeping anywhere in the new code.
-
-
   + Livepatch works reliably only when the dynamic ftrace is located at
 the very beginning of the function.
 
diff --git a/include/linux/livepatch.h b/include/linux/livepatch.h
index 8e06fe5..1959e52 100644
--- a/include/linux/livepatch.h
+++ b/include/linux/livepatch.h
@@ -23,6 +23,7 @@
 
 #include 
 #include 
+#include 
 
 #if IS_ENABLED(CONFIG_LIVEPATCH)
 
@@ -114,6 +115,7 @@ struct klp_object {
  * @list:  list node for global list of registered patches
  * @kobj:  kobject for sysfs resources
  * @enabled:   the patch is enabled (but operation may be incomplete)
+ * @finish:for waiting till it is safe to remove the patch module
  */
 struct klp_patch {
/* external */
@@ -125,6 +127,7 @@ struct klp_patch {
struct list_head list;
struct kobject kobj;
bool enabled;
+   struct completion finish;
 };
 
 #define 

[PATCH v3 14/15] livepatch: add /proc/<pid>/patch_state

2016-12-08 Thread Josh Poimboeuf
Expose the per-task patch state value so users can determine which tasks
are holding up completion of a patching operation.

Signed-off-by: Josh Poimboeuf 
---
 Documentation/filesystems/proc.txt | 18 ++
 fs/proc/base.c | 15 +++
 2 files changed, 33 insertions(+)

diff --git a/Documentation/filesystems/proc.txt 
b/Documentation/filesystems/proc.txt
index 72624a1..85c501b 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -44,6 +44,7 @@ Table of Contents
  3.8   /proc/<pid>/fdinfo/<fd> - Information about opened file
  3.9   /proc/<pid>/map_files - Information about memory mapped files
  3.10  /proc/<pid>/timerslack_ns - Task timerslack value
+  3.11 /proc/<pid>/patch_state - Livepatch patch operation state
 
   4Configuring procfs
   4.1  Mount options
@@ -1886,6 +1887,23 @@ Valid values are from 0 - ULLONG_MAX
 An application setting the value must have PTRACE_MODE_ATTACH_FSCREDS level
 permissions on the task specified to change its timerslack_ns value.
 
+3.11   /proc/<pid>/patch_state - Livepatch patch operation state
+-
+When CONFIG_LIVEPATCH is enabled, this file displays the value of the
+patch state for the task.
+
+A value of '-1' indicates that no patch is in transition.
+
+A value of '0' indicates that a patch is in transition and the task is
+unpatched.  If the patch is being enabled, then the task hasn't been
+patched yet.  If the patch is being disabled, then the task has already
+been unpatched.
+
+A value of '1' indicates that a patch is in transition and the task is
+patched.  If the patch is being enabled, then the task has already been
+patched.  If the patch is being disabled, then the task hasn't been
+unpatched yet.
+
 
 --
 Configuring procfs
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 5ea8363..2e1e012 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2841,6 +2841,15 @@ static int proc_pid_personality(struct seq_file *m, 
struct pid_namespace *ns,
return err;
 }
 
+#ifdef CONFIG_LIVEPATCH
+static int proc_pid_patch_state(struct seq_file *m, struct pid_namespace *ns,
+   struct pid *pid, struct task_struct *task)
+{
+   seq_printf(m, "%d\n", task->patch_state);
+   return 0;
+}
+#endif /* CONFIG_LIVEPATCH */
+
 /*
  * Thread groups
  */
@@ -2940,6 +2949,9 @@ static const struct pid_entry tgid_base_stuff[] = {
REG("timers", S_IRUGO, proc_timers_operations),
 #endif
REG("timerslack_ns", S_IRUGO|S_IWUGO, 
proc_pid_set_timerslack_ns_operations),
+#ifdef CONFIG_LIVEPATCH
+   ONE("patch_state",  S_IRUSR, proc_pid_patch_state),
+#endif
 };
 
 static int proc_tgid_base_readdir(struct file *file, struct dir_context *ctx)
@@ -3320,6 +3332,9 @@ static const struct pid_entry tid_base_stuff[] = {
REG("projid_map", S_IRUGO|S_IWUSR, proc_projid_map_operations),
REG("setgroups",  S_IRUGO|S_IWUSR, proc_setgroups_operations),
 #endif
+#ifdef CONFIG_LIVEPATCH
+   ONE("patch_state",  S_IRUSR, proc_pid_patch_state),
+#endif
 };
 
 static int proc_tid_base_readdir(struct file *file, struct dir_context *ctx)
-- 
2.7.4



[PATCH v3 13/15] livepatch: change to a per-task consistency model

2016-12-08 Thread Josh Poimboeuf
Change livepatch to use a basic per-task consistency model.  This is the
foundation which will eventually enable us to patch those ~10% of
security patches which change function or data semantics.  This is the
biggest remaining piece needed to make livepatch more generally useful.

This code stems from the design proposal made by Vojtech [1] in November
2014.  It's a hybrid of kGraft and kpatch: it uses kGraft's per-task
consistency and syscall barrier switching combined with kpatch's stack
trace switching.  There are also a number of fallback options which make
it quite flexible.

Patches are applied on a per-task basis, when the task is deemed safe to
switch over.  When a patch is enabled, livepatch enters into a
transition state where tasks are converging to the patched state.
Usually this transition state can complete in a few seconds.  The same
sequence occurs when a patch is disabled, except the tasks converge from
the patched state to the unpatched state.

An interrupt handler inherits the patched state of the task it
interrupts.  The same is true for forked tasks: the child inherits the
patched state of the parent.

Livepatch uses several complementary approaches to determine when it's
safe to patch tasks:

1. The first and most effective approach is stack checking of sleeping
   tasks.  If no affected functions are on the stack of a given task,
   the task is patched.  In most cases this will patch most or all of
   the tasks on the first try.  Otherwise it'll keep trying
   periodically.  This option is only available if the architecture has
   reliable stacks (HAVE_RELIABLE_STACKTRACE).

2. The second approach, if needed, is kernel exit switching.  A
   task is switched when it returns to user space from a system call, a
   user space IRQ, or a signal.  It's useful in the following cases:

   a) Patching I/O-bound user tasks which are sleeping on an affected
  function.  In this case you have to send SIGSTOP and SIGCONT to
  force it to exit the kernel and be patched.
   b) Patching CPU-bound user tasks.  If the task is highly CPU-bound
  then it will get patched the next time it gets interrupted by an
  IRQ.
   c) In the future it could be useful for applying patches for
  architectures which don't yet have HAVE_RELIABLE_STACKTRACE.  In
  this case you would have to signal most of the tasks on the
  system.  However this isn't supported yet because there's
  currently no way to patch kthreads without
  HAVE_RELIABLE_STACKTRACE.

3. For idle "swapper" tasks, since they don't ever exit the kernel, they
   instead have a klp_update_patch_state() call in the idle loop which
   allows them to be patched before the CPU enters the idle state.

   (Note there's not yet such an approach for kthreads.)

All the above approaches may be skipped by setting the 'immediate' flag
in the 'klp_patch' struct, which will disable per-task consistency and
patch all tasks immediately.  This can be useful if the patch doesn't
change any function or data semantics.  Note that, even with this flag
set, it's possible that some tasks may still be running with an old
version of the function, until that function returns.

There's also an 'immediate' flag in the 'klp_func' struct which allows
you to specify that certain functions in the patch can be applied
without per-task consistency.  This might be useful if you want to patch
a common function like schedule(), and the function change doesn't need
consistency but the rest of the patch does.

For architectures which don't have HAVE_RELIABLE_STACKTRACE, the user
must set patch->immediate which causes all tasks to be patched
immediately.  This option should be used with care, only when the patch
doesn't change any function or data semantics.
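
For instance, a patch module opting out of the consistency model entirely
would set the flag in its klp_patch definition, along the lines of the
livepatch sample (function names here are placeholders, and the immediate
fields are the ones introduced by this series):

#include <linux/module.h>
#include <linux/livepatch.h>

static int livepatch_example_function(void)
{
	return 0;	/* placeholder replacement body */
}

static struct klp_func funcs[] = {
	{
		.old_name = "example_function",		/* placeholder */
		.new_func = livepatch_example_function,
		/* per-function opt-out would be: .immediate = true, */
	}, { }
};

static struct klp_object objs[] = {
	{
		/* name being NULL means vmlinux */
		.funcs = funcs,
	}, { }
};

static struct klp_patch patch = {
	.mod = THIS_MODULE,
	.objs = objs,
	.immediate = true,	/* skip per-task consistency entirely */
};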

In the future, architectures which don't have HAVE_RELIABLE_STACKTRACE
may be allowed to use per-task consistency if we can come up with
another way to patch kthreads.

The /sys/kernel/livepatch/<patch>/transition file shows whether a patch
is in transition.  Only a single patch (the topmost patch on the stack)
can be in transition at a given time.  A patch can remain in transition
indefinitely, if any of the tasks are stuck in the initial patch state.

A transition can be reversed and effectively canceled by writing the
opposite value to the /sys/kernel/livepatch/<patch>/enabled file while
the transition is in progress.  Then all the tasks will attempt to
converge back to the original patch state.

[1] https://lkml.kernel.org/r/20141107140458.ga21...@suse.cz

Signed-off-by: Josh Poimboeuf 
---
 Documentation/ABI/testing/sysfs-kernel-livepatch |   8 +
 Documentation/livepatch/livepatch.txt| 127 +-
 include/linux/init_task.h|   9 +
 include/linux/livepatch.h|  40 +-
 include/linux/sched.h|   3 +
 kernel/fork.c|   3 +
 kernel/livepatch/Makefile|   

[PATCH v3 12/15] livepatch: store function sizes

2016-12-08 Thread Josh Poimboeuf
For the consistency model we'll need to know the sizes of the old and
new functions to determine if they're on the stacks of any tasks.

Signed-off-by: Josh Poimboeuf 
---
 include/linux/livepatch.h |  3 +++
 kernel/livepatch/core.c   | 16 
 2 files changed, 19 insertions(+)

diff --git a/include/linux/livepatch.h b/include/linux/livepatch.h
index 1e2eb91..1a5a93c 100644
--- a/include/linux/livepatch.h
+++ b/include/linux/livepatch.h
@@ -37,6 +37,8 @@
  * @old_addr:  the address of the function being patched
  * @kobj:  kobject for sysfs resources
  * @stack_node:list node for klp_ops func_stack list
+ * @old_size:  size of the old function
+ * @new_size:  size of the new function
  * @patched:   the func has been added to the klp_ops list
  */
 struct klp_func {
@@ -56,6 +58,7 @@ struct klp_func {
unsigned long old_addr;
struct kobject kobj;
struct list_head stack_node;
+   unsigned long old_size, new_size;
bool patched;
 };
 
diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c
index 8ca8a0e..fc160c6 100644
--- a/kernel/livepatch/core.c
+++ b/kernel/livepatch/core.c
@@ -584,6 +584,22 @@ static int klp_init_object_loaded(struct klp_patch *patch,
  &func->old_addr);
if (ret)
return ret;
+
+   ret = kallsyms_lookup_size_offset(func->old_addr,
+ &func->old_size, NULL);
+   if (!ret) {
+   pr_err("kallsyms size lookup failed for '%s'\n",
+  func->old_name);
+   return -ENOENT;
+   }
+
+   ret = kallsyms_lookup_size_offset((unsigned long)func->new_func,
+ &func->new_size, NULL);
+   if (!ret) {
+   pr_err("kallsyms size lookup failed for '%s' 
replacement\n",
+  func->old_name);
+   return -ENOENT;
+   }
}
 
return 0;
-- 
2.7.4



[PATCH v3 11/15] livepatch: use kstrtobool() in enabled_store()

2016-12-08 Thread Josh Poimboeuf
The sysfs enabled value is a boolean, so kstrtobool() is a better fit
for parsing the input string since it does the range checking for us.

Suggested-by: Petr Mladek 
Signed-off-by: Josh Poimboeuf 
---
 kernel/livepatch/core.c | 11 ---
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c
index 6a137e1..8ca8a0e 100644
--- a/kernel/livepatch/core.c
+++ b/kernel/livepatch/core.c
@@ -408,26 +408,23 @@ static ssize_t enabled_store(struct kobject *kobj, struct kobj_attribute *attr,
 {
struct klp_patch *patch;
int ret;
-   unsigned long val;
+   bool enabled;
 
-   ret = kstrtoul(buf, 10, &val);
+   ret = kstrtobool(buf, &enabled);
if (ret)
return -EINVAL;
 
-   if (val > 1)
-   return -EINVAL;
-
patch = container_of(kobj, struct klp_patch, kobj);
 
	mutex_lock(&klp_mutex);
 
-   if (patch->enabled == val) {
+   if (patch->enabled == enabled) {
/* already in requested state */
ret = -EINVAL;
goto err;
}
 
-   if (val) {
+   if (enabled) {
ret = __klp_enable_patch(patch);
if (ret)
goto err;
-- 
2.7.4



[PATCH v3 10/15] livepatch: move patching functions into patch.c

2016-12-08 Thread Josh Poimboeuf
Move functions related to the actual patching of functions and objects
into a new patch.c file.

Signed-off-by: Josh Poimboeuf 
---
 kernel/livepatch/Makefile |   2 +-
 kernel/livepatch/core.c   | 202 +--
 kernel/livepatch/patch.c  | 213 ++
 kernel/livepatch/patch.h  |  32 +++
 4 files changed, 247 insertions(+), 202 deletions(-)
 create mode 100644 kernel/livepatch/patch.c
 create mode 100644 kernel/livepatch/patch.h

diff --git a/kernel/livepatch/Makefile b/kernel/livepatch/Makefile
index e8780c0..e136dad 100644
--- a/kernel/livepatch/Makefile
+++ b/kernel/livepatch/Makefile
@@ -1,3 +1,3 @@
 obj-$(CONFIG_LIVEPATCH) += livepatch.o
 
-livepatch-objs := core.o
+livepatch-objs := core.o patch.o
diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c
index 47ed643..6a137e1 100644
--- a/kernel/livepatch/core.c
+++ b/kernel/livepatch/core.c
@@ -24,32 +24,13 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
 #include 
 #include 
 #include 
-
-/**
- * struct klp_ops - structure for tracking registered ftrace ops structs
- *
- * A single ftrace_ops is shared between all enabled replacement functions
- * (klp_func structs) which have the same old_addr.  This allows the switch
- * between function versions to happen instantaneously by updating the klp_ops
- * struct's func_stack list.  The winner is the klp_func at the top of the
- * func_stack (front of the list).
- *
- * @node:  node for the global klp_ops list
- * @func_stack:	list head for the stack of klp_func's (active func is on top)
- * @fops:  registered ftrace ops struct
- */
-struct klp_ops {
-   struct list_head node;
-   struct list_head func_stack;
-   struct ftrace_ops fops;
-};
+#include "patch.h"
 
 /*
  * The klp_mutex protects the global lists and state transitions of any
@@ -60,28 +41,12 @@ struct klp_ops {
 static DEFINE_MUTEX(klp_mutex);
 
 static LIST_HEAD(klp_patches);
-static LIST_HEAD(klp_ops);
 
 static struct kobject *klp_root_kobj;
 
 /* TODO: temporary stub */
 void klp_update_patch_state(struct task_struct *task) {}
 
-static struct klp_ops *klp_find_ops(unsigned long old_addr)
-{
-   struct klp_ops *ops;
-   struct klp_func *func;
-
-   list_for_each_entry(ops, &klp_ops, node) {
-   func = list_first_entry(&ops->func_stack, struct klp_func,
-   stack_node);
-   if (func->old_addr == old_addr)
-   return ops;
-   }
-
-   return NULL;
-}
-
 static bool klp_is_module(struct klp_object *obj)
 {
return obj->name;
@@ -314,171 +279,6 @@ static int klp_write_object_relocations(struct module *pmod,
return ret;
 }
 
-static void notrace klp_ftrace_handler(unsigned long ip,
-  unsigned long parent_ip,
-  struct ftrace_ops *fops,
-  struct pt_regs *regs)
-{
-   struct klp_ops *ops;
-   struct klp_func *func;
-
-   ops = container_of(fops, struct klp_ops, fops);
-
-   rcu_read_lock();
-   func = list_first_or_null_rcu(&ops->func_stack, struct klp_func,
- stack_node);
-   if (WARN_ON_ONCE(!func))
-   goto unlock;
-
-   klp_arch_set_pc(regs, (unsigned long)func->new_func);
-unlock:
-   rcu_read_unlock();
-}
-
-/*
- * Convert a function address into the appropriate ftrace location.
- *
- * Usually this is just the address of the function, but on some architectures
- * it's more complicated so allow them to provide a custom behaviour.
- */
-#ifndef klp_get_ftrace_location
-static unsigned long klp_get_ftrace_location(unsigned long faddr)
-{
-   return faddr;
-}
-#endif
-
-static void klp_unpatch_func(struct klp_func *func)
-{
-   struct klp_ops *ops;
-
-   if (WARN_ON(!func->patched))
-   return;
-   if (WARN_ON(!func->old_addr))
-   return;
-
-   ops = klp_find_ops(func->old_addr);
-   if (WARN_ON(!ops))
-   return;
-
-   if (list_is_singular(&ops->func_stack)) {
-   unsigned long ftrace_loc;
-
-   ftrace_loc = klp_get_ftrace_location(func->old_addr);
-   if (WARN_ON(!ftrace_loc))
-   return;
-
-   WARN_ON(unregister_ftrace_function(&ops->fops));
-   WARN_ON(ftrace_set_filter_ip(&ops->fops, ftrace_loc, 1, 0));
-
-   list_del_rcu(&func->stack_node);
-   list_del(&ops->node);
-   kfree(ops);
-   } else {
-   list_del_rcu(&func->stack_node);
-   }
-
-   func->patched = false;
-}
-
-static int klp_patch_func(struct klp_func *func)
-{
-   struct klp_ops *ops;
-   int ret;
-
-   if (WARN_ON(!func->old_addr))
-   return -EINVAL;
-
-   if (WARN_ON(func->patched))
-   return -EINVAL;
-
-   

[PATCH v3 09/15] livepatch: remove unnecessary object loaded check

2016-12-08 Thread Josh Poimboeuf
klp_patch_object()'s callers already ensure that the object is loaded,
so its call to klp_is_object_loaded() is unnecessary.

This will also make it possible to move the patching code into a
separate file.

Signed-off-by: Josh Poimboeuf 
---
 kernel/livepatch/core.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c
index 2dbd355..47ed643 100644
--- a/kernel/livepatch/core.c
+++ b/kernel/livepatch/core.c
@@ -467,9 +467,6 @@ static int klp_patch_object(struct klp_object *obj)
if (WARN_ON(obj->patched))
return -EINVAL;
 
-   if (WARN_ON(!klp_is_object_loaded(obj)))
-   return -EINVAL;
-
klp_for_each_func(obj, func) {
ret = klp_patch_func(func);
if (ret) {
-- 
2.7.4



[PATCH v3 03/15] livepatch: temporary stubs for klp_patch_pending() and klp_update_patch_state()

2016-12-08 Thread Josh Poimboeuf
Create temporary stubs for klp_patch_pending() and
klp_update_patch_state() so we can add TIF_PATCH_PENDING to different
architectures in separate patches without breaking build bisectability.

Signed-off-by: Josh Poimboeuf 
---
 include/linux/livepatch.h | 7 ++-
 kernel/livepatch/core.c   | 3 +++
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/include/linux/livepatch.h b/include/linux/livepatch.h
index 9072f04..60558d8 100644
--- a/include/linux/livepatch.h
+++ b/include/linux/livepatch.h
@@ -123,10 +123,15 @@ void arch_klp_init_object_loaded(struct klp_patch *patch,
 int klp_module_coming(struct module *mod);
 void klp_module_going(struct module *mod);
 
+static inline bool klp_patch_pending(struct task_struct *task) { return false; }
+void klp_update_patch_state(struct task_struct *task);
+
 #else /* !CONFIG_LIVEPATCH */
 
 static inline int klp_module_coming(struct module *mod) { return 0; }
-static inline void klp_module_going(struct module *mod) { }
+static inline void klp_module_going(struct module *mod) {}
+static inline bool klp_patch_pending(struct task_struct *task) { return false; }
+static inline void klp_update_patch_state(struct task_struct *task) {}
 
 #endif /* CONFIG_LIVEPATCH */
 
diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c
index af46438..217b39d 100644
--- a/kernel/livepatch/core.c
+++ b/kernel/livepatch/core.c
@@ -64,6 +64,9 @@ static LIST_HEAD(klp_ops);
 
 static struct kobject *klp_root_kobj;
 
+/* TODO: temporary stub */
+void klp_update_patch_state(struct task_struct *task) {}
+
 static struct klp_ops *klp_find_ops(unsigned long old_addr)
 {
struct klp_ops *ops;
-- 
2.7.4



[PATCH v3 08/15] livepatch: separate enabled and patched states

2016-12-08 Thread Josh Poimboeuf
Once we have a consistency model, patches and their objects will be
enabled and disabled at different times.  For example, when a patch is
disabled, its loaded objects' funcs can remain registered with ftrace
indefinitely until the unpatching operation is complete and they're no
longer in use.

It's less confusing if we give them different names: patches can be
enabled or disabled; objects (and their funcs) can be patched or
unpatched:

- Enabled means that a patch is logically enabled (but not necessarily
  fully applied).

- Patched means that an object's funcs are registered with ftrace and
  added to the klp_ops func stack.

Also, since these states are binary, represent them with booleans
instead of ints.

Signed-off-by: Josh Poimboeuf 
---
 include/linux/livepatch.h | 17 ---
 kernel/livepatch/core.c   | 72 +++
 2 files changed, 42 insertions(+), 47 deletions(-)

diff --git a/include/linux/livepatch.h b/include/linux/livepatch.h
index 60558d8..1e2eb91 100644
--- a/include/linux/livepatch.h
+++ b/include/linux/livepatch.h
@@ -28,11 +28,6 @@
 
 #include 
 
-enum klp_state {
-   KLP_DISABLED,
-   KLP_ENABLED
-};
-
 /**
  * struct klp_func - function structure for live patching
  * @old_name:  name of the function to be patched
@@ -41,8 +36,8 @@ enum klp_state {
  * can be found (optional)
  * @old_addr:  the address of the function being patched
  * @kobj:  kobject for sysfs resources
- * @state: tracks function-level patch application state
  * @stack_node:list node for klp_ops func_stack list
+ * @patched:   the func has been added to the klp_ops list
  */
 struct klp_func {
/* external */
@@ -60,8 +55,8 @@ struct klp_func {
/* internal */
unsigned long old_addr;
struct kobject kobj;
-   enum klp_state state;
struct list_head stack_node;
+   bool patched;
 };
 
 /**
@@ -71,7 +66,7 @@ struct klp_func {
  * @kobj:  kobject for sysfs resources
  * @mod:   kernel module associated with the patched object
  * (NULL for vmlinux)
- * @state: tracks object-level patch application state
+ * @patched:   the object's funcs have been added to the klp_ops list
  */
 struct klp_object {
/* external */
@@ -81,7 +76,7 @@ struct klp_object {
/* internal */
struct kobject kobj;
struct module *mod;
-   enum klp_state state;
+   bool patched;
 };
 
 /**
@@ -90,7 +85,7 @@ struct klp_object {
  * @objs:  object entries for kernel objects to be patched
  * @list:  list node for global list of registered patches
  * @kobj:  kobject for sysfs resources
- * @state: tracks patch-level application state
+ * @enabled:   the patch is enabled (but operation may be incomplete)
  */
 struct klp_patch {
/* external */
@@ -100,7 +95,7 @@ struct klp_patch {
/* internal */
struct list_head list;
struct kobject kobj;
-   enum klp_state state;
+   bool enabled;
 };
 
 #define klp_for_each_object(patch, obj) \
diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c
index 217b39d..2dbd355 100644
--- a/kernel/livepatch/core.c
+++ b/kernel/livepatch/core.c
@@ -348,11 +348,11 @@ static unsigned long klp_get_ftrace_location(unsigned long faddr)
 }
 #endif
 
-static void klp_disable_func(struct klp_func *func)
+static void klp_unpatch_func(struct klp_func *func)
 {
struct klp_ops *ops;
 
-   if (WARN_ON(func->state != KLP_ENABLED))
+   if (WARN_ON(!func->patched))
return;
if (WARN_ON(!func->old_addr))
return;
@@ -378,10 +378,10 @@ static void klp_disable_func(struct klp_func *func)
 	list_del_rcu(&func->stack_node);
}
 
-   func->state = KLP_DISABLED;
+   func->patched = false;
 }
 
-static int klp_enable_func(struct klp_func *func)
+static int klp_patch_func(struct klp_func *func)
 {
struct klp_ops *ops;
int ret;
@@ -389,7 +389,7 @@ static int klp_enable_func(struct klp_func *func)
if (WARN_ON(!func->old_addr))
return -EINVAL;
 
-   if (WARN_ON(func->state != KLP_DISABLED))
+   if (WARN_ON(func->patched))
return -EINVAL;
 
ops = klp_find_ops(func->old_addr);
@@ -437,7 +437,7 @@ static int klp_enable_func(struct klp_func *func)
 	list_add_rcu(&func->stack_node, &ops->func_stack);
}
 
-   func->state = KLP_ENABLED;
+   func->patched = true;
 
return 0;
 
@@ -448,36 +448,36 @@ static int klp_enable_func(struct klp_func *func)
return ret;
 }
 
-static void klp_disable_object(struct klp_object *obj)
+static void klp_unpatch_object(struct klp_object *obj)
 {
struct klp_func *func;
 
klp_for_each_func(obj, func)
-   if (func->state == KLP_ENABLED)
-   klp_disable_func(func);
+   if (func->patched)
+   

[PATCH v3 07/15] livepatch/s390: add TIF_PATCH_PENDING thread flag

2016-12-08 Thread Josh Poimboeuf
From: Miroslav Benes 

Update a task's patch state when returning from a system call or user
space interrupt, or after handling a signal.

This greatly increases the chances of a patch operation succeeding.  If
a task is I/O bound, it can be patched when returning from a system
call.  If a task is CPU bound, it can be patched when returning from an
interrupt.  If a task is sleeping on a to-be-patched function, the user
can send SIGSTOP and SIGCONT to force it to switch.

Since there are two ways a syscall can be restarted on return from
signal handling, it is important to clear the flag before do_signal()
is called. Otherwise we could miss the migration if we used the
SIGSTOP/SIGCONT procedure or a fake signal to migrate patching-blocking
tasks. If we place our hook at the sysc_work label in the entry code,
before TIF_SIGPENDING is evaluated, we kill two birds with one stone:
the task is correctly migrated in all return paths from a syscall.

Signed-off-by: Miroslav Benes 
Signed-off-by: Josh Poimboeuf 
---
 arch/s390/include/asm/thread_info.h |  2 ++
 arch/s390/kernel/entry.S| 31 ++-
 2 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/arch/s390/include/asm/thread_info.h b/arch/s390/include/asm/thread_info.h
index 4977668..646845e 100644
--- a/arch/s390/include/asm/thread_info.h
+++ b/arch/s390/include/asm/thread_info.h
@@ -56,6 +56,7 @@ int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src);
 #define TIF_SIGPENDING 1   /* signal pending */
 #define TIF_NEED_RESCHED   2   /* rescheduling necessary */
 #define TIF_UPROBE 3   /* breakpointed or single-stepping */
+#define TIF_PATCH_PENDING  4   /* pending live patching update */
 
 #define TIF_31BIT  16  /* 32bit process */
 #define TIF_MEMDIE 17  /* is terminating due to OOM killer */
@@ -74,6 +75,7 @@ int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src);
 #define _TIF_SIGPENDING_BITUL(TIF_SIGPENDING)
 #define _TIF_NEED_RESCHED  _BITUL(TIF_NEED_RESCHED)
 #define _TIF_UPROBE_BITUL(TIF_UPROBE)
+#define _TIF_PATCH_PENDING _BITUL(TIF_PATCH_PENDING)
 
 #define _TIF_31BIT _BITUL(TIF_31BIT)
 #define _TIF_SINGLE_STEP   _BITUL(TIF_SINGLE_STEP)
diff --git a/arch/s390/kernel/entry.S b/arch/s390/kernel/entry.S
index 161f4e6..33848a8 100644
--- a/arch/s390/kernel/entry.S
+++ b/arch/s390/kernel/entry.S
@@ -47,7 +47,7 @@ STACK_SIZE  = 1 << STACK_SHIFT
 STACK_INIT = STACK_SIZE - STACK_FRAME_OVERHEAD - __PT_SIZE
 
 _TIF_WORK  = (_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_NEED_RESCHED | \
-  _TIF_UPROBE)
+  _TIF_UPROBE | _TIF_PATCH_PENDING)
 _TIF_TRACE = (_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | _TIF_SECCOMP | \
   _TIF_SYSCALL_TRACEPOINT)
 _CIF_WORK  = (_CIF_MCCK_PENDING | _CIF_ASCE | _CIF_FPU)
@@ -352,6 +352,11 @@ ENTRY(system_call)
 #endif
TSTMSK  __PT_FLAGS(%r11),_PIF_PER_TRAP
jo  .Lsysc_singlestep
+#ifdef CONFIG_LIVEPATCH
+   TSTMSK  __TI_flags(%r12),_TIF_PATCH_PENDING
+   jo  .Lsysc_patch_pending# handle live patching just before
+   # signals and possible syscall restart
+#endif
TSTMSK  __TI_flags(%r12),_TIF_SIGPENDING
jo  .Lsysc_sigpending
TSTMSK  __TI_flags(%r12),_TIF_NOTIFY_RESUME
@@ -426,6 +431,16 @@ ENTRY(system_call)
 #endif
 
 #
+# _TIF_PATCH_PENDING is set, call klp_update_patch_state
+#
+#ifdef CONFIG_LIVEPATCH
+.Lsysc_patch_pending:
+   lg  %r2,__LC_CURRENT# pass pointer to task struct
+   larl%r14,.Lsysc_return
+   jg  klp_update_patch_state
+#endif
+
+#
 # _PIF_PER_TRAP is set, call do_per_trap
 #
 .Lsysc_singlestep:
@@ -674,6 +689,10 @@ ENTRY(io_int_handler)
jo  .Lio_mcck_pending
TSTMSK  __TI_flags(%r12),_TIF_NEED_RESCHED
jo  .Lio_reschedule
+#ifdef CONFIG_LIVEPATCH
+   TSTMSK  __TI_flags(%r12),_TIF_PATCH_PENDING
+   jo  .Lio_patch_pending
+#endif
TSTMSK  __TI_flags(%r12),_TIF_SIGPENDING
jo  .Lio_sigpending
TSTMSK  __TI_flags(%r12),_TIF_NOTIFY_RESUME
@@ -720,6 +739,16 @@ ENTRY(io_int_handler)
j   .Lio_return
 
 #
+# _TIF_PATCH_PENDING is set, call klp_update_patch_state
+#
+#ifdef CONFIG_LIVEPATCH
+.Lio_patch_pending:
+   lg  %r2,__LC_CURRENT# pass pointer to task struct
+   larl%r14,.Lio_return
+   jg  klp_update_patch_state
+#endif
+
+#
 # _TIF_SIGPENDING or is set, call do_signal
 #
 .Lio_sigpending:
-- 
2.7.4



[PATCH v3 06/15] livepatch/s390: reorganize TIF thread flag bits

2016-12-08 Thread Josh Poimboeuf
From: Jiri Slaby 

Group the TIF thread flag bits by their inclusion in the _TIF_WORK and
_TIF_TRACE macros.

Signed-off-by: Jiri Slaby 
Signed-off-by: Josh Poimboeuf 
---
 arch/s390/include/asm/thread_info.h | 22 ++
 1 file changed, 14 insertions(+), 8 deletions(-)

diff --git a/arch/s390/include/asm/thread_info.h b/arch/s390/include/asm/thread_info.h
index a5b54a4..4977668 100644
--- a/arch/s390/include/asm/thread_info.h
+++ b/arch/s390/include/asm/thread_info.h
@@ -51,14 +51,12 @@ int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src);
 /*
  * thread information flags bit numbers
  */
+/* _TIF_WORK bits */
 #define TIF_NOTIFY_RESUME  0   /* callback before returning to user */
 #define TIF_SIGPENDING 1   /* signal pending */
 #define TIF_NEED_RESCHED   2   /* rescheduling necessary */
-#define TIF_SYSCALL_TRACE  3   /* syscall trace active */
-#define TIF_SYSCALL_AUDIT  4   /* syscall auditing active */
-#define TIF_SECCOMP5   /* secure computing */
-#define TIF_SYSCALL_TRACEPOINT 6   /* syscall tracepoint instrumentation */
-#define TIF_UPROBE 7   /* breakpointed or single-stepping */
+#define TIF_UPROBE 3   /* breakpointed or single-stepping */
+
 #define TIF_31BIT  16  /* 32bit process */
 #define TIF_MEMDIE 17  /* is terminating due to OOM killer */
 #define TIF_RESTORE_SIGMASK18  /* restore signal mask in do_signal() */
@@ -66,15 +64,23 @@ int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src);
 #define TIF_BLOCK_STEP 20  /* This task is block stepped */
 #define TIF_UPROBE_SINGLESTEP  21  /* This task is uprobe single stepped */
 
+/* _TIF_TRACE bits */
+#define TIF_SYSCALL_TRACE  24  /* syscall trace active */
+#define TIF_SYSCALL_AUDIT  25  /* syscall auditing active */
+#define TIF_SECCOMP26  /* secure computing */
+#define TIF_SYSCALL_TRACEPOINT 27  /* syscall tracepoint instrumentation */
+
 #define _TIF_NOTIFY_RESUME _BITUL(TIF_NOTIFY_RESUME)
 #define _TIF_SIGPENDING_BITUL(TIF_SIGPENDING)
 #define _TIF_NEED_RESCHED  _BITUL(TIF_NEED_RESCHED)
+#define _TIF_UPROBE_BITUL(TIF_UPROBE)
+
+#define _TIF_31BIT _BITUL(TIF_31BIT)
+#define _TIF_SINGLE_STEP   _BITUL(TIF_SINGLE_STEP)
+
 #define _TIF_SYSCALL_TRACE _BITUL(TIF_SYSCALL_TRACE)
 #define _TIF_SYSCALL_AUDIT _BITUL(TIF_SYSCALL_AUDIT)
 #define _TIF_SECCOMP   _BITUL(TIF_SECCOMP)
 #define _TIF_SYSCALL_TRACEPOINT_BITUL(TIF_SYSCALL_TRACEPOINT)
-#define _TIF_UPROBE_BITUL(TIF_UPROBE)
-#define _TIF_31BIT _BITUL(TIF_31BIT)
-#define _TIF_SINGLE_STEP   _BITUL(TIF_SINGLE_STEP)
 
 #endif /* _ASM_THREAD_INFO_H */
-- 
2.7.4



[PATCH v3 05/15] livepatch/powerpc: add TIF_PATCH_PENDING thread flag

2016-12-08 Thread Josh Poimboeuf
Add the TIF_PATCH_PENDING thread flag to enable the new livepatch
per-task consistency model for powerpc.  The bit getting set indicates
the thread has a pending patch which needs to be applied when the thread
exits the kernel.

The bit is included in the _TIF_USER_WORK_MASK macro so that
do_notify_resume() and klp_update_patch_state() get called when the bit
is set.
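
Based on the commit message, the do_notify_resume() side of this boils
down to a hook of roughly the following shape (a sketch; the actual
signal.c hunk is truncated in this archive):

	/* in do_notify_resume(), arch/powerpc/kernel/signal.c */
	if (thread_info_flags & _TIF_PATCH_PENDING)
		klp_update_patch_state(current);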

Signed-off-by: Josh Poimboeuf 
---
 arch/powerpc/include/asm/thread_info.h | 4 +++-
 arch/powerpc/kernel/signal.c   | 4 
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/thread_info.h b/arch/powerpc/include/asm/thread_info.h
index 87e4b2d..6fc6464 100644
--- a/arch/powerpc/include/asm/thread_info.h
+++ b/arch/powerpc/include/asm/thread_info.h
@@ -92,6 +92,7 @@ static inline struct thread_info *current_thread_info(void)
   TIF_NEED_RESCHED */
 #define TIF_32BIT  4   /* 32 bit binary */
 #define TIF_RESTORE_TM 5   /* need to restore TM FP/VEC/VSX */
+#define TIF_PATCH_PENDING  6   /* pending live patching update */
 #define TIF_SYSCALL_AUDIT  7   /* syscall auditing active */
 #define TIF_SINGLESTEP 8   /* singlestepping active */
 #define TIF_NOHZ   9   /* in adaptive nohz mode */
@@ -115,6 +116,7 @@ static inline struct thread_info *current_thread_info(void)
 #define _TIF_POLLING_NRFLAG	(1<<TIF_POLLING_NRFLAG)

[PATCH v3 04/15] livepatch/x86: add TIF_PATCH_PENDING thread flag

2016-12-08 Thread Josh Poimboeuf
Add the TIF_PATCH_PENDING thread flag to enable the new livepatch
per-task consistency model for x86_64.  The bit getting set indicates
the thread has a pending patch which needs to be applied when the thread
exits the kernel.

The bit is placed in the _TIF_ALLWORK_MASK macro, which results in
exit_to_usermode_loop() calling klp_update_patch_state() when it's set.

Signed-off-by: Josh Poimboeuf 
---
 arch/x86/entry/common.c| 9 ++---
 arch/x86/include/asm/thread_info.h | 4 +++-
 2 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index bdd9cc5..16a51a5 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -21,6 +21,7 @@
 #include 
 #include 
 #include 
+#include <linux/livepatch.h>
 
 #include 
 #include 
@@ -129,14 +130,13 @@ static long syscall_trace_enter(struct pt_regs *regs)
 
 #define EXIT_TO_USERMODE_LOOP_FLAGS	\
(_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE |   \
-_TIF_NEED_RESCHED | _TIF_USER_RETURN_NOTIFY)
+_TIF_NEED_RESCHED | _TIF_USER_RETURN_NOTIFY | _TIF_PATCH_PENDING)
 
 static void exit_to_usermode_loop(struct pt_regs *regs, u32 cached_flags)
 {
/*
 * In order to return to user mode, we need to have IRQs off with
-* none of _TIF_SIGPENDING, _TIF_NOTIFY_RESUME, _TIF_USER_RETURN_NOTIFY,
-* _TIF_UPROBE, or _TIF_NEED_RESCHED set.  Several of these flags
+* none of EXIT_TO_USERMODE_LOOP_FLAGS set.  Several of these flags
 * can be set at any time on preemptable kernels if we have IRQs on,
 * so we need to loop.  Disabling preemption wouldn't help: doing the
 * work to clear some of the flags can sleep.
@@ -163,6 +163,9 @@ static void exit_to_usermode_loop(struct pt_regs *regs, u32 cached_flags)
if (cached_flags & _TIF_USER_RETURN_NOTIFY)
fire_user_return_notifiers();
 
+   if (cached_flags & _TIF_PATCH_PENDING)
+   klp_update_patch_state(current);
+
/* Disable IRQs and retry */
local_irq_disable();
 
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 1fe6043..79f4d6a 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -84,6 +84,7 @@ struct thread_info {
 #define TIF_SECCOMP8   /* secure computing */
 #define TIF_USER_RETURN_NOTIFY 11  /* notify kernel of userspace return */
 #define TIF_UPROBE 12  /* breakpointed or singlestepping */
+#define TIF_PATCH_PENDING  13  /* pending live patching update */
 #define TIF_NOTSC  16  /* TSC is not accessible in userland */
 #define TIF_IA32   17  /* IA32 compatibility process */
 #define TIF_NOHZ   19  /* in adaptive nohz mode */
@@ -107,6 +108,7 @@ struct thread_info {
 #define _TIF_SECCOMP   (1 << TIF_SECCOMP)
 #define _TIF_USER_RETURN_NOTIFY(1 << TIF_USER_RETURN_NOTIFY)
 #define _TIF_UPROBE(1 << TIF_UPROBE)
+#define _TIF_PATCH_PENDING (1 << TIF_PATCH_PENDING)
 #define _TIF_NOTSC (1 << TIF_NOTSC)
 #define _TIF_IA32  (1 << TIF_IA32)
 #define _TIF_NOHZ  (1 << TIF_NOHZ)
@@ -133,7 +135,7 @@ struct thread_info {
(_TIF_SYSCALL_TRACE | _TIF_NOTIFY_RESUME | _TIF_SIGPENDING |\
 _TIF_SINGLESTEP | _TIF_NEED_RESCHED | _TIF_SYSCALL_EMU |   \
 _TIF_SYSCALL_AUDIT | _TIF_USER_RETURN_NOTIFY | _TIF_UPROBE |   \
-_TIF_SYSCALL_TRACEPOINT | _TIF_NOHZ)
+_TIF_SYSCALL_TRACEPOINT | _TIF_NOHZ | _TIF_PATCH_PENDING)
 
 /* flags to check in __switch_to() */
 #define _TIF_WORK_CTXSW	\
-- 
2.7.4



[PATCH v3 02/15] x86/entry: define _TIF_ALLWORK_MASK flags explicitly

2016-12-08 Thread Josh Poimboeuf
The _TIF_ALLWORK_MASK macro automatically includes the least-significant
16 bits of the thread_info flags, which is less than obvious and tends
to create confusion and surprises when reading or modifying the code.

Define the flags explicitly.

Signed-off-by: Josh Poimboeuf 
---
 arch/x86/include/asm/thread_info.h | 9 -
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index ad6f5eb0..1fe6043 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -73,9 +73,6 @@ struct thread_info {
  * thread information flags
  * - these are process state flags that various assembly files
  *   may need to access
- * - pending work-to-be-done flags are in LSW
- * - other flags in MSW
- * Warning: layout of LSW is hardcoded in entry.S
  */
 #define TIF_SYSCALL_TRACE  0   /* syscall trace active */
 #define TIF_NOTIFY_RESUME  1   /* callback before returning to user */
@@ -133,8 +130,10 @@ struct thread_info {
 
 /* work to do on any return to user space */
 #define _TIF_ALLWORK_MASK  \
-   ((0x0000FFFF & ~_TIF_SECCOMP) | _TIF_SYSCALL_TRACEPOINT |   \
-   _TIF_NOHZ)
+   (_TIF_SYSCALL_TRACE | _TIF_NOTIFY_RESUME | _TIF_SIGPENDING |\
+_TIF_SINGLESTEP | _TIF_NEED_RESCHED | _TIF_SYSCALL_EMU |   \
+_TIF_SYSCALL_AUDIT | _TIF_USER_RETURN_NOTIFY | _TIF_UPROBE |   \
+_TIF_SYSCALL_TRACEPOINT | _TIF_NOHZ)
 
 /* flags to check in __switch_to() */
 #define _TIF_WORK_CTXSW	\
-- 
2.7.4



[PATCH v3 01/15] stacktrace/x86: add function for detecting reliable stack traces

2016-12-08 Thread Josh Poimboeuf
For live patching and possibly other use cases, a stack trace is only
useful if it can be assured that it's completely reliable.  Add a new
save_stack_trace_tsk_reliable() function to achieve that.

Scenarios which indicate that a stack trace may be unreliable:

- running task
- interrupt stack
- preemption
- corrupted stack data
- stack grows the wrong way
- stack walk doesn't reach the bottom
- user didn't provide a large enough entries array

Also add CONFIG_HAVE_RELIABLE_STACKTRACE so arch-independent code can
determine at build time whether the function is implemented.
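
A consumer such as the livepatch consistency model would call it
roughly as follows (a sketch; the entries array size and the
surrounding logic are assumptions here, the real caller arrives later
in the series):

	static int klp_check_task_stack(struct task_struct *task)
	{
		/* static buffer: assumes callers are serialized */
		static unsigned long entries[128];
		struct stack_trace trace = {
			.entries	= entries,
			.max_entries	= ARRAY_SIZE(entries),
		};

		if (!IS_ENABLED(CONFIG_HAVE_RELIABLE_STACKTRACE))
			return -ENOSYS;

		if (save_stack_trace_tsk_reliable(task, &trace))
			return -EBUSY;	/* unreliable stack, retry later */

		/* ... compare trace.entries[] against patched funcs ... */
		return 0;
	}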

Signed-off-by: Josh Poimboeuf 
---
 arch/Kconfig   |  6 +
 arch/x86/Kconfig   |  1 +
 arch/x86/include/asm/unwind.h  |  6 +
 arch/x86/kernel/stacktrace.c   | 59 +-
 arch/x86/kernel/unwind_frame.c |  1 +
 include/linux/stacktrace.h |  8 +++---
 kernel/stacktrace.c| 12 +++--
 7 files changed, 87 insertions(+), 6 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 13f27c1..d61a133 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -678,6 +678,12 @@ config HAVE_STACK_VALIDATION
  Architecture supports the 'objtool check' host tool command, which
  performs compile-time stack metadata validation.
 
+config HAVE_RELIABLE_STACKTRACE
+   bool
+   help
+ Architecture has a save_stack_trace_tsk_reliable() function which
+ only returns a stack trace if it can guarantee the trace is reliable.
+
 config HAVE_ARCH_HASH
bool
default n
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 215612c..b4a6663 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -155,6 +155,7 @@ config X86
select HAVE_PERF_REGS
select HAVE_PERF_USER_STACK_DUMP
select HAVE_REGS_AND_STACK_ACCESS_API
+   select HAVE_RELIABLE_STACKTRACE if X86_64 && FRAME_POINTER && STACK_VALIDATION
select HAVE_STACK_VALIDATIONif X86_64
select HAVE_SYSCALL_TRACEPOINTS
select HAVE_UNSTABLE_SCHED_CLOCK
diff --git a/arch/x86/include/asm/unwind.h b/arch/x86/include/asm/unwind.h
index c5a7f3a..44f86dc 100644
--- a/arch/x86/include/asm/unwind.h
+++ b/arch/x86/include/asm/unwind.h
@@ -11,6 +11,7 @@ struct unwind_state {
unsigned long stack_mask;
struct task_struct *task;
int graph_idx;
+   bool error;
 #ifdef CONFIG_FRAME_POINTER
unsigned long *bp;
struct pt_regs *regs;
@@ -40,6 +41,11 @@ void unwind_start(struct unwind_state *state, struct task_struct *task,
__unwind_start(state, task, regs, first_frame);
 }
 
+static inline bool unwind_error(struct unwind_state *state)
+{
+   return state->error;
+}
+
 #ifdef CONFIG_FRAME_POINTER
 
 static inline
diff --git a/arch/x86/kernel/stacktrace.c b/arch/x86/kernel/stacktrace.c
index 0653788..3e0cf5e 100644
--- a/arch/x86/kernel/stacktrace.c
+++ b/arch/x86/kernel/stacktrace.c
@@ -74,6 +74,64 @@ void save_stack_trace_tsk(struct task_struct *tsk, struct stack_trace *trace)
 }
 EXPORT_SYMBOL_GPL(save_stack_trace_tsk);
 
+#ifdef CONFIG_HAVE_RELIABLE_STACKTRACE
+static int __save_stack_trace_reliable(struct stack_trace *trace,
+  struct task_struct *task)
+{
+   struct unwind_state state;
+   struct pt_regs *regs;
+   unsigned long addr;
+
+   for (unwind_start(&state, task, NULL, NULL); !unwind_done(&state);
+unwind_next_frame(&state)) {
+
+   regs = unwind_get_entry_regs(&state);
+   if (regs) {
+   /*
+* Preemption and page faults on the stack can make
+* frame pointers unreliable.
+*/
+   if (!user_mode(regs))
+   return -1;
+
+   /*
+* This frame contains the (user mode) pt_regs at the
+* end of the stack.  Finish the unwind.
+*/
+   unwind_next_frame(&state);
+   break;
+   }
+
+   addr = unwind_get_return_address(&state);
+   if (!addr || save_stack_address(trace, addr, false))
+   return -1;
+   }
+
+   if (!unwind_done(&state) || unwind_error(&state))
+   return -1;
+
+   if (trace->nr_entries < trace->max_entries)
+   trace->entries[trace->nr_entries++] = ULONG_MAX;
+
+   return 0;
+}
+
+int save_stack_trace_tsk_reliable(struct task_struct *tsk,
+ struct stack_trace *trace)
+{
+   int ret;
+
+   if (!try_get_task_stack(tsk))
+   return -EINVAL;
+
+   ret = __save_stack_trace_reliable(trace, tsk);
+
+   put_task_stack(tsk);
+
+   return ret;
+}
+#endif /* CONFIG_HAVE_RELIABLE_STACKTRACE */
+
 /* Userspace stacktrace - based on kernel/trace/trace_sysprof.c */
 
 struct 

[PATCH v3 00/15] livepatch: hybrid consistency model

2016-12-08 Thread Josh Poimboeuf
Dusting the cobwebs off the consistency model again.  This is based on
linux-next/master.

v1 was posted on 2015-02-09:

  https://lkml.kernel.org/r/cover.1423499826.git.jpoim...@redhat.com

v2 was posted on 2016-04-28:

  https://lkml.kernel.org/r/cover.1461875890.git.jpoim...@redhat.com

The biggest issue from v2 was finding a decent way to detect preemption
and page faults on the stack of a sleeping task.  That problem was
solved by rewriting the x86 stack unwinder.  The new unwinder helps
detect such cases by finding all pt_regs on the stack.  When
preemption/page faults are detected, the stack is considered unreliable
and the patching of the task is deferred.

For more details about the consistency model, see patch 13/15.

---

v3:
- rebase on new x86 unwinder
- force !HAVE_RELIABLE_STACKTRACE arches to use patch->immediate for
  now, because we don't have a way to transition kthreads otherwise
- rebase s390 TIF_PATCH_PENDING patch onto latest entry code
- update barrier comments and move barrier from the end of
  klp_init_transition() to its callers
- "klp_work" -> "klp_transition_work"
- "klp_patch_task()" -> "klp_update_patch_state()"
- explicit _TIF_ALLWORK_MASK
- change klp_reverse_transition() to not try to complete transition.
  instead modify the work queue delay to zero.
- get rid of klp_schedule_work() in favor of calling
  schedule_delayed_work() directly with a KLP_TRANSITION_DELAY
- initialize klp_target_state to KLP_UNDEFINED
- move klp_target_state assignment to before patch->immediate check in
  klp_init_transition()
- rcu_read_lock() in klp_update_patch_state(), test the thread flag in
  patch task, synchronize_rcu() in klp_complete_transition()
- use kstrtobool() in enabled_store()
- change task_rq_lock() argument type to struct rq_flags
- add several WARN_ON_ONCE assertions for klp_target_state and
  task->patch_state

v2:
- "universe" -> "patch state"
- rename klp_update_task_universe() -> klp_patch_task()
- add preempt IRQ tracking (TF_PREEMPT_IRQ)
- fix print_context_stack_reliable() bug
- improve print_context_stack_reliable() comments
- klp_ftrace_handler comment fixes
- add "patch_state" proc file to tid_base_stuff
- schedule work even for !RELIABLE_STACKTRACE
- forked child inherits patch state from parent
- add detailed comment to livepatch.h klp_func definition about the
  klp_func patched/transition state transitions
- update exit_to_usermode_loop() comment
- clear all TIF_KLP_NEED_UPDATE flags in klp_complete_transition()
- remove unnecessary function externs
- add livepatch documentation, sysfs documentation, /proc documentation
- /proc/pid/patch_state: -1 means no patch is currently being applied/reverted
- "TIF_KLP_NEED_UPDATE" -> "TIF_PATCH_PENDING"
- support for s390 and powerpc-le
- don't assume stacks with dynamic ftrace trampolines are reliable
- add _TIF_ALLWORK_MASK info to commit log

v1.9:
- revive from the dead and rebased
- reliable stacks!
- add support for immediate consistency model
- add a ton of comments
- fix up memory barriers
- remove "allow patch modules to be removed" patch for now, it still 
  needs more discussion and thought - it can be done with something
- "proc/pid/universe" -> "proc/pid/patch_status"
- remove WARN_ON_ONCE from !func condition in ftrace handler -- can
  happen because of RCU
- keep klp_mutex private by putting the work_fn in core.c
- convert states from int to boolean
- remove obsolete '@state' comments
- several header file and include improvements suggested by Jiri S
- change kallsyms_lookup_size_offset() errors from EINVAL -> ENOENT
- change proc file permissions S_IRUGO -> USR
- use klp_for_each_object/func helpers

---

Jiri Slaby (1):
  livepatch/s390: reorganize TIF thread flag bits

Josh Poimboeuf (12):
  stacktrace/x86: add function for detecting reliable stack traces
  x86/entry: define _TIF_ALLWORK_MASK flags explicitly
  livepatch: temporary stubs for klp_patch_pending() and
klp_update_patch_state()
  livepatch/x86: add TIF_PATCH_PENDING thread flag
  livepatch/powerpc: add TIF_PATCH_PENDING thread flag
  livepatch: separate enabled and patched states
  livepatch: remove unnecessary object loaded check
  livepatch: move patching functions into patch.c
  livepatch: use kstrtobool() in enabled_store()
  livepatch: store function sizes
  livepatch: change to a per-task consistency model
  livepatch: add /proc/<pid>/patch_state

Miroslav Benes (2):
  livepatch/s390: add TIF_PATCH_PENDING thread flag
  livepatch: allow removal of a disabled patch

 Documentation/ABI/testing/sysfs-kernel-livepatch |   8 +
 Documentation/filesystems/proc.txt   |  18 +
 Documentation/livepatch/livepatch.txt| 156 ++--
 arch/Kconfig |   6 +
 arch/powerpc/include/asm/thread_info.h   |   4 +-
 arch/powerpc/kernel/signal.c |   4 +
 arch/s390/include/asm/thread_info.h  |  24 +-
 arch/s390/kernel/entry.S |  31 +-
 

Re: [PATCH 3/3] powerpc: enable support for GCC plugins

2016-12-08 Thread Kees Cook
On Thu, Dec 8, 2016 at 6:42 AM, PaX Team  wrote:
> On 6 Dec 2016 at 17:28, Andrew Donnellan wrote:
>
>> Enable support for GCC plugins on powerpc.
>>
>> Add an additional version check in gcc-plugins-check to advise users to
>> upgrade to gcc 5.2+ on powerpc to avoid issues with header files (gcc <=
>> 4.6) or missing copies of rs6000-cpus.def (4.8 to 5.1 on 64-bit targets).
>
> i don't think that this is the right approach. there's a general and a special
> issue here, both of which need different handling.
>
> the general problem is to detect problems related to gcc plugin headers and
> notify the users about solutions. emitting various messages from a Makefile
> is certainly not a scalable approach, just imagine how it will look when the
> other 30+ archs begin to add their own special cases... if anything, they
> should be documented in Documentation/gcc-plugins.txt (or a new doc if it
> grows too big) and the Makefile message should just point at it.
>
> as for the solutions, the general advice should enable the use of otherwise
> failing gcc versions instead of forcing updating to new ones (though the
> latter is advisable for other reasons but not everyone's in the position to
> do so easily). in my experience all one needs to do is manually install the
> missing files from the gcc sources (ideally distros would take care of it).
>
> the specific problem addressed here can (and IMHO should) be solved in
> another way: remove the inclusion of the offending headers in gcc-common.h
> as neither tm.h nor c-common.h are needed by existing plugins. for background,
> i created gcc-common.h to simplify plugin development across all supportable
> gcc versions i came across over the years, so it follows the 'everything but
> the kitchen sink' approach. that isn't necessarily what the kernel and other
> projects need so they should just use my version as a basis and fork/simplify
> it (even i maintain private forks of the public version).

If removing those will lower the requirement for PPC, that would be
ideal. Otherwise, I'd like to take the practical approach of making
the plugins available on PPC right now, with an eye towards relaxing
the version requirement as people need it.

> as for the location of c-common.h, upstream gcc moved it under c-family in
> 2010 after the release of 4.5, so it should be where gcc-common.h expects
> it and i'm not sure how it ended up at its old location for you.

That is rather odd. What distro was the PPC test done on? (Or were
these manually built gcc versions?)

-Kees

-- 
Kees Cook
Nexus Security


Re: [PATCH kernel 9/9] KVM: PPC: Add in-kernel acceleration for VFIO

2016-12-08 Thread Alex Williamson
On Thu,  8 Dec 2016 19:19:56 +1100
Alexey Kardashevskiy  wrote:

> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
> and H_STUFF_TCE requests targeted at an IOMMU TCE table used for VFIO
> without passing them to user space which saves time on switching
> to user space and back.
> 
> This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
> KVM tries to handle a TCE request in the real mode, if failed
> it passes the request to the virtual mode to complete the operation.
> If the virtual mode handler fails, the request is passed to
> the user space; this is not expected to happen though.
> 
> To avoid dealing with page use counters (which is tricky in real mode),
> this only accelerates SPAPR TCE IOMMU v2 clients which are required
> to pre-register the userspace memory. The very first TCE request will
> be handled in the VFIO SPAPR TCE driver anyway as the userspace view
> of the TCE table (iommu_table::it_userspace) is not allocated till
> the very first mapping happens and we cannot call vmalloc in real mode.
> 
> This adds new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to
> the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd
> and associates a physical IOMMU table with the SPAPR TCE table (which
> is a guest view of the hardware IOMMU table). The iommu_table object
> is referenced so we do not have to retrieve it in real mode when a hypercall
> happens.
> 
> This does not implement the UNSET counterpart as there is no use for it -
> once the acceleration is enabled, the existing userspace won't
> disable it unless a VFIO container is destroyed so this adds necessary
> cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.
> 
> This uses the kvm->lock mutex to protect against a race between
> the VFIO KVM device's kvm_vfio_destroy() and SPAPR TCE table fd's
> release() callback.
> 
> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
> space.
> 
> This finally makes use of vfio_external_user_iommu_id() which was
> introduced quite some time ago and was considered for removal.
> 
> Tests show that this patch increases transmission speed from 220MB/s
> to 750..1020MB/s on 10Gb network (Chelsio CXGB3 10Gb ethernet card).
> 
> Signed-off-by: Alexey Kardashevskiy 
> ---
>  Documentation/virtual/kvm/devices/vfio.txt |  21 +-
>  arch/powerpc/include/asm/kvm_host.h|   8 +
>  arch/powerpc/include/asm/kvm_ppc.h |   5 +
>  include/uapi/linux/kvm.h   |   8 +
>  arch/powerpc/kvm/book3s_64_vio.c   | 302 
> +
>  arch/powerpc/kvm/book3s_64_vio_hv.c| 178 +
>  arch/powerpc/kvm/powerpc.c |   2 +
>  virt/kvm/vfio.c| 108 +++
>  8 files changed, 630 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/virtual/kvm/devices/vfio.txt b/Documentation/virtual/kvm/devices/vfio.txt
> index ef51740c67ca..ddb5a6512ab3 100644
> --- a/Documentation/virtual/kvm/devices/vfio.txt
> +++ b/Documentation/virtual/kvm/devices/vfio.txt
> @@ -16,7 +16,24 @@ Groups:
>  
>  KVM_DEV_VFIO_GROUP attributes:
>KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
> + kvm_device_attr.addr points to an int32_t file descriptor
> + for the VFIO group.
>KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device tracking
> + kvm_device_attr.addr points to an int32_t file descriptor
> + for the VFIO group.
> +  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: attaches a guest visible TCE table
> + allocated by sPAPR KVM.
> + kvm_device_attr.addr points to a struct:
>  
> -For each, kvm_device_attr.addr points to an int32_t file descriptor
> -for the VFIO group.
> + struct kvm_vfio_spapr_tce {
> + __u32   argsz;
> + __s32   groupfd;
> + __s32   tablefd;
> + __u8pad[4];
> + };
> +
> + where
> + @argsz is the size of kvm_vfio_spapr_tce_liobn;
> + @groupfd is a file descriptor for a VFIO group;
> + @tablefd is a file descriptor for a TCE table allocated via
> + KVM_CREATE_SPAPR_TCE.
> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
> index 28350a294b1e..94774503c70d 100644
> --- a/arch/powerpc/include/asm/kvm_host.h
> +++ b/arch/powerpc/include/asm/kvm_host.h
> @@ -191,6 +191,13 @@ struct kvmppc_pginfo {
>   atomic_t refcnt;
>  };
>  
> +struct kvmppc_spapr_tce_iommu_table {
> + struct rcu_head rcu;
> + struct list_head next;
> + struct iommu_table *tbl;
> + atomic_t refs;
> +};
> +
>  struct kvmppc_spapr_tce_table {
>   struct list_head list;
>   struct kvm *kvm;
> @@ -199,6 +206,7 @@ struct kvmppc_spapr_tce_table {
>   u32 page_shift;
>   u64 offset; /* in pages */
>   u64 size;   /* window size in pages */
> + struct list_head iommu_tables;
>   struct page *pages[0];
>  };
>  
> diff 

Re: [PATCH 3/3] powerpc: enable support for GCC plugins

2016-12-08 Thread PaX Team
On 6 Dec 2016 at 17:28, Andrew Donnellan wrote:

> Enable support for GCC plugins on powerpc.
> 
> Add an additional version check in gcc-plugins-check to advise users to
> upgrade to gcc 5.2+ on powerpc to avoid issues with header files (gcc <=
> 4.6) or missing copies of rs6000-cpus.def (4.8 to 5.1 on 64-bit targets).

i don't think that this is the right approach. there's a general and a special
issue here, both of which need different handling.

the general problem is to detect problems related to gcc plugin headers and
notify the users about solutions. emitting various messages from a Makefile
is certainly not a scalable approach, just imagine how it will look when the
other 30+ archs begin to add their own special cases... if anything, they
should be documented in Documentation/gcc-plugins.txt (or a new doc if it
grows too big) and the Makefile message should just point at it.

as for the solutions, the general advice should enable the use of otherwise
failing gcc versions instead of forcing updating to new ones (though the
latter is advisable for other reasons but not everyone's in the position to
do so easily). in my experience all one needs to do is manually install the
missing files from the gcc sources (ideally distros would take care of it).

the specific problem addressed here can (and IMHO should) be solved in
another way: remove the inclusion of the offending headers in gcc-common.h
as neither tm.h nor c-common.h are needed by existing plugins. for background,
i created gcc-common.h to simplify plugin development across all supportable
gcc versions i came across over the years, so it follows the 'everything but
the kitchen sink' approach. that isn't necessarily what the kernel and other
projects need so they should just use my version as a basis and fork/simplify
it (even i maintain private forks of the public version).

as for the location of c-common.h, upstream gcc moved it under c-family in
2010 after the release of 4.5, so it should be where gcc-common.h expects
it and i'm not sure how it ended up at its old location for you.

cheers,
 PaX Team



Re: [PATCH] ibmvscsi: add write memory barrier to CRQ processing

2016-12-08 Thread Brian King
Reviewed-by: Brian King 

-- 
Brian King
Power Linux I/O
IBM Linux Technology Center



Re: linux-next: build failure in the powerpc allyesconfig build

2016-12-08 Thread Arnd Bergmann
On Monday, December 5, 2016 4:22:04 PM CET Stephen Rothwell wrote:
> Hi all,
> 
> After merging everything but Andrew's tree, today's linux-next build
> (powerpc allyesconfig) failed like this:
> 
> kallsyms failure: relative symbol value 0xc000 out of range in relative mode
> 
> I have no idea what caused this, so I have left the powerpc allyesconfig
> build broken for now.

I get this on an x86-64 randconfig build:

kallsyms failure: relative symbol value 0x8100 out of range in relative mode

This is probably related, so it's not something powerpc specific.

Arnd


Re: 4.9.0-rc8 - rcutorture test failure

2016-12-08 Thread Paul E. McKenney
On Thu, Dec 08, 2016 at 11:54:15AM +0530, Sachin Sant wrote:
> RCU Torture test on powerpc fails during its run against latest mainline
> (4.9.0-rc8) tree.
> 
> 07:58:25 BUG: rcutorture tests failed !
> 07:58:25 21:31:00 ERROR| child process failed
> 07:58:25 21:31:00 INFO |  ERROR   rcutorture  rcutorture  
> timestamp=1481164260localtime=Dec 07 21:31:00   
> 07:58:25   BUG: rcutorture tests failed !
> 07:58:25 21:31:00 INFO |  END ERROR   rcutorture  rcutorture  
> timestamp=1481164260localtime=Dec 07 21:31:00   
> 
> I have attached complete rcutorture run log.

Thank you for running this, Sachin!

But I am not seeing this as a failure.  The last status print from the
log you attached is as follows:

07:58:25 [ 2778.876118] rcu-torture: rtc:   (null) ver: 24968 tfle: 0 
rta: 24968 rtaf: 0 rtf: 24959 rtmbe: 0 rtbe: 0 rtbke: 0 rtbre: 0 rtbf: 0 rtb: 0 
nt: 10218404 onoff: 0/0:0/0 -1,0:-1,0 0:0 (HZ=250) barrier: 0/0:0 cbflood: 22703
07:58:25 [ 2778.876251] rcu-torture: Reader Pipe:  161849976604 399197 0 0 0 0 
0 0 0 0 0
07:58:25 [ 2778.876438] rcu-torture: Reader Batch:  145090807711 16759538163 0 
0 0 0 0 0 0 0 0
07:58:25 [ 2778.876625] rcu-torture: Free-Block Circulation:  24967 24967 24966 
24965 24964 24963 24962 24961 24960 24959 0
07:58:25 [ 2778.876829] rcu-torture:--- End of test: SUCCESS: nreaders=79 
nfakewriters=4 stat_interval=60 verbose=1 test_no_idle_hz=1 shuffle_interval=3 
stutter=5 irqreader=1 fqs_duration=0 fqs_holdoff=0 fqs_stutter=3 test_boost=1/0 
test_boost_interval=7 test_boost_duration=4 shutdown_secs=0 stall_cpu=0 
stall_cpu_holdoff=10 n_barrier_cbs=0 onoff_interval=0 onoff_holdoff=0

The "SUCCESS" indicates that rcutorture thought that it succeeded.
Also, in the "Reader Pipe" and "Reader Batch" lines, only the first two
numbers in the series at the end of each line are non-zero, which also
indicates a non-broken RCU.

So could you please let me know what your scripting didn't like about
this log?

Thanx, Paul



Full log:

07:19:04 20:51:39 INFO | Test: running rcutorture tests
07:19:04 20:51:39 INFO | rcutorture
07:19:05 20:51:40 INFO |START   rcutorture  rcutorture  
timestamp=1481161900localtime=Dec 07 20:51:40   
07:19:05 20:51:40 INFO | Check if CONFIG_RCU_TORTURE_TEST is enabled
07:19:05 
07:19:05 [  418.897476] rcu-torture:--- Start of test: nreaders=79 
nfakewriters=4 stat_interval=60 verbose=1 test_no_idle_hz=1 shuffle_interval=3 
stutter=5 irqreader=1 fqs_duration=0 fqs_holdoff=0 fqs_stutter=3 test_boost=1/0 
test_boost_interval=7 test_boost_duration=4 shutdown_secs=0 stall_cpu=0 
stall_cpu_holdoff=10 n_barrier_cbs=0 onoff_interval=0 onoff_holdoff=0
07:19:05 [  418.897843] rcu-torture: Creating rcu_torture_writer task
07:19:05 [  418.897935] rcu-torture: Creating rcu_torture_fakewriter task
07:19:05 [  418.897941] rcu-torture: rcu_torture_writer task started
07:19:05 [  418.898074] rcu-torture: Creating rcu_torture_fakewriter task
07:19:05 [  418.898079] rcu-torture: rcu_torture_fakewriter task started
07:19:05 [  418.898238] rcu-torture: Creating rcu_torture_fakewriter task
07:19:05 [  418.898242] rcu-torture: rcu_torture_fakewriter task started
07:19:05 [  418.898412] rcu-torture: Creating rcu_torture_fakewriter task
07:19:05 [  418.898414] rcu-torture: rcu_torture_fakewriter task started
07:19:05 [  418.898566] rcu-torture: Creating rcu_torture_reader task
07:19:05 [  418.898569] rcu-torture: rcu_torture_fakewriter task started
07:19:05 [  418.898711] rcu-torture: Creating rcu_torture_reader task
07:19:05 [  418.898714] rcu-torture: rcu_torture_reader task started
07:19:05 [  418.898840] rcu-torture: Creating rcu_torture_reader task
07:19:05 [  418.898843] rcu-torture: rcu_torture_reader task started
07:19:05 [  418.898970] rcu-torture: Creating rcu_torture_reader task
07:19:05 [  418.898973] rcu-torture: rcu_torture_reader task started
07:19:05 [  418.899099] rcu-torture: Creating rcu_torture_reader task
07:19:05 [  418.899101] rcu-torture: rcu_torture_reader task started
07:19:05 [  418.899227] rcu-torture: Creating rcu_torture_reader task
07:19:05 [  418.899230] rcu-torture: rcu_torture_reader task started
07:19:05 [  418.899357] rcu-torture: Creating rcu_torture_reader task
07:19:05 [  418.899360] rcu-torture: rcu_torture_reader task started
07:19:05 [  418.899485] rcu-torture: Creating rcu_torture_reader task
07:19:05 [  418.899488] rcu-torture: rcu_torture_reader task started
07:19:05 [  418.899630] rcu-torture: Creating rcu_torture_reader task
07:19:05 [  418.899633] rcu-torture: rcu_torture_reader task started
07:19:05 [  418.899789] rcu-torture: Creating rcu_torture_reader task
07:19:05 [  418.899790] rcu-torture: rcu_torture_reader task started
07:19:05 [  418.899937] rcu-torture: Creating rcu_torture_reader task
07:19:05 [  418.899940] rcu-torture: 

Re: [PATCH] ibmvscsi: add write memory barrier to CRQ processing

2016-12-08 Thread Johannes Thumshirn
On Wed, Dec 07, 2016 at 05:31:26PM -0600, Tyrel Datwyler wrote:
> The first byte of each CRQ entry is used to indicate whether an entry is
> a valid response or free for the VIOS to use. After processing a
> response the driver sets the valid byte to zero to indicate the entry is
> now free to be reused. Add a memory barrier after this write to ensure
> no other stores are reordered when updating the valid byte.
> 
> Signed-off-by: Tyrel Datwyler 
> ---
>  drivers/scsi/ibmvscsi/ibmvscsi.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/scsi/ibmvscsi/ibmvscsi.c b/drivers/scsi/ibmvscsi/ibmvscsi.c
> index d9534ee..2f5b07e 100644
> --- a/drivers/scsi/ibmvscsi/ibmvscsi.c
> +++ b/drivers/scsi/ibmvscsi/ibmvscsi.c
> @@ -232,6 +232,7 @@ static void ibmvscsi_task(void *data)
>   while ((crq = crq_queue_next_crq(&hostdata->queue)) != NULL) {
>   ibmvscsi_handle_crq(crq, hostdata);
>   crq->valid = VIOSRP_CRQ_FREE;
> + wmb();
>   }
>  
>   vio_enable_interrupts(vdev);
> @@ -240,6 +241,7 @@ static void ibmvscsi_task(void *data)
>   vio_disable_interrupts(vdev);
>   ibmvscsi_handle_crq(crq, hostdata);
>   crq->valid = VIOSRP_CRQ_FREE;
> + wmb();
>   } else {
>   done = 1;
>   }

Is this something you have seen in the wild or just a "better safe than sorry"
barrier?

Thanks,
Johannes
-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: [PATCH 2/2] ibmvscsi: log bad SRP response opcode in hex format

2016-12-08 Thread Johannes Thumshirn
On Wed, Dec 07, 2016 at 04:04:36PM -0600, Tyrel Datwyler wrote:
> An unrecognized or unsupported SRP response has its opcode currently
> logged in decimal format. Log it in hex format instead so it can easily
> be validated against the SRP specs values which are in hex.
> 
> Signed-off-by: Tyrel Datwyler 
> ---

Looks good,
Reviewed-by: Johannes Thumshirn 

-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: [PATCH 1/2] ibmvscsi: add vscsi hosts to global list_head

2016-12-08 Thread Johannes Thumshirn
On Wed, Dec 07, 2016 at 04:04:35PM -0600, Tyrel Datwyler wrote:
> Add each vscsi host adapter to a new global list_head named
> ibmvscsi_head. There is no functional change. This is meant primarily
> as a convenience for locating adapters from within the debugger or crash
> utility.
> 
> Signed-off-by: Tyrel Datwyler 
> ---

Looks good,
Reviewed-by: Johannes Thumshirn 

-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


[PATCH kernel 7/9] KVM: PPC: Enable IOMMU_API for KVM_BOOK3S_64 permanently

2016-12-08 Thread Alexey Kardashevskiy
It does not make much sense to have KVM in book3s-64 and
not to have IOMMU bits for PCI pass through support as it costs little
and allows VFIO to function on book3s KVM.

Having IOMMU_API always enabled makes it unnecessary to have a lot of
"#ifdef IOMMU_API" in arch/powerpc/kvm/book3s_64_vio*. With those
ifdefs we could accelerate only user space emulated devices
(but not VFIO), which does not seem very useful.

Signed-off-by: Alexey Kardashevskiy 
---
 arch/powerpc/kvm/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
index 029be26b5a17..65a471de96de 100644
--- a/arch/powerpc/kvm/Kconfig
+++ b/arch/powerpc/kvm/Kconfig
@@ -67,6 +67,7 @@ config KVM_BOOK3S_64
select KVM_BOOK3S_64_HANDLER
select KVM
select KVM_BOOK3S_PR_POSSIBLE if !KVM_BOOK3S_HV_POSSIBLE
+   select SPAPR_TCE_IOMMU if IOMMU_SUPPORT
---help---
  Support running unmodified book3s_64 and book3s_32 guest kernels
  in virtual machines on book3s_64 host processors.
-- 
2.11.0



[PATCH kernel 6/9] powerpc/powernv/iommu: Add real mode version of iommu_table_ops::exchange()

2016-12-08 Thread Alexey Kardashevskiy
In real mode, TCE tables are invalidated using special
cache-inhibited store instructions which are not available in
virtual mode.

This defines and implements exchange_rm() callback. This does not
define set_rm/clear_rm/flush_rm callbacks as there is no user for those -
exchange/exchange_rm are only to be used by KVM for VFIO.

The exchange_rm callback is defined for IODA1/IODA2 powernv platforms.

This replaces list_for_each_entry_rcu with its lockless version as
from now on pnv_pci_ioda2_tce_invalidate() can be called in
the real mode too.

Signed-off-by: Alexey Kardashevskiy 
---
 arch/powerpc/include/asm/iommu.h  |  7 +++
 arch/powerpc/kernel/iommu.c   | 23 +++
 arch/powerpc/platforms/powernv/pci-ioda.c | 26 +-
 3 files changed, 55 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 9de8bad1fdf9..82e77ebf85f4 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -64,6 +64,11 @@ struct iommu_table_ops {
long index,
unsigned long *hpa,
enum dma_data_direction *direction);
+   /* Real mode */
+   int (*exchange_rm)(struct iommu_table *tbl,
+   long index,
+   unsigned long *hpa,
+   enum dma_data_direction *direction);
 #endif
void (*clear)(struct iommu_table *tbl,
long index, long npages);
@@ -209,6 +214,8 @@ extern void iommu_del_device(struct device *dev);
 extern int __init tce_iommu_bus_notifier_init(void);
 extern long iommu_tce_xchg(struct iommu_table *tbl, unsigned long entry,
unsigned long *hpa, enum dma_data_direction *direction);
+extern long iommu_tce_xchg_rm(struct iommu_table *tbl, unsigned long entry,
+   unsigned long *hpa, enum dma_data_direction *direction);
 #else
 static inline void iommu_register_group(struct iommu_table_group *table_group,
int pci_domain_number,
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index d12496889ce9..d02b8d22fb50 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1022,6 +1022,29 @@ long iommu_tce_xchg(struct iommu_table *tbl, unsigned long entry,
 }
 EXPORT_SYMBOL_GPL(iommu_tce_xchg);
 
+long iommu_tce_xchg_rm(struct iommu_table *tbl, unsigned long entry,
+   unsigned long *hpa, enum dma_data_direction *direction)
+{
+   long ret;
+
+   ret = tbl->it_ops->exchange_rm(tbl, entry, hpa, direction);
+
+   if (!ret && ((*direction == DMA_FROM_DEVICE) ||
+   (*direction == DMA_BIDIRECTIONAL))) {
+   struct page *pg = realmode_pfn_to_page(*hpa >> PAGE_SHIFT);
+
+   if (likely(pg)) {
+   SetPageDirty(pg);
+   } else {
+   tbl->it_ops->exchange_rm(tbl, entry, hpa, direction);
+   ret = -EFAULT;
+   }
+   }
+
+   return ret;
+}
+EXPORT_SYMBOL_GPL(iommu_tce_xchg_rm);
+
 int iommu_take_ownership(struct iommu_table *tbl)
 {
unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index ea181f02bebd..f2c2ab8fbb3e 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1855,6 +1855,17 @@ static int pnv_ioda1_tce_xchg(struct iommu_table *tbl, long index,
 
return ret;
 }
+
+static int pnv_ioda1_tce_xchg_rm(struct iommu_table *tbl, long index,
+   unsigned long *hpa, enum dma_data_direction *direction)
+{
+   long ret = pnv_tce_xchg(tbl, index, hpa, direction);
+
+   if (!ret)
+   pnv_pci_p7ioc_tce_invalidate(tbl, index, 1, true);
+
+   return ret;
+}
 #endif
 
 static void pnv_ioda1_tce_free(struct iommu_table *tbl, long index,
@@ -1869,6 +1880,7 @@ static struct iommu_table_ops pnv_ioda1_iommu_ops = {
.set = pnv_ioda1_tce_build,
 #ifdef CONFIG_IOMMU_API
.exchange = pnv_ioda1_tce_xchg,
+   .exchange_rm = pnv_ioda1_tce_xchg_rm,
 #endif
.clear = pnv_ioda1_tce_free,
.get = pnv_tce_get,
@@ -1943,7 +1955,7 @@ static void pnv_pci_ioda2_tce_invalidate(struct iommu_table *tbl,
 {
struct iommu_table_group_link *tgl;
 
-   list_for_each_entry_rcu(tgl, &tbl->it_group_list, next) {
+   list_for_each_entry_lockless(tgl, &tbl->it_group_list, next) {
struct pnv_ioda_pe *pe = container_of(tgl->table_group,
struct pnv_ioda_pe, table_group);
struct pnv_phb *phb = pe->phb;
@@ -1999,6 +2011,17 @@ static int pnv_ioda2_tce_xchg(struct iommu_table *tbl, long index,
 
return ret;
 }
+
+static int pnv_ioda2_tce_xchg_rm(struct iommu_table *tbl, long index,
+ 

[PATCH kernel 2/9] powerpc/iommu: Cleanup iommu_table disposal

2016-12-08 Thread Alexey Kardashevskiy
At the moment an iommu_table can be disposed of either by calling
iommu_free_table() directly or via it_ops::free(); the only implementation
of free() is in IODA2 - pnv_ioda2_table_free() - and it calls
iommu_free_table() anyway.

As we are going to have reference counting on tables, we need a unified
way of disposing of tables.

This moves it_ops::free() call into iommu_free_table() and makes use
of the latter. The free() callback now handles only platform-specific
data.

This should cause no behavioral change.

Signed-off-by: Alexey Kardashevskiy 
---
 arch/powerpc/kernel/iommu.c   | 4 
 arch/powerpc/platforms/powernv/pci-ioda.c | 6 ++
 drivers/vfio/vfio_iommu_spapr_tce.c   | 2 +-
 3 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 5f202a566ec5..6744a2771769 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -719,6 +719,9 @@ void iommu_free_table(struct iommu_table *tbl, const char *node_name)
if (!tbl)
return;
 
+   if (tbl->it_ops->free)
+   tbl->it_ops->free(tbl);
+
if (!tbl->it_map) {
kfree(tbl);
return;
@@ -745,6 +748,7 @@ void iommu_free_table(struct iommu_table *tbl, const char *node_name)
/* free table */
kfree(tbl);
 }
+EXPORT_SYMBOL_GPL(iommu_free_table);
 
 /* Creates TCEs for a user provided buffer.  The user buffer must be
  * contiguous real kernel storage (not vmalloc).  The address passed here
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 5fcae29107e1..c4f9e812ca6c 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1422,7 +1422,6 @@ static void pnv_pci_ioda2_release_dma_pe(struct pci_dev *dev, struct pnv_ioda_pe
iommu_group_put(pe->table_group.group);
BUG_ON(pe->table_group.group);
}
-   pnv_pci_ioda2_table_free_pages(tbl);
iommu_free_table(tbl, of_node_full_name(dev->dev.of_node));
 }
 
@@ -2013,7 +2012,6 @@ static void pnv_ioda2_tce_free(struct iommu_table *tbl, long index,
 static void pnv_ioda2_table_free(struct iommu_table *tbl)
 {
pnv_pci_ioda2_table_free_pages(tbl);
-   iommu_free_table(tbl, "pnv");
 }
 
 static struct iommu_table_ops pnv_ioda2_iommu_ops = {
@@ -2339,7 +2337,7 @@ static long pnv_pci_ioda2_setup_default_config(struct pnv_ioda_pe *pe)
if (rc) {
pe_err(pe, "Failed to configure 32-bit TCE table, err %ld\n",
rc);
-   pnv_ioda2_table_free(tbl);
+   iommu_free_table(tbl, "");
return rc;
}
 
@@ -2425,7 +2423,7 @@ static void pnv_ioda2_take_ownership(struct iommu_table_group *table_group)
 
pnv_pci_ioda2_set_bypass(pe, false);
pnv_pci_ioda2_unset_window(&pe->table_group, 0);
-   pnv_ioda2_table_free(tbl);
+   iommu_free_table(tbl, "pnv");
 }
 
 static void pnv_ioda2_release_ownership(struct iommu_table_group *table_group)
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index c8823578a1b2..cbac08af400e 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -677,7 +677,7 @@ static void tce_iommu_free_table(struct tce_container *container,
unsigned long pages = tbl->it_allocated_size >> PAGE_SHIFT;
 
tce_iommu_userspace_view_free(tbl, container->mm);
-   tbl->it_ops->free(tbl);
+   iommu_free_table(tbl, "");
decrement_locked_vm(container->mm, pages);
 }
 
-- 
2.11.0



[PATCH kernel 9/9] KVM: PPC: Add in-kernel acceleration for VFIO

2016-12-08 Thread Alexey Kardashevskiy
This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
and H_STUFF_TCE requests targeting an IOMMU TCE table used for VFIO
without passing them to user space, which saves time on switching
to user space and back.

This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
KVM tries to handle a TCE request in real mode; if that fails,
it passes the request to virtual mode to complete the operation.
If the virtual mode handler fails, the request is passed to
user space; this is not expected to happen though.

To avoid dealing with page use counters (which is tricky in real mode),
this only accelerates SPAPR TCE IOMMU v2 clients which are required
to pre-register the userspace memory. The very first TCE request will
be handled in the VFIO SPAPR TCE driver anyway as the userspace view
of the TCE table (iommu_table::it_userspace) is not allocated till
the very first mapping happens and we cannot call vmalloc in real mode.

This adds a new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to
the VFIO KVM device. It takes a VFIO group fd and an SPAPR TCE table fd
and associates a physical IOMMU table with the SPAPR TCE table (which
is a guest view of the hardware IOMMU table). The iommu_table object
is referenced so we do not have to retrieve it in real mode when a
hypercall happens.

This does not implement the UNSET counterpart as there is no use for it -
once the acceleration is enabled, the existing userspace won't
disable it unless a VFIO container is destroyed - so this adds the
necessary cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.

This uses the kvm->lock mutex to protect against a race between
the VFIO KVM device's kvm_vfio_destroy() and SPAPR TCE table fd's
release() callback.

This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
space.

This finally makes use of vfio_external_user_iommu_id() which was
introduced quite some time ago and was considered for removal.

Tests show that this patch increases transmission speed from 220MB/s
to 750..1020MB/s on a 10Gb network (Chelsio CXGB3 10Gb ethernet card).
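
From the userspace side, the new attribute is set through the usual
device-attr path, roughly as below (a sketch: the helper name is made up,
vfio_dev_fd is assumed to come from KVM_CREATE_DEVICE with
KVM_DEV_TYPE_VFIO, group_fd from VFIO, and table_fd from
KVM_CREATE_SPAPR_TCE):

#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static int kvm_vfio_set_spapr_tce(int vfio_dev_fd, int group_fd, int table_fd)
{
	struct kvm_vfio_spapr_tce param;
	struct kvm_device_attr attr;

	memset(&param, 0, sizeof(param));
	param.argsz = sizeof(param);
	param.groupfd = group_fd;
	param.tablefd = table_fd;

	memset(&attr, 0, sizeof(attr));
	attr.group = KVM_DEV_VFIO_GROUP;
	attr.attr = KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE;
	attr.addr = (unsigned long)&param;

	/* 0 on success, -1 with errno set otherwise. */
	return ioctl(vfio_dev_fd, KVM_SET_DEVICE_ATTR, &attr);
}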

Signed-off-by: Alexey Kardashevskiy 
---
 Documentation/virtual/kvm/devices/vfio.txt |  21 +-
 arch/powerpc/include/asm/kvm_host.h|   8 +
 arch/powerpc/include/asm/kvm_ppc.h |   5 +
 include/uapi/linux/kvm.h   |   8 +
 arch/powerpc/kvm/book3s_64_vio.c   | 302 +
 arch/powerpc/kvm/book3s_64_vio_hv.c| 178 +
 arch/powerpc/kvm/powerpc.c |   2 +
 virt/kvm/vfio.c| 108 +++
 8 files changed, 630 insertions(+), 2 deletions(-)

diff --git a/Documentation/virtual/kvm/devices/vfio.txt b/Documentation/virtual/kvm/devices/vfio.txt
index ef51740c67ca..ddb5a6512ab3 100644
--- a/Documentation/virtual/kvm/devices/vfio.txt
+++ b/Documentation/virtual/kvm/devices/vfio.txt
@@ -16,7 +16,24 @@ Groups:
 
 KVM_DEV_VFIO_GROUP attributes:
   KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
+   kvm_device_attr.addr points to an int32_t file descriptor
+   for the VFIO group.
   KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device tracking
+   kvm_device_attr.addr points to an int32_t file descriptor
+   for the VFIO group.
+  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: attaches a guest visible TCE table
+   allocated by sPAPR KVM.
+   kvm_device_attr.addr points to a struct:
 
-For each, kvm_device_attr.addr points to an int32_t file descriptor
-for the VFIO group.
+   struct kvm_vfio_spapr_tce {
+   __u32   argsz;
+   __s32   groupfd;
+   __s32   tablefd;
+   __u8pad[4];
+   };
+
+   where
+   @argsz is the size of struct kvm_vfio_spapr_tce;
+   @groupfd is a file descriptor for a VFIO group;
+   @tablefd is a file descriptor for a TCE table allocated via
+   KVM_CREATE_SPAPR_TCE.
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 28350a294b1e..94774503c70d 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -191,6 +191,13 @@ struct kvmppc_pginfo {
atomic_t refcnt;
 };
 
+struct kvmppc_spapr_tce_iommu_table {
+   struct rcu_head rcu;
+   struct list_head next;
+   struct iommu_table *tbl;
+   atomic_t refs;
+};
+
 struct kvmppc_spapr_tce_table {
struct list_head list;
struct kvm *kvm;
@@ -199,6 +206,7 @@ struct kvmppc_spapr_tce_table {
u32 page_shift;
u64 offset; /* in pages */
u64 size;   /* window size in pages */
+   struct list_head iommu_tables;
struct page *pages[0];
 };
 
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index 0a21c8503974..17b947a0060d 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -163,6 +163,11 @@ extern long 

[PATCH kernel 8/9] KVM: PPC: Pass kvm* to kvmppc_find_table()

2016-12-08 Thread Alexey Kardashevskiy
The guest view TCE tables are per KVM instance anyway (not per VCPU), so
pass kvm* there. This will be used in the following patches where we will
be attaching VFIO containers to LIOBNs via an ioctl() to KVM (rather than
to a VCPU).

Signed-off-by: Alexey Kardashevskiy 
---
 arch/powerpc/include/asm/kvm_ppc.h  |  2 +-
 arch/powerpc/kvm/book3s_64_vio.c|  7 ---
 arch/powerpc/kvm/book3s_64_vio_hv.c | 13 +++--
 3 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index f6e49640dbe1..0a21c8503974 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -167,7 +167,7 @@ extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
 extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
struct kvm_create_spapr_tce_64 *args);
 extern struct kvmppc_spapr_tce_table *kvmppc_find_table(
-   struct kvm_vcpu *vcpu, unsigned long liobn);
+   struct kvm *kvm, unsigned long liobn);
 extern long kvmppc_ioba_validate(struct kvmppc_spapr_tce_table *stt,
unsigned long ioba, unsigned long npages);
 extern long kvmppc_tce_validate(struct kvmppc_spapr_tce_table *tt,
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index c379ff5a4438..15df8ae627d9 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -212,12 +212,13 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
  unsigned long ioba, unsigned long tce)
 {
-   struct kvmppc_spapr_tce_table *stt = kvmppc_find_table(vcpu, liobn);
+   struct kvmppc_spapr_tce_table *stt;
long ret;
 
/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
/*  liobn, ioba, tce); */
 
+   stt = kvmppc_find_table(vcpu->kvm, liobn);
if (!stt)
return H_TOO_HARD;
 
@@ -245,7 +246,7 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
u64 __user *tces;
u64 tce;
 
-   stt = kvmppc_find_table(vcpu, liobn);
+   stt = kvmppc_find_table(vcpu->kvm, liobn);
if (!stt)
return H_TOO_HARD;
 
@@ -299,7 +300,7 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
struct kvmppc_spapr_tce_table *stt;
long i, ret;
 
-   stt = kvmppc_find_table(vcpu, liobn);
+   stt = kvmppc_find_table(vcpu->kvm, liobn);
if (!stt)
return H_TOO_HARD;
 
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index a3be4bd6188f..8a6834e6e1c8 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -49,10 +49,9 @@
  * WARNING: This will be called in real or virtual mode on HV KVM and virtual
  *  mode on PR KVM
  */
-struct kvmppc_spapr_tce_table *kvmppc_find_table(struct kvm_vcpu *vcpu,
+struct kvmppc_spapr_tce_table *kvmppc_find_table(struct kvm *kvm,
unsigned long liobn)
 {
-   struct kvm *kvm = vcpu->kvm;
struct kvmppc_spapr_tce_table *stt;
 
list_for_each_entry_lockless(stt, &kvm->arch.spapr_tce_tables, list)
@@ -194,12 +193,13 @@ static struct mm_iommu_table_group_mem_t *kvmppc_rm_iommu_lookup(
 long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
unsigned long ioba, unsigned long tce)
 {
-   struct kvmppc_spapr_tce_table *stt = kvmppc_find_table(vcpu, liobn);
+   struct kvmppc_spapr_tce_table *stt;
long ret;
 
/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
/*  liobn, ioba, tce); */
 
+   stt = kvmppc_find_table(vcpu->kvm, liobn);
if (!stt)
return H_TOO_HARD;
 
@@ -252,7 +252,7 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
unsigned long tces, entry, ua = 0;
unsigned long *rmap = NULL;
 
-   stt = kvmppc_find_table(vcpu, liobn);
+   stt = kvmppc_find_table(vcpu->kvm, liobn);
if (!stt)
return H_TOO_HARD;
 
@@ -335,7 +335,7 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
struct kvmppc_spapr_tce_table *stt;
long i, ret;
 
-   stt = kvmppc_find_table(vcpu, liobn);
+   stt = kvmppc_find_table(vcpu->kvm, liobn);
if (!stt)
return H_TOO_HARD;
 
@@ -356,12 +356,13 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
 long kvmppc_h_get_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
  unsigned long ioba)
 {
-   struct kvmppc_spapr_tce_table *stt = kvmppc_find_table(vcpu, liobn);
+   struct kvmppc_spapr_tce_table *stt;
long ret;
unsigned long idx;
struct page *page;
u64 *tbl;
 
+   stt = kvmppc_find_table(vcpu->kvm, liobn);
if (!stt)
return H_TOO_HARD;
 
-- 
2.11.0



[PATCH kernel 5/9] KVM: PPC: Use preregistered memory API to access TCE list

2016-12-08 Thread Alexey Kardashevskiy
VFIO on sPAPR already implements guest memory pre-registration
when the entire guest RAM gets pinned. This can be used to translate
the physical address of a guest page containing the TCE list
from H_PUT_TCE_INDIRECT.

This makes use of the pre-registered memory API to access TCE list
pages in order to avoid unnecessary locking on the KVM memory
reverse map, as we know that all of guest memory is pinned and
we have a flat array mapping GPA to HPA; this makes it simpler and
quicker to index into that array (even with looking up the
kernel page tables in vmalloc_to_phys) than it is to find the memslot,
lock the rmap entry, look up the user page tables, and unlock the rmap
entry. Note that the rmap pointer is initialized to NULL where declared
(not in this patch).

Signed-off-by: Alexey Kardashevskiy 
---
Changes:
v2:
* updated the commit log with Paul's comment
---
 arch/powerpc/kvm/book3s_64_vio_hv.c | 65 -
 1 file changed, 49 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index d461c440889a..a3be4bd6188f 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -180,6 +180,17 @@ long kvmppc_gpa_to_ua(struct kvm *kvm, unsigned long gpa,
 EXPORT_SYMBOL_GPL(kvmppc_gpa_to_ua);
 
 #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
+static inline bool kvmppc_preregistered(struct kvm_vcpu *vcpu)
+{
+   return mm_iommu_preregistered(vcpu->kvm->mm);
+}
+
+static struct mm_iommu_table_group_mem_t *kvmppc_rm_iommu_lookup(
+   struct kvm_vcpu *vcpu, unsigned long ua, unsigned long size)
+{
+   return mm_iommu_lookup_rm(vcpu->kvm->mm, ua, size);
+}
+
 long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
unsigned long ioba, unsigned long tce)
 {
@@ -260,23 +271,44 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
if (ret != H_SUCCESS)
return ret;
 
-   if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, &rmap))
-   return H_TOO_HARD;
+   if (kvmppc_preregistered(vcpu)) {
+   /*
+* We get here if guest memory was pre-registered which
+* is normally VFIO case and gpa->hpa translation does not
+* depend on hpt.
+*/
+   struct mm_iommu_table_group_mem_t *mem;
 
-   rmap = (void *) vmalloc_to_phys(rmap);
+   if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, NULL))
+   return H_TOO_HARD;
 
-   /*
-* Synchronize with the MMU notifier callbacks in
-* book3s_64_mmu_hv.c (kvm_unmap_hva_hv etc.).
-* While we have the rmap lock, code running on other CPUs
-* cannot finish unmapping the host real page that backs
-* this guest real page, so we are OK to access the host
-* real page.
-*/
-   lock_rmap(rmap);
-   if (kvmppc_rm_ua_to_hpa(vcpu, ua, &tces)) {
-   ret = H_TOO_HARD;
-   goto unlock_exit;
+   mem = kvmppc_rm_iommu_lookup(vcpu, ua, IOMMU_PAGE_SIZE_4K);
+   if (!mem || mm_iommu_ua_to_hpa_rm(mem, ua, &tces))
+   return H_TOO_HARD;
+   } else {
+   /*
+* This is emulated devices case.
+* We do not require memory to be preregistered in this case
+* so lock rmap and do __find_linux_pte_or_hugepte().
+*/
+   if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, &rmap))
+   return H_TOO_HARD;
+
+   rmap = (void *) vmalloc_to_phys(rmap);
+
+   /*
+* Synchronize with the MMU notifier callbacks in
+* book3s_64_mmu_hv.c (kvm_unmap_hva_hv etc.).
+* While we have the rmap lock, code running on other CPUs
+* cannot finish unmapping the host real page that backs
+* this guest real page, so we are OK to access the host
+* real page.
+*/
+   lock_rmap(rmap);
+   if (kvmppc_rm_ua_to_hpa(vcpu, ua, &tces)) {
+   ret = H_TOO_HARD;
+   goto unlock_exit;
+   }
}
 
for (i = 0; i < npages; ++i) {
@@ -290,7 +322,8 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
}
 
 unlock_exit:
-   unlock_rmap(rmap);
+   if (rmap)
+   unlock_rmap(rmap);
 
return ret;
 }
-- 
2.11.0



[PATCH kernel 4/9] powerpc/mmu: Add real mode support for IOMMU preregistered memory

2016-12-08 Thread Alexey Kardashevskiy
This makes mm_iommu_lookup() able to work in real mode by replacing
list_for_each_entry_rcu() (which can do debug checks which can fail in
real mode) with list_for_each_entry_lockless().

This adds a real mode version of mm_iommu_ua_to_hpa() which adds an
explicit vmalloc'd-to-linear address conversion.
Unlike mm_iommu_ua_to_hpa(), mm_iommu_ua_to_hpa_rm() can fail.

This changes mm_iommu_preregistered() to receive @mm because in real mode
@current does not always hold a valid pointer.

This adds a real mode version of mm_iommu_lookup() which receives @mm
(for the same reason as mm_iommu_preregistered()) and uses the
lockless version of list_for_each_entry_rcu().
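
Taken together, a real mode caller translates a userspace address in two
steps, along these lines (an illustrative helper, not part of this patch;
the actual consumer appears in the H_PUT_TCE_INDIRECT patch of this
series):

static long ua_to_hpa_rm(struct mm_struct *mm, unsigned long ua,
		unsigned long *hpa)
{
	struct mm_iommu_table_group_mem_t *mem;

	/* Lockless lookup of the pre-registered region covering @ua. */
	mem = mm_iommu_lookup_rm(mm, ua, PAGE_SIZE);
	if (!mem)
		return -EFAULT;

	/* Real mode safe translation; may fail, unlike the virtual
	 * mode mm_iommu_ua_to_hpa(). */
	return mm_iommu_ua_to_hpa_rm(mem, ua, hpa);
}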

Signed-off-by: Alexey Kardashevskiy 
---
 arch/powerpc/include/asm/mmu_context.h |  4 
 arch/powerpc/mm/mmu_context_iommu.c| 39 ++
 2 files changed, 43 insertions(+)

diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
index b9e3f0aca261..c70c8272523d 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -29,10 +29,14 @@ extern void mm_iommu_init(struct mm_struct *mm);
 extern void mm_iommu_cleanup(struct mm_struct *mm);
 extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup(struct mm_struct *mm,
unsigned long ua, unsigned long size);
+extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup_rm(
+   struct mm_struct *mm, unsigned long ua, unsigned long size);
 extern struct mm_iommu_table_group_mem_t *mm_iommu_find(struct mm_struct *mm,
unsigned long ua, unsigned long entries);
 extern long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
unsigned long ua, unsigned long *hpa);
+extern long mm_iommu_ua_to_hpa_rm(struct mm_iommu_table_group_mem_t *mem,
+   unsigned long ua, unsigned long *hpa);
 extern long mm_iommu_mapped_inc(struct mm_iommu_table_group_mem_t *mem);
 extern void mm_iommu_mapped_dec(struct mm_iommu_table_group_mem_t *mem);
 #endif
diff --git a/arch/powerpc/mm/mmu_context_iommu.c b/arch/powerpc/mm/mmu_context_iommu.c
index 104bad029ce9..631d32f5937b 100644
--- a/arch/powerpc/mm/mmu_context_iommu.c
+++ b/arch/powerpc/mm/mmu_context_iommu.c
@@ -314,6 +314,25 @@ struct mm_iommu_table_group_mem_t *mm_iommu_lookup(struct mm_struct *mm,
 }
 EXPORT_SYMBOL_GPL(mm_iommu_lookup);
 
+struct mm_iommu_table_group_mem_t *mm_iommu_lookup_rm(struct mm_struct *mm,
+   unsigned long ua, unsigned long size)
+{
+   struct mm_iommu_table_group_mem_t *mem, *ret = NULL;
+
+   list_for_each_entry_lockless(mem, &mm->context.iommu_group_mem_list,
+   next) {
+   if ((mem->ua <= ua) &&
+   (ua + size <= mem->ua +
+(mem->entries << PAGE_SHIFT))) {
+   ret = mem;
+   break;
+   }
+   }
+
+   return ret;
+}
+EXPORT_SYMBOL_GPL(mm_iommu_lookup_rm);
+
 struct mm_iommu_table_group_mem_t *mm_iommu_find(struct mm_struct *mm,
unsigned long ua, unsigned long entries)
 {
@@ -345,6 +364,26 @@ long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
 }
 EXPORT_SYMBOL_GPL(mm_iommu_ua_to_hpa);
 
+long mm_iommu_ua_to_hpa_rm(struct mm_iommu_table_group_mem_t *mem,
+   unsigned long ua, unsigned long *hpa)
+{
+   const long entry = (ua - mem->ua) >> PAGE_SHIFT;
+   void *va = &mem->hpas[entry];
+   unsigned long *pa;
+
+   if (entry >= mem->entries)
+   return -EFAULT;
+
+   pa = (void *) vmalloc_to_phys(va);
+   if (!pa)
+   return -EFAULT;
+
+   *hpa = *pa | (ua & ~PAGE_MASK);
+
+   return 0;
+}
+EXPORT_SYMBOL_GPL(mm_iommu_ua_to_hpa_rm);
+
 long mm_iommu_mapped_inc(struct mm_iommu_table_group_mem_t *mem)
 {
if (atomic64_inc_not_zero(&mem->mapped))
-- 
2.11.0



[PATCH kernel 3/9] powerpc/vfio_spapr_tce: Add reference counting to iommu_table

2016-12-08 Thread Alexey Kardashevskiy
So far iommu_table objects were only used in virtual mode and had
a single owner. We are going to change this by implementing in-kernel
acceleration of DMA mapping requests. The proposed acceleration
will handle requests in real mode and KVM will keep references to tables.

This adds a kref to iommu_table and defines new helpers to update it.
This replaces iommu_free_table() with iommu_table_put() and makes
iommu_free_table() static. iommu_table_get() is not used in this patch
but it will be in the following patch.

Since this touches prototypes, this also removes the @node_name parameter
as it has never been really useful on powernv, and carrying it through
the pseries platform code to iommu_free_table() seems quite
useless as well.

This should cause no behavioral change.
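
The intended get/put pairing looks roughly like this (hypothetical
helpers for illustration; kvmppc_spapr_tce_iommu_table is the KVM-side
consumer added later in this series):

static void stit_attach(struct kvmppc_spapr_tce_iommu_table *stit,
		struct iommu_table *tbl)
{
	iommu_table_get(tbl);	/* take a reference before publishing */
	stit->tbl = tbl;
}

static void stit_detach(struct kvmppc_spapr_tce_iommu_table *stit)
{
	/* The last put calls iommu_table_free() via kref_put(). */
	iommu_table_put(stit->tbl);
	stit->tbl = NULL;
}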

Signed-off-by: Alexey Kardashevskiy 
---
 arch/powerpc/include/asm/iommu.h  |  5 +++--
 arch/powerpc/kernel/iommu.c   | 24 +++-
 arch/powerpc/platforms/powernv/pci-ioda.c | 14 +++---
 arch/powerpc/platforms/powernv/pci.c  |  1 +
 arch/powerpc/platforms/pseries/iommu.c|  3 ++-
 arch/powerpc/platforms/pseries/vio.c  |  2 +-
 drivers/vfio/vfio_iommu_spapr_tce.c   |  2 +-
 7 files changed, 34 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 2c1d50792944..9de8bad1fdf9 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -114,6 +114,7 @@ struct iommu_table {
struct list_head it_group_list;/* List of iommu_table_group_link */
unsigned long *it_userspace; /* userspace view of the table */
struct iommu_table_ops *it_ops;
+   struct kref it_kref;
 };
 
 #define IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry) \
@@ -146,8 +147,8 @@ static inline void *get_iommu_table_base(struct device *dev)
 
 extern int dma_iommu_dma_supported(struct device *dev, u64 mask);
 
-/* Frees table for an individual device node */
-extern void iommu_free_table(struct iommu_table *tbl, const char *node_name);
+extern void iommu_table_get(struct iommu_table *tbl);
+extern void iommu_table_put(struct iommu_table *tbl);
 
 /* Initializes an iommu_table based in values set in the passed-in
  * structure
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 6744a2771769..d12496889ce9 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -711,13 +711,13 @@ struct iommu_table *iommu_init_table(struct iommu_table *tbl, int nid)
return tbl;
 }
 
-void iommu_free_table(struct iommu_table *tbl, const char *node_name)
+static void iommu_table_free(struct kref *kref)
 {
unsigned long bitmap_sz;
unsigned int order;
+   struct iommu_table *tbl;
 
-   if (!tbl)
-   return;
+   tbl = container_of(kref, struct iommu_table, it_kref);
 
if (tbl->it_ops->free)
tbl->it_ops->free(tbl);
@@ -736,7 +736,7 @@ void iommu_free_table(struct iommu_table *tbl, const char *node_name)
 
/* verify that table contains no entries */
if (!bitmap_empty(tbl->it_map, tbl->it_size))
-   pr_warn("%s: Unexpected TCEs for %s\n", __func__, node_name);
+   pr_warn("%s: Unexpected TCEs\n", __func__);
 
/* calculate bitmap size in bytes */
bitmap_sz = BITS_TO_LONGS(tbl->it_size) * sizeof(unsigned long);
@@ -748,7 +748,21 @@ void iommu_free_table(struct iommu_table *tbl, const char *node_name)
/* free table */
kfree(tbl);
 }
-EXPORT_SYMBOL_GPL(iommu_free_table);
+
+void iommu_table_get(struct iommu_table *tbl)
+{
+   kref_get(&tbl->it_kref);
+}
+EXPORT_SYMBOL_GPL(iommu_table_get);
+
+void iommu_table_put(struct iommu_table *tbl)
+{
+   if (!tbl)
+   return;
+
+   kref_put(&tbl->it_kref, iommu_table_free);
+}
+EXPORT_SYMBOL_GPL(iommu_table_put);
 
 /* Creates TCEs for a user provided buffer.  The user buffer must be
  * contiguous real kernel storage (not vmalloc).  The address passed here
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index c4f9e812ca6c..ea181f02bebd 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1422,7 +1422,7 @@ static void pnv_pci_ioda2_release_dma_pe(struct pci_dev *dev, struct pnv_ioda_pe
iommu_group_put(pe->table_group.group);
BUG_ON(pe->table_group.group);
}
-   iommu_free_table(tbl, of_node_full_name(dev->dev.of_node));
+   iommu_table_put(tbl);
 }
 
 static void pnv_ioda_release_vf_PE(struct pci_dev *pdev)
@@ -2197,7 +2197,7 @@ static void pnv_pci_ioda1_setup_dma_pe(struct pnv_phb *phb,
__free_pages(tce_mem, get_order(tce32_segsz * segs));
if (tbl) {
pnv_pci_unlink_table_and_group(tbl, &pe->table_group);
-   iommu_free_table(tbl, "pnv");
+   iommu_table_put(tbl);
  

[PATCH kernel 1/9] KVM: PPC: Reserve KVM_CAP_SPAPR_TCE_VFIO capability number

2016-12-08 Thread Alexey Kardashevskiy
This adds a capability number for in-kernel support for VFIO on
SPAPR platform.

The capability will tell the user space whether the in-kernel handlers of
H_PUT_TCE can handle VFIO-targeted requests or not. If not, the user space
must not attempt allocating a TCE table in the host kernel via
the KVM_CREATE_SPAPR_TCE KVM ioctl because in that case TCE requests
would not be passed to the user space, which is where they need to be
handled in that situation.
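
User space would typically probe this with the standard
KVM_CHECK_EXTENSION ioctl, roughly as follows (a sketch; kvm_fd is
assumed to be an open /dev/kvm descriptor):

#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Returns non-zero when the in-kernel H_PUT_TCE handlers can serve
 * VFIO-targeted requests, 0 otherwise. */
static int spapr_tce_vfio_supported(int kvm_fd)
{
	return ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_SPAPR_TCE_VFIO) > 0;
}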

Signed-off-by: Alexey Kardashevskiy 
Reviewed-by: David Gibson 
---
 include/uapi/linux/kvm.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 300ef255d1e0..810f74317987 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -870,6 +870,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_S390_USER_INSTR0 130
 #define KVM_CAP_MSI_DEVID 131
 #define KVM_CAP_PPC_HTM 132
+#define KVM_CAP_SPAPR_TCE_VFIO 133
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
-- 
2.11.0



[PATCH kernel 0/9] powerpc/kvm/vfio: Enable in-kernel acceleration

2016-12-08 Thread Alexey Kardashevskiy
This is my current queue of patches to add acceleration of TCE
updates in KVM.

This is based on the "next" branch of
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git

I am not including a changelog here as it has been 4 months since the
last respin and I am sure everybody has lost the context anyway. I tried
to be as detailed as I could in the very last patch; the others are
pretty trivial.

Please comment. Thanks.


Alexey Kardashevskiy (9):
  KVM: PPC: Reserve KVM_CAP_SPAPR_TCE_VFIO capability number
  powerpc/iommu: Cleanup iommu_table disposal
  powerpc/vfio_spapr_tce: Add reference counting to iommu_table
  powerpc/mmu: Add real mode support for IOMMU preregistered memory
  KVM: PPC: Use preregistered memory API to access TCE list
  powerpc/powernv/iommu: Add real mode version of
iommu_table_ops::exchange()
  KVM: PPC: Enable IOMMU_API for KVM_BOOK3S_64 permanently
  KVM: PPC: Pass kvm* to kvmppc_find_table()
  KVM: PPC: Add in-kernel acceleration for VFIO

 Documentation/virtual/kvm/devices/vfio.txt |  21 +-
 arch/powerpc/include/asm/iommu.h   |  12 +-
 arch/powerpc/include/asm/kvm_host.h|   8 +
 arch/powerpc/include/asm/kvm_ppc.h |   7 +-
 arch/powerpc/include/asm/mmu_context.h |   4 +
 include/uapi/linux/kvm.h   |   9 +
 arch/powerpc/kernel/iommu.c|  49 -
 arch/powerpc/kvm/book3s_64_vio.c   | 309 -
 arch/powerpc/kvm/book3s_64_vio_hv.c| 256 ++--
 arch/powerpc/kvm/powerpc.c |   2 +
 arch/powerpc/mm/mmu_context_iommu.c|  39 
 arch/powerpc/platforms/powernv/pci-ioda.c  |  42 +++-
 arch/powerpc/platforms/powernv/pci.c   |   1 +
 arch/powerpc/platforms/pseries/iommu.c |   3 +-
 arch/powerpc/platforms/pseries/vio.c   |   2 +-
 drivers/vfio/vfio_iommu_spapr_tce.c|   2 +-
 virt/kvm/vfio.c| 108 ++
 arch/powerpc/kvm/Kconfig   |   1 +
 18 files changed, 828 insertions(+), 47 deletions(-)

-- 
2.11.0