[PATCH net-next] ibmveth: calculate correct gso_size and set gso_type

2016-10-24 Thread Jon Maxwell
We recently encountered a bug where a few customers using ibmveth on the
same LPAR hit an issue where a TCP session hung when large receive was
enabled. Closer analysis revealed that the session was stuck because one
side was advertising a zero window repeatedly.

We narrowed this down to the fact that the ibmveth driver did not set
gso_size, which is translated by TCP into the MSS later up the stack. The
MSS is used to calculate the TCP window size, and as it was abnormally
large, a zero window was calculated even though the socket's receive
buffer was completely empty.

We were able to reproduce this and worked with IBM to fix this. Thanks Tom 
and Marcelo for all your help and review on this.

The patch fixes both our internal reproduction tests and our customers' tests.

Signed-off-by: Jon Maxwell 
---
 drivers/net/ethernet/ibm/ibmveth.c | 19 +++
 1 file changed, 19 insertions(+)

diff --git a/drivers/net/ethernet/ibm/ibmveth.c b/drivers/net/ethernet/ibm/ibmveth.c
index 29c05d0..3028c33 100644
--- a/drivers/net/ethernet/ibm/ibmveth.c
+++ b/drivers/net/ethernet/ibm/ibmveth.c
@@ -1182,6 +1182,8 @@ static int ibmveth_poll(struct napi_struct *napi, int budget)
int frames_processed = 0;
unsigned long lpar_rc;
struct iphdr *iph;
+   bool large_packet = 0;
+   u16 hdr_len = ETH_HLEN + sizeof(struct tcphdr);
 
 restart_poll:
while (frames_processed < budget) {
@@ -1236,10 +1238,27 @@ static int ibmveth_poll(struct napi_struct *napi, int budget)
 				iph->check = 0;
 				iph->check = ip_fast_csum((unsigned char *)iph, iph->ihl);
 				adapter->rx_large_packets++;
+				large_packet = 1;
 			}
 		}
 	}
 
+		if (skb->len > netdev->mtu) {
+			iph = (struct iphdr *)skb->data;
+			if (be16_to_cpu(skb->protocol) == ETH_P_IP &&
+			    iph->protocol == IPPROTO_TCP) {
+				hdr_len += sizeof(struct iphdr);
+				skb_shinfo(skb)->gso_type = SKB_GSO_TCPV4;
+				skb_shinfo(skb)->gso_size = netdev->mtu - hdr_len;
+			} else if (be16_to_cpu(skb->protocol) == ETH_P_IPV6 &&
+				   iph->protocol == IPPROTO_TCP) {
+				hdr_len += sizeof(struct ipv6hdr);
+				skb_shinfo(skb)->gso_type = SKB_GSO_TCPV6;
+				skb_shinfo(skb)->gso_size = netdev->mtu - hdr_len;
+			}
+			if (!large_packet)
+				adapter->rx_large_packets++;
+		}
+
napi_gro_receive(napi, skb);/* send it up */
 
netdev->stats.rx_packets++;
-- 
1.8.3.1



Re: [RFC PATCH] powerpc/powernv: recheck lock bit in core_idle_lock_held after lwarx

2016-10-24 Thread Li Zhong

> On 25 Oct 2016, at 12:15, Paul Mackerras  wrote:
> 
> On Tue, Oct 25, 2016 at 11:24:34AM +0800, Li Zhong wrote:
>> The core_idle_lock_held loop spins while the lock bit is held by others.
>> However, it is possible that after the lock bit is cleared, someone
>> else sees it first and sets the lock bit again. The lwarx then loads a
>> value with the lock bit set, and the following stwcx. may clear the
>> lock bit while the first locker is still executing in the critical
>> section.
>> 
>> This patch rechecks the lock bit after lwarx, and goes back to the
>> loop if it is set.
> 
> You're quite correct, in fact I posted almost exactly the same patch a
> few days ago...  See http://patchwork.ozlabs.org/patch/684963/.


Forgot to check the list before sending … 

One minor difference, maybe we can use bne- for the rechecking :)

Thanks, Zhong

> 
> Thanks,
> Paul.
> 



Re: [PATCH kernel v4 4/4] powerpc/mm/iommu, vfio/spapr: Put pages on VFIO container shutdown

2016-10-24 Thread Alexey Kardashevskiy
On 25/10/16 15:44, David Gibson wrote:
> On Mon, Oct 24, 2016 at 05:53:10PM +1100, Alexey Kardashevskiy wrote:
>> At the moment the userspace tool is expected to request pinning of
>> the entire guest RAM when VFIO IOMMU SPAPR v2 driver is present.
>> When the userspace process finishes, all the pinned pages need to
>> be put; this is done as a part of the userspace memory context (MM)
>> destruction which happens on the very last mmdrop().
>>
>> This approach has a problem: the MM of the userspace process
>> may live longer than the userspace process itself, as kernel threads
>> borrow the MM of the userspace process that was running on the CPU
>> where the kernel thread was scheduled. If this happens, the MM remains
>> referenced until that exact kernel thread wakes up again
>> and releases the very last reference to the MM; on an idle system this
>> can take hours.
>>
>> This moves preregistered-region tracking from the MM to VFIO; instead
>> of using mm_iommu_table_group_mem_t::used, tce_container::prereg_list
>> is added so that each container releases the regions it has pre-registered.
>>
>> This changes the userspace interface to return EBUSY if a memory
>> region is already registered in a container. However, it should not
>> have any practical effect, as the only userspace tool available now
>> registers a memory region only once per container anyway.
>>
>> As tce_iommu_register_pages/tce_iommu_unregister_pages are called
>> under container->lock, this does not need additional locking.
>>
>> Signed-off-by: Alexey Kardashevskiy 
>> Reviewed-by: Nicholas Piggin 
> 
> On the grounds that this leaves things in a better state than before:
> 
> Reviewed-by: David Gibson 
> 
> On the other hand the implementation is kind of clunky, with the way
> it keeps the mm-level and vfio-level lists of regions in parallel.
> With this change, does the mm-level list actually serve any purpose at
> all, or could it all be moved into the vfio-level list?


The mm-level list allows not having gup() called for each container (a
minor thing, I suppose), and it also tracks the number of active mappings,
which will become useful when we add in-kernel real-mode TCE acceleration,
as vfio-level code cannot run in real mode.



> 
>> ---
>> Changes:
>> v4:
>> * changed tce_iommu_register_pages() to call mm_iommu_find() first and
>> avoid calling mm_iommu_put() if memory is preregistered already
>>
>> v3:
>> * moved tce_iommu_prereg_free() call out of list_for_each_entry()
>>
>> v2:
>> * updated commit log
>> ---
>>  arch/powerpc/mm/mmu_context_book3s64.c |  4 ---
>>  arch/powerpc/mm/mmu_context_iommu.c| 11 ---
>>  drivers/vfio/vfio_iommu_spapr_tce.c| 58 +-
>>  3 files changed, 57 insertions(+), 16 deletions(-)
>>
>> diff --git a/arch/powerpc/mm/mmu_context_book3s64.c b/arch/powerpc/mm/mmu_context_book3s64.c
>> index ad82735..1a07969 100644
>> --- a/arch/powerpc/mm/mmu_context_book3s64.c
>> +++ b/arch/powerpc/mm/mmu_context_book3s64.c
>> @@ -159,10 +159,6 @@ static inline void destroy_pagetable_page(struct mm_struct *mm)
>>  
>>  void destroy_context(struct mm_struct *mm)
>>  {
>> -#ifdef CONFIG_SPAPR_TCE_IOMMU
>> -mm_iommu_cleanup(mm);
>> -#endif
>> -
>>  #ifdef CONFIG_PPC_ICSWX
>>  drop_cop(mm->context.acop, mm);
>>  kfree(mm->context.cop_lockp);
>> diff --git a/arch/powerpc/mm/mmu_context_iommu.c b/arch/powerpc/mm/mmu_context_iommu.c
>> index 4c6db09..104bad0 100644
>> --- a/arch/powerpc/mm/mmu_context_iommu.c
>> +++ b/arch/powerpc/mm/mmu_context_iommu.c
>> @@ -365,14 +365,3 @@ void mm_iommu_init(struct mm_struct *mm)
>>  {
>>  INIT_LIST_HEAD_RCU(&mm->context.iommu_group_mem_list);
>>  }
>> -
>> -void mm_iommu_cleanup(struct mm_struct *mm)
>> -{
>> -struct mm_iommu_table_group_mem_t *mem, *tmp;
>> -
>> -	list_for_each_entry_safe(mem, tmp, &mm->context.iommu_group_mem_list,
>> -			next) {
>> -		list_del_rcu(&mem->next);
>> -mm_iommu_do_free(mem);
>> -}
>> -}
>> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
>> index 81ab93f..001a488 100644
>> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
>> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
>> @@ -86,6 +86,15 @@ struct tce_iommu_group {
>>  };
>>  
>>  /*
>> + * A container needs to remember which preregistered region it has
>> + * referenced to do proper cleanup at the userspace process exit.
>> + */
>> +struct tce_iommu_prereg {
>> +struct list_head next;
>> +struct mm_iommu_table_group_mem_t *mem;
>> +};
>> +
>> +/*
>>   * The container descriptor supports only a single group per container.
>>   * Required by the API as the container is not supplied with the IOMMU group
>>   * at the moment of initialization.
>> @@ -98,12 +107,27 @@ struct tce_container {
>>  struct mm_struct *mm;
>>  struct iommu_table *tables[IOMMU_TABLE_GROUP_MAX_TABLES];
>>  struct list_head 

Re: [PATCH kernel v4 4/4] powerpc/mm/iommu, vfio/spapr: Put pages on VFIO container shutdown

2016-10-24 Thread David Gibson
On Mon, Oct 24, 2016 at 05:53:10PM +1100, Alexey Kardashevskiy wrote:
> At the moment the userspace tool is expected to request pinning of
> the entire guest RAM when VFIO IOMMU SPAPR v2 driver is present.
> When the userspace process finishes, all the pinned pages need to
> be put; this is done as a part of the userspace memory context (MM)
> destruction which happens on the very last mmdrop().
> 
> This approach has a problem: the MM of the userspace process
> may live longer than the userspace process itself, as kernel threads
> borrow the MM of the userspace process that was running on the CPU
> where the kernel thread was scheduled. If this happens, the MM remains
> referenced until that exact kernel thread wakes up again
> and releases the very last reference to the MM; on an idle system this
> can take hours.
> 
> This moves preregistered-region tracking from the MM to VFIO; instead
> of using mm_iommu_table_group_mem_t::used, tce_container::prereg_list
> is added so that each container releases the regions it has pre-registered.
> 
> This changes the userspace interface to return EBUSY if a memory
> region is already registered in a container. However, it should not
> have any practical effect, as the only userspace tool available now
> registers a memory region only once per container anyway.
> 
> As tce_iommu_register_pages/tce_iommu_unregister_pages are called
> under container->lock, this does not need additional locking.
> 
> Signed-off-by: Alexey Kardashevskiy 
> Reviewed-by: Nicholas Piggin 

On the grounds that this leaves things in a better state than before:

Reviewed-by: David Gibson 

On the other hand the implementation is kind of clunky, with the way
it keeps the mm-level and vfio-level lists of regions in parallel.
With this change, does the mm-level list actually serve any purpose at
all, or could it all be moved into the vfio-level list?

> ---
> Changes:
> v4:
> * changed tce_iommu_register_pages() to call mm_iommu_find() first and
> avoid calling mm_iommu_put() if memory is preregistered already
> 
> v3:
> * moved tce_iommu_prereg_free() call out of list_for_each_entry()
> 
> v2:
> * updated commit log
> ---
>  arch/powerpc/mm/mmu_context_book3s64.c |  4 ---
>  arch/powerpc/mm/mmu_context_iommu.c| 11 ---
>  drivers/vfio/vfio_iommu_spapr_tce.c| 58 +-
>  3 files changed, 57 insertions(+), 16 deletions(-)
> 
> diff --git a/arch/powerpc/mm/mmu_context_book3s64.c b/arch/powerpc/mm/mmu_context_book3s64.c
> index ad82735..1a07969 100644
> --- a/arch/powerpc/mm/mmu_context_book3s64.c
> +++ b/arch/powerpc/mm/mmu_context_book3s64.c
> @@ -159,10 +159,6 @@ static inline void destroy_pagetable_page(struct mm_struct *mm)
>  
>  void destroy_context(struct mm_struct *mm)
>  {
> -#ifdef CONFIG_SPAPR_TCE_IOMMU
> - mm_iommu_cleanup(mm);
> -#endif
> -
>  #ifdef CONFIG_PPC_ICSWX
>   drop_cop(mm->context.acop, mm);
>   kfree(mm->context.cop_lockp);
> diff --git a/arch/powerpc/mm/mmu_context_iommu.c b/arch/powerpc/mm/mmu_context_iommu.c
> index 4c6db09..104bad0 100644
> --- a/arch/powerpc/mm/mmu_context_iommu.c
> +++ b/arch/powerpc/mm/mmu_context_iommu.c
> @@ -365,14 +365,3 @@ void mm_iommu_init(struct mm_struct *mm)
>  {
>   INIT_LIST_HEAD_RCU(&mm->context.iommu_group_mem_list);
>  }
> -
> -void mm_iommu_cleanup(struct mm_struct *mm)
> -{
> - struct mm_iommu_table_group_mem_t *mem, *tmp;
> -
> -	list_for_each_entry_safe(mem, tmp, &mm->context.iommu_group_mem_list,
> -			next) {
> -		list_del_rcu(&mem->next);
> - mm_iommu_do_free(mem);
> - }
> -}
> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
> index 81ab93f..001a488 100644
> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> @@ -86,6 +86,15 @@ struct tce_iommu_group {
>  };
>  
>  /*
> + * A container needs to remember which preregistered region it has
> + * referenced to do proper cleanup at the userspace process exit.
> + */
> +struct tce_iommu_prereg {
> + struct list_head next;
> + struct mm_iommu_table_group_mem_t *mem;
> +};
> +
> +/*
>   * The container descriptor supports only a single group per container.
>   * Required by the API as the container is not supplied with the IOMMU group
>   * at the moment of initialization.
> @@ -98,12 +107,27 @@ struct tce_container {
>   struct mm_struct *mm;
>   struct iommu_table *tables[IOMMU_TABLE_GROUP_MAX_TABLES];
>   struct list_head group_list;
> + struct list_head prereg_list;
>  };
>  
> +static long tce_iommu_prereg_free(struct tce_container *container,
> + struct tce_iommu_prereg *tcemem)
> +{
> + long ret;
> +
> +	list_del(&tcemem->next);
> + ret = mm_iommu_put(container->mm, tcemem->mem);
> + kfree(tcemem);
> +
> + return ret;
> +}
> +
>  static long tce_iommu_unregister_pages(struct tce_container 

Re: [RFC PATCH] powerpc/powernv: recheck lock bit in core_idle_lock_held after lwarx

2016-10-24 Thread Paul Mackerras
On Tue, Oct 25, 2016 at 11:24:34AM +0800, Li Zhong wrote:
> The core_idle_lock_held loop spins while the lock bit is held by others.
> However, it is possible that after the lock bit is cleared, someone
> else sees it first and sets the lock bit again. The lwarx then loads a
> value with the lock bit set, and the following stwcx. may clear the
> lock bit while the first locker is still executing in the critical
> section.
> 
> This patch rechecks the lock bit after lwarx, and goes back to the
> loop if it is set.

You're quite correct, in fact I posted almost exactly the same patch a
few days ago...  See http://patchwork.ozlabs.org/patch/684963/.

Thanks,
Paul.


[PATCH] powerpc: Use pr_warn instead of pr_warning

2016-10-24 Thread Joe Perches
At some point, pr_warning will be removed so all logging messages use
a consistent _warn style.

Update arch/powerpc/

Miscellanea:

o Coalesce formats
o Realign arguments
o Use %s, __func__ instead of embedded function names
o Remove unnecessary line continuations

Signed-off-by: Joe Perches 
---
 arch/powerpc/kernel/pci-common.c|  4 ++--
 arch/powerpc/mm/init_64.c   |  5 ++---
 arch/powerpc/mm/mem.c   |  3 +--
 arch/powerpc/platforms/512x/mpc512x_shared.c|  4 ++--
 arch/powerpc/platforms/85xx/socrates_fpga_pic.c |  7 +++
 arch/powerpc/platforms/86xx/mpc86xx_hpcn.c  |  2 +-
 arch/powerpc/platforms/pasemi/dma_lib.c |  4 ++--
 arch/powerpc/platforms/powernv/opal.c   |  8 
 arch/powerpc/platforms/powernv/pci-ioda.c   | 10 +-
 arch/powerpc/platforms/ps3/device-init.c| 12 +---
 arch/powerpc/platforms/ps3/mm.c |  4 ++--
 arch/powerpc/platforms/ps3/os-area.c|  2 +-
 arch/powerpc/platforms/pseries/iommu.c  |  8 +++-
 arch/powerpc/platforms/pseries/setup.c  |  4 ++--
 arch/powerpc/sysdev/fsl_pci.c   |  9 +++--
 arch/powerpc/sysdev/mpic.c  | 10 --
 arch/powerpc/sysdev/xics/icp-native.c   | 10 --
 arch/powerpc/sysdev/xics/ics-opal.c |  4 +--
 arch/powerpc/sysdev/xics/ics-rtas.c |  4 ++--
 arch/powerpc/sysdev/xics/xics-common.c  |  8 
 20 files changed, 54 insertions(+), 68 deletions(-)

diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
index 74bec5498972..338f7f7dbfb2 100644
--- a/arch/powerpc/kernel/pci-common.c
+++ b/arch/powerpc/kernel/pci-common.c
@@ -1267,8 +1267,8 @@ static void pcibios_allocate_bus_resources(struct pci_bus *bus)
i + PCI_BRIDGE_RESOURCES) == 0)
continue;
}
-	pr_warning("PCI: Cannot allocate resource region "
-		   "%d of PCI bridge %d, will remap\n", i, bus->number);
+	pr_warn("PCI: Cannot allocate resource region %d of PCI bridge %d, will remap\n",
+		i, bus->number);
clear_resource:
/* The resource might be figured out when doing
 * reassignment based on the resources required
diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index 16ada1eb7e26..b5c18fe202e5 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -267,9 +267,8 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node)
 
rc = vmemmap_create_mapping(start, page_size, __pa(p));
if (rc < 0) {
-	pr_warning(
-		"vmemmap_populate: Unable to create vmemmap mapping: %d\n",
-		rc);
+	pr_warn("%s: Unable to create vmemmap mapping: %d\n",
+		__func__, rc);
return -EFAULT;
}
}
diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index 5f844337de21..c4ec6e63ed43 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -139,8 +139,7 @@ int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
start = (unsigned long)__va(start);
rc = create_section_mapping(start, start + size);
if (rc) {
-	pr_warning(
-		"Unable to create mapping for hot added memory 0x%llx..0x%llx: %d\n",
+	pr_warn("Unable to create mapping for hot added memory 0x%llx..0x%llx: %d\n",
 		start, start + size, rc);
return -EFAULT;
}
diff --git a/arch/powerpc/platforms/512x/mpc512x_shared.c b/arch/powerpc/platforms/512x/mpc512x_shared.c
index 6b4f4cb7009a..1592b7ce9635 100644
--- a/arch/powerpc/platforms/512x/mpc512x_shared.c
+++ b/arch/powerpc/platforms/512x/mpc512x_shared.c
@@ -387,8 +387,8 @@ static unsigned int __init get_fifo_size(struct device_node *np,
if (fp)
return *fp;
 
-	pr_warning("no %s property in %s node, defaulting to %d\n",
-		   prop_name, np->full_name, DEFAULT_FIFO_SIZE);
+	pr_warn("no %s property in %s node, defaulting to %d\n",
+		prop_name, np->full_name, DEFAULT_FIFO_SIZE);
 
return DEFAULT_FIFO_SIZE;
 }
diff --git a/arch/powerpc/platforms/85xx/socrates_fpga_pic.c b/arch/powerpc/platforms/85xx/socrates_fpga_pic.c
index 82f8490b5aa7..38d4ba9f37b5 100644
--- a/arch/powerpc/platforms/85xx/socrates_fpga_pic.c
+++ b/arch/powerpc/platforms/85xx/socrates_fpga_pic.c
@@ -252,8 +252,7 @@ static int socrates_fpga_pic_host_xlate(struct irq_domain *h,
/* type is configurable */
if (intspec[1] != IRQ_TYPE_LEVEL_LOW &&
   

Re: [PATCH v2 2/2] PCI: Don't disable PF's memory decoding when enabling SRIOV

2016-10-24 Thread Bjorn Helgaas
On Tue, Oct 25, 2016 at 12:47:28PM +1100, Gavin Shan wrote:
> On Mon, Oct 24, 2016 at 09:03:16AM -0500, Bjorn Helgaas wrote:
> >On Mon, Oct 24, 2016 at 10:28:02AM +1100, Gavin Shan wrote:
> >> On Fri, Oct 21, 2016 at 11:50:34AM -0500, Bjorn Helgaas wrote:
> >> >Hi Gavin,
> >> >
> >> >On Fri, Sep 30, 2016 at 09:47:50AM +1000, Gavin Shan wrote:
> >> >> pci_update_resource() might be called to update (shift) IOV BARs
> >> >> in the PPC PowerNV specific pcibios_sriov_enable() when enabling the
> >> >> PF's SRIOV capability. At that point, the PF may already be functional
> >> >> if SRIOV is enabled through the sysfs entry "sriov_numvfs". The PF's
> >> >> memory decoding (0x2 in PCI_COMMAND) shouldn't be disabled when
> >> >> updating its IOV BARs with pci_update_resource(). Otherwise, we
> >> >> receive EEH errors caused by MMIO access to the PF's memory BARs during
> >> >> the window when the PF's memory decoding is disabled.
> >> >
> >> >The fact that you get EEH errors is irrelevant.  We can't write code
> >> >simply to avoid errors -- we have to write code to make the system
> >> >work correctly.
> >> >
> >> >I do not want to add a "mmio_force_on" parameter to
> >> >pci_update_resource().  That puts the burden on the caller to
> >> >understand this subtle issue.  If the caller passes mmio_force_on=1
> >> >when it shouldn't, things will almost always work, but once in a blue
> >> >moon a half-updated BAR will conflict with some other device in the
> >> >system, and we'll have an unreproducible, undebuggable crash.
> >> >
> >> 
> >> Bjorn, thanks for your comments. Yes, the EEH error was caused by MMIO
> >> access to the PF's normal BARs, not VF BARs. Yeah, I also had the concern
> >> that adding a parameter to pci_update_resource() increases the visible
> >> complexity for the caller of the function.
> >> 
> >> >What you do need is an explanation of why it's safe to non-atomically
> >> >update a VF BARx in the SR-IOV capability.  I think this probably
> >> >involves the VF MSE bit, and you probably have to either disable VFs
> >> >completely or clear the MSE bit while you're updating the VF BARx.  We
> >> >should be able to do this inside pci_update_resource() without
> >> >changing the interface.
> >> >
> >> 
> >> Yes, it's what PATCH[1/2] does: (PCI_SRIOV_CTRL_VFE | PCI_SRIOV_CTRL_MSE)
> >> are set after the VF BARs are updated with pci_update_resource() in this
> >> PPC-specific scenario. There are two other situations where the IOV BARs
> >> are updated: PCI resource resizing and allocation during system boot or
> >> when hot adding a PF. The VFs shouldn't be enabled in either case when
> >> updating the IOV BARs.
> >> 
> >> I think you suggest adding a check in pci_update_resource(): don't disable
> >> the PF's memory decoding when updating VF BARs. I will send an updated
> >> revision if that's what you want.
> >> 
> >>/*
> >> * The PF might be functional when updating its IOV BARs. So PF's
> >> * memory decoding shouldn't be disabled when updating its IOV BARs.
> >> */
> >>disable = (res->flags & IORESOURCE_MEM_64) && !dev->mmio_always_on;
> >> #ifdef CONFIG_PCI_IOV
> >>disable &= !(resno >= PCI_IOV_RESOURCES &&
> >> resno <= PCI_IOV_RESOURCE_END);
> >> #endif
> >>if (disable) {
> >>	pci_read_config_word(dev, PCI_COMMAND, &cmd);
> >>pci_write_config_word(dev, PCI_COMMAND,
> >>  cmd & ~PCI_COMMAND_MEMORY);
> >>}
> >
> >Not exactly.  A 64-bit BAR cannot be updated atomically.  The whole
> >point of this exercise is to make sure that when we update such a BAR,
> >whether it is a normal PCI BAR or an SR-IOV BAR, the transient state
> >does not conflict with anything else in the system.
> >
> >For example, consider two devices that do not conflict:
> >
> >  device A BAR 0: 0x0000_2000
> >  device B BAR 0: 0x0000_4000
> >
> >We want to update A's BAR 0 to 0x0001_4000.  We can't do the
> >update atomically, so we have this sequence:
> >
> >  before update:            device A BAR 0: 0x0000_2000
> >  after writing lower half: device A BAR 0: 0x0000_4000
> >  after writing upper half: device A BAR 0: 0x0001_4000
> >
> >If the driver for device B accesses B between the writes, both A and B
> >may respond, which is a fatal error.
> >
> 
> Thanks for the detailed explanation. Apart from pdev->mmio_always_on,
> the normal BARs are updated with PCI_COMMAND_MEMORY cleared. With
> PATCH[1/2], the IOV BARs are updated with PCI_SRIOV_CTRL_MSE and
> PCI_SRIOV_CTRL_VFE cleared in the problematic path the patch tried
> to address initially. However, I prefer your suggestion at the end of
> the reply.
> 
> >Probably the *best* thing would be to make pci_update_resource()
> >return an error if it's asked to update a BAR that is enabled, but I
> >haven't looked at all the callers to see whether that's practical.
> >
> 
> It arguably forces users to tackle the limitation: the memory
> decoding (PCI_COMMAND_MEMORY or 

[RFC PATCH] powerpc/powernv: recheck lock bit in core_idle_lock_held after lwarx

2016-10-24 Thread Li Zhong
The core_idle_lock_held loop spins while the lock bit is held by others.
However, it is possible that after the lock bit is cleared, someone
else sees it first and sets the lock bit again. The lwarx then loads a
value with the lock bit set, and the following stwcx. may clear the
lock bit while the first locker is still executing in the critical
section.

This patch rechecks the lock bit after lwarx, and goes back to the loop
if it is set.

Signed-off-by: Li Zhong 
---
 arch/powerpc/kernel/idle_book3s.S | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/kernel/idle_book3s.S b/arch/powerpc/kernel/idle_book3s.S
index bd739fe..ce07b3f 100644
--- a/arch/powerpc/kernel/idle_book3s.S
+++ b/arch/powerpc/kernel/idle_book3s.S
@@ -99,6 +99,8 @@ core_idle_lock_held:
bne 3b
HMT_MEDIUM
lwarx   r15,0,r14
+   andi.   r9,r15,PNV_CORE_IDLE_LOCK_BIT
+   bne-core_idle_lock_held
blr
 
 /*
-- 
1.9.1





RE: [PATCH V2 1/5] powerpc/mpc85xx: Update TMU device tree node for T1040/T1042

2016-10-24 Thread Troy Jia

> -Original Message-
> From: Shawn Guo [mailto:shawn...@kernel.org]
> Sent: Monday, October 24, 2016 4:37 PM
> To: Troy Jia 
> Cc: rui.zh...@intel.com; edubez...@gmail.com; robh...@kernel.org; Scott Wood
> ; devicet...@vger.kernel.org; linuxppc-
> d...@lists.ozlabs.org; linux-ker...@vger.kernel.org; linux-arm-
> ker...@lists.infradead.org
> Subject: Re: [PATCH V2 1/5] powerpc/mpc85xx: Update TMU device tree node for
> T1040/T1042
> 
> On Sun, Oct 09, 2016 at 02:47:02PM +0800, Jia Hongtao wrote:
> > From: Hongtao Jia 
> >
> > SoC compatible string and endianness property are added according to
> > the new bindings.
> 
> The commit log doesn't seem to match the actual changes.  Same for patch 2/5.

Right. I just revised the log and sent a new version. Please help review it.
Thanks.

-Hongtao.

> 
> >
> > Signed-off-by: Jia Hongtao 
> > ---
> > Changes for V2:
> > * Rebase on latest linux-next tree (next-20161006).
> >
> >  arch/powerpc/boot/dts/fsl/t1040si-post.dtsi | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/powerpc/boot/dts/fsl/t1040si-post.dtsi b/arch/powerpc/boot/dts/fsl/t1040si-post.dtsi
> > index 44e399b..145c7f4 100644
> > --- a/arch/powerpc/boot/dts/fsl/t1040si-post.dtsi
> > +++ b/arch/powerpc/boot/dts/fsl/t1040si-post.dtsi
> > @@ -526,7 +526,7 @@
> >
> >0x0003 0x0012
> >0x00030001 0x001d>;
> > -   #thermal-sensor-cells = <0>;
> > +   #thermal-sensor-cells = <1>;
> > };
> >
> > thermal-zones {
> > @@ -534,7 +534,7 @@
> > polling-delay-passive = <1000>;
> > polling-delay = <5000>;
> >
> > -   thermal-sensors = <&tmu>;
> > +   thermal-sensors = <&tmu 2>;
> >
> > trips {
> > cpu_alert: cpu-alert {
> > --
> > 2.1.0.27.g96db324
> >
> >
> > ___
> > linux-arm-kernel mailing list
> > linux-arm-ker...@lists.infradead.org
> > http://lists.infradead.org/mailman/listinfo/linux-arm-kernel


[PATCH V3 2/2] powerpc/mpc85xx: Update TMU device tree node for T1023/T1024

2016-10-24 Thread Jia Hongtao
From: Hongtao Jia 

Update #thermal-sensor-cells from 0 to 1 according to the new binding. The
sensor specifier added is the monitoring site ID, and represents the "n" in
TRITSRn and TRATSRn.

Signed-off-by: Jia Hongtao 
---
Changes for V3:
* Update the commit log to a better description.

Changes for V2:
* Rebase on latest linux-next tree (next-20161006).

 arch/powerpc/boot/dts/fsl/t1023si-post.dtsi | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/boot/dts/fsl/t1023si-post.dtsi b/arch/powerpc/boot/dts/fsl/t1023si-post.dtsi
index 6e0b489..bce762a 100644
--- a/arch/powerpc/boot/dts/fsl/t1023si-post.dtsi
+++ b/arch/powerpc/boot/dts/fsl/t1023si-post.dtsi
@@ -321,7 +321,7 @@
   0x00030001 0x000d
   0x00030002 0x0019
   0x00030003 0x0024>;
-   #thermal-sensor-cells = <0>;
+   #thermal-sensor-cells = <1>;
};
 
thermal-zones {
@@ -329,7 +329,7 @@
polling-delay-passive = <1000>;
polling-delay = <5000>;
 
-   thermal-sensors = <&tmu>;
+   thermal-sensors = <&tmu 0>;
 
trips {
cpu_alert: cpu-alert {
-- 
2.1.0.27.g96db324



[PATCH V3 1/2] powerpc/mpc85xx: Update TMU device tree node for T1040/T1042

2016-10-24 Thread Jia Hongtao
From: Hongtao Jia 

Update #thermal-sensor-cells from 0 to 1 according to the new binding. The
sensor specifier added is the monitoring site ID, and represents the "n" in
TRITSRn and TRATSRn.

Signed-off-by: Jia Hongtao 
---
Changes for V3:
* Update the commit log to a better description.

Changes for V2:
* Rebase on latest linux-next tree (next-20161006).

 arch/powerpc/boot/dts/fsl/t1040si-post.dtsi | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/boot/dts/fsl/t1040si-post.dtsi b/arch/powerpc/boot/dts/fsl/t1040si-post.dtsi
index 44e399b..145c7f4 100644
--- a/arch/powerpc/boot/dts/fsl/t1040si-post.dtsi
+++ b/arch/powerpc/boot/dts/fsl/t1040si-post.dtsi
@@ -526,7 +526,7 @@
 
   0x0003 0x0012
   0x00030001 0x001d>;
-   #thermal-sensor-cells = <0>;
+   #thermal-sensor-cells = <1>;
};
 
thermal-zones {
@@ -534,7 +534,7 @@
polling-delay-passive = <1000>;
polling-delay = <5000>;
 
-   thermal-sensors = <&tmu>;
+   thermal-sensors = <&tmu 2>;
 
trips {
cpu_alert: cpu-alert {
-- 
2.1.0.27.g96db324



Re: [PATCH v2 2/2] PCI: Don't disable PF's memory decoding when enabling SRIOV

2016-10-24 Thread Gavin Shan
On Mon, Oct 24, 2016 at 09:03:16AM -0500, Bjorn Helgaas wrote:
>On Mon, Oct 24, 2016 at 10:28:02AM +1100, Gavin Shan wrote:
>> On Fri, Oct 21, 2016 at 11:50:34AM -0500, Bjorn Helgaas wrote:
>> >Hi Gavin,
>> >
>> >On Fri, Sep 30, 2016 at 09:47:50AM +1000, Gavin Shan wrote:
>> >> pci_update_resource() might be called to update (shift) IOV BARs
>> >> in the PPC PowerNV specific pcibios_sriov_enable() when enabling the
>> >> PF's SRIOV capability. At that point, the PF may already be functional
>> >> if SRIOV is enabled through the sysfs entry "sriov_numvfs". The PF's
>> >> memory decoding (0x2 in PCI_COMMAND) shouldn't be disabled when
>> >> updating its IOV BARs with pci_update_resource(). Otherwise, we
>> >> receive EEH errors caused by MMIO access to the PF's memory BARs during
>> >> the window when the PF's memory decoding is disabled.
>> >
>> >The fact that you get EEH errors is irrelevant.  We can't write code
>> >simply to avoid errors -- we have to write code to make the system
>> >work correctly.
>> >
>> >I do not want to add a "mmio_force_on" parameter to
>> >pci_update_resource().  That puts the burden on the caller to
>> >understand this subtle issue.  If the caller passes mmio_force_on=1
>> >when it shouldn't, things will almost always work, but once in a blue
>> >moon a half-updated BAR will conflict with some other device in the
>> >system, and we'll have an unreproducible, undebuggable crash.
>> >
>> 
>> Bjorn, thanks for your comments. Yes, the EEH error was caused by MMIO
>> access to the PF's normal BARs, not VF BARs. Yeah, I also had the concern
>> that adding a parameter to pci_update_resource() increases the visible
>> complexity for the caller of the function.
>> 
>> >What you do need is an explanation of why it's safe to non-atomically
>> >update a VF BARx in the SR-IOV capability.  I think this probably
>> >involves the VF MSE bit, and you probably have to either disable VFs
>> >completely or clear the MSE bit while you're updating the VF BARx.  We
>> >should be able to do this inside pci_update_resource() without
>> >changing the interface.
>> >
>> 
>> Yes, it's what PATCH[1/2] does: (PCI_SRIOV_CTRL_VFE | PCI_SRIOV_CTRL_MSE)
>> are set after the VF BARs are updated with pci_update_resource() in this
>> PPC-specific scenario. There are two other situations where the IOV BARs
>> are updated: PCI resource resizing and allocation during system boot or
>> when hot adding a PF. The VFs shouldn't be enabled in either case when
>> updating the IOV BARs.
>> 
>> I think you suggest adding a check in pci_update_resource(): don't disable
>> the PF's memory decoding when updating VF BARs. I will send an updated
>> revision if that's what you want.
>> 
>>  /*
>>   * The PF might be functional when updating its IOV BARs. So PF's
>>   * memory decoding shouldn't be disabled when updating its IOV BARs.
>>   */
>>  disable = (res->flags & IORESOURCE_MEM_64) && !dev->mmio_always_on;
>> #ifdef CONFIG_PCI_IOV
>>  disable &= !(resno >= PCI_IOV_RESOURCES &&
>>   resno <= PCI_IOV_RESOURCE_END);
>> #endif
>>  if (disable) {
>>  	pci_read_config_word(dev, PCI_COMMAND, &cmd);
>>  pci_write_config_word(dev, PCI_COMMAND,
>>cmd & ~PCI_COMMAND_MEMORY);
>>  }
>
>Not exactly.  A 64-bit BAR cannot be updated atomically.  The whole
>point of this exercise is to make sure that when we update such a BAR,
>whether it is a normal PCI BAR or an SR-IOV BAR, the transient state
>does not conflict with anything else in the system.
>
>For example, consider two devices that do not conflict:
>
>  device A BAR 0: 0x0000_2000
>  device B BAR 0: 0x0000_4000
>
>We want to update A's BAR 0 to 0x0001_4000.  We can't do the
>update atomically, so we have this sequence:
>
>  before update:device A BAR 0: 0x_2000
>  after writing lower half: device A BAR 0: 0x_4000
>  after writing upper half: device A BAR 0: 0x0001_4000
>
>If the driver for device B accesses B between the writes, both A and B
>may respond, which is a fatal error.
>
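The example above can be modeled in plain userspace C. This is a toy sketch (the struct and function names are made up, not kernel code): moving device A's BAR 0 from 0x0000_2000 to 0x0001_4000 transiently leaves it decoding 0x0000_4000, which is device B's range, so memory decoding (PCI_COMMAND_MEMORY, or the VF MSE bit for SR-IOV BARs) must be cleared around the two writes.

```c
/* Toy model (not kernel code) of the hazard: a 64-bit BAR is two
 * 32-bit config registers, so between the two writes the device
 * transiently decodes an address it was never assigned. */
#include <assert.h>
#include <stdint.h>

struct bar64 { uint32_t lo, hi; };

static uint64_t bar_addr(const struct bar64 *b)
{
	return ((uint64_t)b->hi << 32) | b->lo;
}

/* Non-atomic update: record the address visible between the writes. */
static uint64_t update_bar(struct bar64 *b, uint64_t new_addr,
			   uint64_t *transient)
{
	b->lo = (uint32_t)new_addr;		/* first config write */
	*transient = bar_addr(b);		/* decoded here if MSE still set */
	b->hi = (uint32_t)(new_addr >> 32);	/* second config write */
	return bar_addr(b);
}
```

With device A's BAR starting at 0x0000_2000, updating it to 0x0001_4000 yields a transient value of 0x0000_4000, exactly device B's BAR, which is why both devices could respond if decoding were still enabled.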

Thanks for the detailed explanation. Apart from the pdev->mmio_always_on
case, the normal BARs are updated with PCI_COMMAND_MEMORY cleared. With
PATCH[1/2], the IOV BARs are updated with PCI_SRIOV_CTRL_MSE and
PCI_SRIOV_CTRL_VFE cleared in the problematic path the patch originally
tried to address. However, I prefer the suggestion at the end of your
reply.

>Probably the *best* thing would be to make pci_update_resource()
>return an error if it's asked to update a BAR that is enabled, but I
>haven't looked at all the callers to see whether that's practical.
>

It would arguably force users to tackle the limitation: memory
decoding (PCI_COMMAND_MEMORY or PCI_SRIOV_CTRL_VFE) should be
disabled before updating the BARs with pci_update_resource().
It means users can no longer call the APIs in the relatively
relaxed order that was allowed before, for example
pci_enable_device() followed by pci_update_resource().

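The "return an error" idea can be sketched as a simple guard; this is a hypothetical model with made-up names, not the kernel implementation:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy model of a BAR whose decode enable must be off before updating. */
struct bar_model {
	bool decode_enabled;	/* stands in for PCI_COMMAND_MEMORY / VF MSE */
	unsigned long long addr;
};

/* Refuse the non-atomic update while decoding is still enabled. */
static int update_bar_checked(struct bar_model *b, unsigned long long new_addr)
{
	if (b->decode_enabled)
		return -EBUSY;	/* caller must disable decoding first */
	b->addr = new_addr;
	return 0;
}
```

The point of the guard is exactly the ordering constraint discussed above: a caller that enables the device before updating its resources gets an error instead of a silently racy update.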
Re: [PATCH v4 5/5] x86, kvm: support vcpu preempted check

2016-10-24 Thread Pan Xinhui



在 2016/10/24 23:18, Paolo Bonzini 写道:



On 24/10/2016 17:14, Radim Krčmář wrote:

2016-10-24 16:39+0200, Paolo Bonzini:

On 19/10/2016 19:24, Radim Krčmář wrote:

+   if (vcpu->arch.st.msr_val & KVM_MSR_ENABLED)
+   if (kvm_read_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
+   &vcpu->arch.st.steal,
+   sizeof(struct kvm_steal_time)) == 0) {
+   vcpu->arch.st.steal.preempted = 1;
+   kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
+   &vcpu->arch.st.steal,
+   sizeof(struct kvm_steal_time));
+   }

Please name this block of code.  Something like
  kvm_steal_time_set_preempted(vcpu);


While at it:

1) the kvm_read_guest_cached is not necessary.  You can rig the call to
kvm_write_guest_cached so that it only writes vcpu->arch.st.steal.preempted.


I agree.  kvm_write_guest_cached() always writes from offset 0, so we'd
want a new function that allows specifying a starting offset.


Yeah, let's leave it for a follow-up then!


I think I can make a having-offset version. :)


Thanks,

Paolo


Using cached vcpu->arch.st.steal to avoid the read wouldn't be as good.
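The offset-writing variant being discussed, a kvm_write_guest_cached() that starts at a given byte offset so only the preempted field is written and no prior read is needed, can be sketched in userspace as follows (illustrative names, not the real KVM API):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy model of the cached guest steal-time area. */
struct steal_time_model {
	unsigned long steal;
	unsigned char preempted;
};

/* Write a single field of the cached area by byte offset, so no
 * read-modify-write of the whole structure is required first. */
static void write_guest_cached_offset(void *area, size_t offset,
				      const void *src, size_t len)
{
	memcpy((unsigned char *)area + offset, src, len);
}
```

A caller would pass offsetof(struct steal_time_model, preempted) and a one-byte source, leaving the rest of the area untouched.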







Re: [PATCH v6] powerpc: Do not make the entire heap executable

2016-10-24 Thread Kees Cook
On Thu, Oct 20, 2016 at 3:45 PM, Jason Gunthorpe
 wrote:
> On Tue, Oct 04, 2016 at 09:54:12AM -0700, Kees Cook wrote:
>> On Mon, Oct 3, 2016 at 5:18 PM, Michael Ellerman  wrote:
>> > Kees Cook  writes:
>> >
>> >> On Mon, Oct 3, 2016 at 9:13 AM, Denys Vlasenko  
>> >> wrote:
>> >>> On 32-bit powerpc the ELF PLT sections of binaries (built with --bss-plt,
>> >>> or with a toolchain which defaults to it) look like this:
>> > ...
>> >>>
>> >>> Signed-off-by: Jason Gunthorpe 
>> >>> Signed-off-by: Denys Vlasenko 
>> >>> Acked-by: Kees Cook 
>> >>> Acked-by: Michael Ellerman 
>> >>> CC: Benjamin Herrenschmidt 
>> >>> CC: Paul Mackerras 
>> >>> CC: "Aneesh Kumar K.V" 
>> >>> CC: Kees Cook 
>> >>> CC: Oleg Nesterov 
>> >>> CC: Michael Ellerman 
>> >>> CC: Florian Weimer 
>> >>> CC: linux...@kvack.org
>> >>> CC: linuxppc-dev@lists.ozlabs.org
>> >>> CC: linux-ker...@vger.kernel.org
>> >>> Changes since v5:
>> >>> * made do_brk_flags() error out if any bits other than VM_EXEC are set.
>> >>>   (Kees Cook: "With this, I'd be happy to Ack.")
>> >>>   See https://patchwork.ozlabs.org/patch/661595/
>> >>
>> >> Excellent, thanks for the v6! Should this go via the ppc tree or the -mm 
>> >> tree?
>> >
>> > -mm would be best, given the diffstat I think it's less likely to
>> >  conflict if it goes via -mm.
>>
>> Okay, excellent. Andrew, do you have this already in email? I think
>> you weren't on the explicit CC from the v6...
>
> FWIW (and ping),
>
> Tested-by: Jason Gunthorpe 
>
> On ARM32 (kirkwood) and PPC32 (405)
>
> For reference, here is the patchwork URL:
>
> https://patchwork.ozlabs.org/patch/677753/

Hi Andrew,

Can you pick this up?

Thanks!

-Kees

-- 
Kees Cook
Nexus Security


Re: [PATCH] powerpc/pseries: fix spelling mistake: "Attemping" -> "Attempting"

2016-10-24 Thread Colin Ian King
On 24/10/16 23:20, Joe Perches wrote:
> On Mon, 2016-10-24 at 15:13 -0700, Kees Cook wrote:
>> On Mon, Oct 24, 2016 at 3:02 PM, Colin King  wrote:
>>> From: Colin Ian King 
>>>
>>> trivial fix to spelling mistake in pr_debug message
>>>
>>> Signed-off-by: Colin Ian King 
>>> ---
>>>  arch/powerpc/platforms/pseries/hotplug-cpu.c | 2 +-
>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c 
>>> b/arch/powerpc/platforms/pseries/hotplug-cpu.c
>>> index a1b63e0..c8929cb 100644
>>> --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
>>> +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
>>> @@ -553,7 +553,7 @@ static ssize_t dlpar_cpu_remove(struct device_node *dn, 
>>> u32 drc_index)
>>>  {
>>> int rc;
>>>
>>> -   pr_debug("Attemping to remove CPU %s, drc index: %x\n",
>>> +   pr_debug("Attempting to remove CPU %s, drc index: %x\n",
>>>  dn->name, drc_index);
>>>
>>> rc = dlpar_offline_cpu(dn);
>>> --
>>> 2.9.3
>>
>> Reviewed-by: Kees Cook 
>>
>> scripts/spelling.txt should likely get an addition for "attemping". It
>> already has "attemps"...
> 
> There may not be enough of these to be useful, and to my
> knowledge Colin isn't fixing comments.

Indeed, I'm just using my kernelscan tool to scrape for errors in kernel
messages, as these are seen by users, as opposed to comments. Plus the
number of occurrences of the mistakes I'm finding is generally small.

> 
> $ grep -rP --include=*.[ch] -wi "at+emp*t?ing" * | grep -vi attempting
> arch/x86/platform/uv/uv_nmi.c:/* Ping non-responding cpus attemping to force them into the NMI handler */
> arch/powerpc/platforms/pseries/hotplug-cpu.c: pr_debug("Attemping to remove CPU %s, drc index: %x\n",
> 



Re: [PATCH] powerpc/pseries: fix spelling mistake: "Attemping" -> "Attempting"

2016-10-24 Thread Joe Perches
On Mon, 2016-10-24 at 15:13 -0700, Kees Cook wrote:
> On Mon, Oct 24, 2016 at 3:02 PM, Colin King  wrote:
> > From: Colin Ian King 
> > 
> > trivial fix to spelling mistake in pr_debug message
> > 
> > Signed-off-by: Colin Ian King 
> > ---
> >  arch/powerpc/platforms/pseries/hotplug-cpu.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c 
> > b/arch/powerpc/platforms/pseries/hotplug-cpu.c
> > index a1b63e0..c8929cb 100644
> > --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
> > +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
> > @@ -553,7 +553,7 @@ static ssize_t dlpar_cpu_remove(struct device_node *dn, 
> > u32 drc_index)
> >  {
> > int rc;
> > 
> > -   pr_debug("Attemping to remove CPU %s, drc index: %x\n",
> > +   pr_debug("Attempting to remove CPU %s, drc index: %x\n",
> >  dn->name, drc_index);
> > 
> > rc = dlpar_offline_cpu(dn);
> > --
> > 2.9.3
> 
> Reviewed-by: Kees Cook 
> 
> scripts/spelling.txt should likely get an addition for "attemping". It
> already has "attemps"...

There may not be enough of these to be useful, and to my
knowledge Colin isn't fixing comments.

$ grep -rP --include=*.[ch] -wi "at+emp*t?ing" * | grep -vi attempting
arch/x86/platform/uv/uv_nmi.c:/* Ping non-responding cpus attemping to force them into the NMI handler */
arch/powerpc/platforms/pseries/hotplug-cpu.c:   pr_debug("Attemping to remove CPU %s, drc index: %x\n",



Re: [PATCH] powerpc/pseries: fix spelling mistake: "Attemping" -> "Attempting"

2016-10-24 Thread Kees Cook
On Mon, Oct 24, 2016 at 3:02 PM, Colin King  wrote:
> From: Colin Ian King 
>
> trivial fix to spelling mistake in pr_debug message
>
> Signed-off-by: Colin Ian King 
> ---
>  arch/powerpc/platforms/pseries/hotplug-cpu.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c 
> b/arch/powerpc/platforms/pseries/hotplug-cpu.c
> index a1b63e0..c8929cb 100644
> --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
> +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
> @@ -553,7 +553,7 @@ static ssize_t dlpar_cpu_remove(struct device_node *dn, 
> u32 drc_index)
>  {
> int rc;
>
> -   pr_debug("Attemping to remove CPU %s, drc index: %x\n",
> +   pr_debug("Attempting to remove CPU %s, drc index: %x\n",
>  dn->name, drc_index);
>
> rc = dlpar_offline_cpu(dn);
> --
> 2.9.3

Reviewed-by: Kees Cook 

scripts/spelling.txt should likely get an addition for "attemping". It
already has "attemps"...

-Kees

-- 
Kees Cook
Nexus Security


[PATCH] powerpc/pseries: fix spelling mistake: "Attemping" -> "Attempting"

2016-10-24 Thread Colin King
From: Colin Ian King 

trivial fix to spelling mistake in pr_debug message

Signed-off-by: Colin Ian King 
---
 arch/powerpc/platforms/pseries/hotplug-cpu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c 
b/arch/powerpc/platforms/pseries/hotplug-cpu.c
index a1b63e0..c8929cb 100644
--- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -553,7 +553,7 @@ static ssize_t dlpar_cpu_remove(struct device_node *dn, u32 
drc_index)
 {
int rc;
 
-   pr_debug("Attemping to remove CPU %s, drc index: %x\n",
+   pr_debug("Attempting to remove CPU %s, drc index: %x\n",
 dn->name, drc_index);
 
rc = dlpar_offline_cpu(dn);
-- 
2.9.3



Re: [PATCH 3/3] corenet: Switch to of_platform_default_populate

2016-10-24 Thread Scott Wood
On 10/23/2016 06:48 PM, Andy Fleming wrote:
> Many of the embedded powerpc boards use an array of device names
> to register the devices from the device tree. Instead, we can use
> of_platform_default_populate(), which will iterate through all the root nodes 
> and register them.
> 
> Signed-off-by: Andy Fleming 
> ---
> This is necessary to enable the gpio power/reset pins on the Cyrus board.
> 
> I have confirmed that the PCI devices are registered and that the power/reset
> drivers are registered (and functional). I also tested USB, and saw the
> onboard MMC card report in, so some of the SoC devices are definitely there.
> 
> If this doesn't break the other corenet boards, I'd be happy to make the
> same change for the other powerpc boards.
> 
>  arch/powerpc/platforms/85xx/corenet_generic.c | 49 
> +--
>  1 file changed, 1 insertion(+), 48 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/85xx/corenet_generic.c 
> b/arch/powerpc/platforms/85xx/corenet_generic.c
> index 1179115..6347f20 100644
> --- a/arch/powerpc/platforms/85xx/corenet_generic.c
> +++ b/arch/powerpc/platforms/85xx/corenet_generic.c
> @@ -83,56 +83,9 @@ void __init corenet_gen_setup_arch(void)
>   mpc85xx_qe_init();
>  }
>  
> -static const struct of_device_id of_device_ids[] = {
> - {
> - .compatible = "simple-bus"
> - },
> - {
> - .compatible = "mdio-mux-gpio"
> - },
> - {
> - .compatible = "fsl,fpga-ngpixis"
> - },
> - {
> - .compatible = "fsl,fpga-qixis"
> - },
> - {
> - .compatible = "fsl,srio",
> - },
> - {
> - .compatible = "fsl,p4080-pcie",
> - },
> - {
> - .compatible = "fsl,qoriq-pcie-v2.2",
> - },
> - {
> - .compatible = "fsl,qoriq-pcie-v2.3",
> - },
> - {
> - .compatible = "fsl,qoriq-pcie-v2.4",
> - },
> - {
> - .compatible = "fsl,qoriq-pcie-v3.0",
> - },
> - {
> - .compatible = "fsl,qe",
> - },
> - {
> - .compatible= "fsl,fman",
> - },
> - /* The following two are for the Freescale hypervisor */
> - {
> - .name   = "hypervisor",
> - },
> - {
> - .name   = "handles",
> - },
> - {}
> -};
> -
>  int __init corenet_gen_publish_devices(void)
>  {
> - return of_platform_bus_probe(NULL, of_device_ids, NULL);
> + return of_platform_default_populate(NULL, NULL, NULL);
>  }

I think we still need to specify some of these nodes.  Switching from
probe to populate will take care of probing device nodes under root
without listing each one, but it won't treat those nodes as buses unless
they have "simple-bus" or otherwise match the probe list.

-Scott
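A hypothetical device-tree fragment illustrating the point above: of_platform_default_populate() registers the root's children, but only descends into nodes that match the default bus table (e.g. "simple-bus"); a node compatible only with, say, "fsl,qoriq-pcie-v2.2" would not have its own children populated without an explicit match entry. Node names and addresses here are made up.

```dts
soc@ffe000000 {
	compatible = "simple-bus";	/* matched: children get populated */

	pcie@ffe200000 {
		compatible = "fsl,qoriq-pcie-v2.2";	/* not a default bus */

		/* Without a match entry treating the PCIe node as a bus,
		 * this child is never scanned by the populate call. */
		child@0 { };
	};
};
```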



[net-next PATCH RFC 17/26] arch/powerpc: Add option to skip DMA sync as a part of mapping

2016-10-24 Thread Alexander Duyck
This change allows us to pass DMA_ATTR_SKIP_CPU_SYNC which allows us to
avoid invoking cache line invalidation if the driver will just handle it
via a sync_for_cpu or sync_for_device call.
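The attribute's effect can be modeled with a tiny userspace sketch (the flag value and function names here are illustrative, not the kernel definitions): the map path performs the cache maintenance only when the caller has not asked to skip it.

```c
#include <assert.h>

#define DMA_ATTR_SKIP_CPU_SYNC_MODEL (1UL << 5)	/* illustrative value */

static int sync_calls;

static void dma_sync_model(void)
{
	sync_calls++;	/* stands in for __dma_sync_page() */
}

/* Model of the patched map path: sync unless the caller opted out. */
static void map_page_model(unsigned long attrs)
{
	if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC_MODEL))
		dma_sync_model();
}
```

A driver passing the skip attribute takes on the obligation to call the sync_for_cpu/sync_for_device equivalents itself before touching the buffer.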

Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Alexander Duyck 
---
 arch/powerpc/kernel/dma.c |9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/dma.c b/arch/powerpc/kernel/dma.c
index e64a601..6877e3f 100644
--- a/arch/powerpc/kernel/dma.c
+++ b/arch/powerpc/kernel/dma.c
@@ -203,6 +203,10 @@ static int dma_direct_map_sg(struct device *dev, struct 
scatterlist *sgl,
for_each_sg(sgl, sg, nents, i) {
sg->dma_address = sg_phys(sg) + get_dma_offset(dev);
sg->dma_length = sg->length;
+
+   if (attrs & DMA_ATTR_SKIP_CPU_SYNC)
+   continue;
+
__dma_sync_page(sg_page(sg), sg->offset, sg->length, direction);
}
 
@@ -235,7 +239,10 @@ static inline dma_addr_t dma_direct_map_page(struct device 
*dev,
 unsigned long attrs)
 {
BUG_ON(dir == DMA_NONE);
-   __dma_sync_page(page, offset, size, dir);
+
+   if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
+   __dma_sync_page(page, offset, size, dir);
+
return page_to_phys(page) + offset + get_dma_offset(dev);
 }
 



Cyrus_5040.dts Missing

2016-10-24 Thread luigi burdo
Hi Andy,

Some others and I are testing the Cyrus_5040, and we saw that the Linux
kernel tree doesn't include cyrus_5040.dts.

I attached my dmesg and lshw reports, where you can see:

In dmesg there are issues with the ata2 device: if I burn a CD-ROM the
burn fails, and copying from one partition to another sometimes becomes
really slow, around 300 KB/s (SATA1 SSD to SATA1 SSD partition).

If more than 3.5 GB of RAM is used (right now I have 16 GB), we see DMA
errors on PCI boards.

lshw doesn't show the L3 cache; I see this on the Cyrus_5020 too. Is
this normal?

If anything more is needed, feel free to ask.

Thanks

Luigi


 dmesg
[0.00] Allocated 16384 bytes for 8 pacas at c0003fffc000
[0.00] MMU: Supported page sizes
[0.00]  4 KB as direct
[0.00]   4096 KB as direct
[0.00]  16384 KB as direct
[0.00]  65536 KB as direct
[0.00] 262144 KB as direct
[0.00]1048576 KB as direct
[0.00] MMU: Book3E HW tablewalk not supported
[0.00] Linux version 4.8.0-X5000-Tlosm-LB (amigaone@Amigaone) (gcc 
version 6.2.0 20160927 (Ubuntu 6.2.0-5ubuntu11) ) #1 SMP PREEMPT Mon Oct 3 
19:10:49 CEST 2016
[0.00] Using CoreNet Generic machine description
[0.00] Found legacy serial port 0 for /soc@ffe00/serial@11c500
[0.00]   mem=ffe11c500, taddr=ffe11c500, irq=0, clk=4, speed=0
[0.00] Found legacy serial port 1 for /soc@ffe00/serial@11c600
[0.00]   mem=ffe11c600, taddr=ffe11c600, irq=0, clk=4, speed=0
[0.00] Found legacy serial port 2 for /soc@ffe00/serial@11d500
[0.00]   mem=ffe11d500, taddr=ffe11d500, irq=0, clk=4, speed=0
[0.00] Found legacy serial port 3 for /soc@ffe00/serial@11d600
[0.00]   mem=ffe11d600, taddr=ffe11d600, irq=0, clk=4, speed=0
[0.00] CPU maps initialized for 1 thread per core
[0.00]  (thread shift is 0)
[0.00] Freed 8192 bytes for unused pacas
[0.00] -
[0.00] phys_mem_size = 0xdac0
[0.00] dcache_bsize  = 0x40
[0.00] icache_bsize  = 0x40
[0.00] cpu_features  = 0x00180400581802c0
[0.00]   possible= 0x00180480581802c0
[0.00]   always  = 0x00180400581802c0
[0.00] cpu_user_features = 0xcc008000 0x0800
[0.00] mmu_features  = 0x000a0010
[0.00] firmware_features = 0x
[0.00] -
[0.00] CoreNet Generic board
[0.00] Top of RAM: 0xdac0, Total RAM: 0xdac0
[0.00] Memory hole size: 0MB
[0.00] Zone ranges:
[0.00]   DMA  [mem 0x-0x7fff]
[0.00]   DMA32empty
[0.00]   Normal   [mem 0x8000-0xdabf]
[0.00] Movable zone start for each node
[0.00] Early memory node ranges
[0.00]   node   0: [mem 0x-0xdabf]
[0.00] Initmem setup node 0 [mem 0x-0xdabf]
[0.00] On node 0 totalpages: 896000
[0.00] free_area_init_node: node 0, pgdat c10b75c0, 
node_mem_map c000d7b94000
[0.00]   DMA zone: 7168 pages used for memmap
[0.00]   DMA zone: 0 pages reserved
[0.00]   DMA zone: 524288 pages, LIFO batch:31
[0.00]   Normal zone: 5082 pages used for memmap
[0.00]   Normal zone: 371712 pages, LIFO batch:31
[0.00] MMU: Allocated 2112 bytes of context maps for 255 contexts
[0.00] percpu: Embedded 19 pages/cpu @c000d7a0 s37096 r0 d40728 
u262144
[0.00] pcpu-alloc: s37096 r0 d40728 u262144 alloc=1*1048576
[0.00] pcpu-alloc: [0] 0 1 2 3 
[0.00] Built 1 zonelists in Zone order, mobility grouping on.  Total 
pages: 883750
[0.00] Kernel command line: root=/dev/sda2 mem=3500M radeon.pcie_gen2=1 
pci=realloc
[0.00] PID hash table entries: 4096 (order: 3, 32768 bytes)
[0.00] Dentry cache hash table entries: 524288 (order: 10, 4194304 
bytes)
[0.00] Inode-cache hash table entries: 262144 (order: 9, 2097152 bytes)
[0.00] Sorting __ex_table...
[0.00] Memory: 3441728K/3584000K available (10660K kernel code, 1856K 
rwdata, 5556K rodata, 384K init, 1458K bss, 142272K reserved, 0K cma-reserved)
[0.00] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=4, Nodes=1
[0.00] Preemptible hierarchical RCU implementation.
[0.00]  Build-time adjustment of leaf fanout to 64.
[0.00]  RCU restricting CPUs from NR_CPUS=8 to nr_cpu_ids=4.
[0.00] RCU: Adjusting geometry for rcu_fanout_leaf=64, nr_cpu_ids=4
[0.00] NR_IRQS:512 nr_irqs:512 16
[0.00] mpic: Setting up MPIC " OpenPIC  " version 1.2 at ffe04, max 
4 CPUs
[0.00] mpic: ISU size: 512, shift: 9, mask: 1ff
[0.00] 

Re: [PATCH V6 7/8] powerpc: Check arch.vec earlier during boot for memory features

2016-10-24 Thread Nathan Fontenot
On 09/21/2016 09:17 AM, Michael Bringmann wrote:
> architecture.vec5 features: The boot-time memory management needs to
> know the form of the "ibm,dynamic-memory-v2" property early during
> scanning of the flattened device tree.  This patch moves execution of
> the function pseries_probe_fw_features() early enough to be before
> the scanning of the memory properties in the device tree to allow
> recognition of the supported properties.
> 
> [V2: No change]
> [V3: Updated after commit 3808a88985b4f5f5e947c364debce4441a380fb8.]
> [V4: Update comments]
> [V5: Resynchronize/resubmit]
> [V6: Resync to v4.7 kernel code]
> 
> Signed-off-by: Michael Bringmann 
> ---
> diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
> index 946e34f..2034edc 100644
> --- a/arch/powerpc/kernel/prom.c
> +++ b/arch/powerpc/kernel/prom.c
> @@ -753,6 +753,9 @@ void __init early_init_devtree(void *params)
>*/
>   of_scan_flat_dt(early_init_dt_scan_chosen_ppc, boot_command_line);
> 
> + /* Now try to figure out if we are running on LPAR and so on */
> + pseries_probe_fw_features();
> +

I'll have to defer to others on whether calling this earlier in boot
is ok.

I do notice that you do not remove the call later on. Any reason?

-Nathan

>   /* Scan memory nodes and rebuild MEMBLOCKs */
>   of_scan_flat_dt(early_init_dt_scan_root, NULL);
>   of_scan_flat_dt(early_init_dt_scan_memory_ppc, NULL);
> 



[PATCH v5 1/3] powerpc/mm: allow memory hotplug into a memoryless node

2016-10-24 Thread Reza Arbab
Remove the check which prevents us from hotplugging into an empty node.

The original commit b226e4621245 ("[PATCH] powerpc: don't add memory to
empty node/zone"), states that this was intended to be a temporary measure.
It is a workaround for an oops which no longer occurs.

Signed-off-by: Reza Arbab 
Reviewed-by: Aneesh Kumar K.V 
Acked-by: Balbir Singh 
Cc: Nathan Fontenot 
Cc: Bharata B Rao 
---
 arch/powerpc/mm/numa.c | 13 +
 1 file changed, 1 insertion(+), 12 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index a51c188..0cb6bd8 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -1085,7 +1085,7 @@ static int hot_add_node_scn_to_nid(unsigned long scn_addr)
 int hot_add_scn_to_nid(unsigned long scn_addr)
 {
struct device_node *memory = NULL;
-   int nid, found = 0;
+   int nid;
 
if (!numa_enabled || (min_common_depth < 0))
return first_online_node;
@@ -1101,17 +1101,6 @@ int hot_add_scn_to_nid(unsigned long scn_addr)
if (nid < 0 || !node_online(nid))
nid = first_online_node;
 
-   if (NODE_DATA(nid)->node_spanned_pages)
-   return nid;
-
-   for_each_online_node(nid) {
-   if (NODE_DATA(nid)->node_spanned_pages) {
-   found = 1;
-   break;
-   }
-   }
-
-   BUG_ON(!found);
return nid;
 }
 
-- 
1.8.3.1



[PATCH v5 3/3] mm: enable CONFIG_MOVABLE_NODE on non-x86 arches

2016-10-24 Thread Reza Arbab
To support movable memory nodes (CONFIG_MOVABLE_NODE), at least one of
the following must be true:

1. We're on x86. This arch has the capability to identify movable nodes
   at boot by parsing the ACPI SRAT, if the movable_node option is used.

2. Our config supports memory hotplug, which means that a movable node
   can be created by hotplugging all of its memory into ZONE_MOVABLE.

Fix the Kconfig definition of CONFIG_MOVABLE_NODE, which currently
recognizes (1), but not (2).

Signed-off-by: Reza Arbab 
Reviewed-by: Aneesh Kumar K.V 
Acked-by: Balbir Singh 
---
 mm/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/Kconfig b/mm/Kconfig
index be0ee11..5d0818f 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -153,7 +153,7 @@ config MOVABLE_NODE
bool "Enable to assign a node which has only movable memory"
depends on HAVE_MEMBLOCK
depends on NO_BOOTMEM
-   depends on X86_64
+   depends on X86_64 || MEMORY_HOTPLUG
depends on NUMA
default n
help
-- 
1.8.3.1



[PATCH v5 0/3] powerpc/mm: movable hotplug memory nodes

2016-10-24 Thread Reza Arbab
These changes enable the dynamic creation of movable nodes on power.

On x86, the ACPI SRAT memory affinity structure can mark memory
hotpluggable, allowing the kernel to possibly create movable nodes at
boot.

While power has no analog of this SRAT information, we can still create
a movable memory node, post boot, by hotplugging all of the node's
memory into ZONE_MOVABLE.

In v1, this patchset introduced a new dt compatible id to explicitly
create a memoryless node at boot. Here, things have been simplified to
be applicable regardless of the status of node hotplug on power. We
still intend to enable hotadding a pgdat, but that's now untangled as a
separate topic.

v5:
* Drop the patches which recognize the "status" property of dt memory
  nodes. Firmware can set the size of "linux,usable-memory" to zero instead.

v4:
* 
http://lkml.kernel.org/r/1475778995-1420-1-git-send-email-ar...@linux.vnet.ibm.com

* Rename of_fdt_is_available() to of_fdt_device_is_available().
  Rename of_flat_dt_is_available() to of_flat_dt_device_is_available().

* Instead of restoring top-down allocation, ensure it never goes
  bottom-up in the first place, by making movable_node arch-specific.

* Use MEMORY_HOTPLUG instead of PPC64 in the mm/Kconfig patch.

v3:
* 
http://lkml.kernel.org/r/1474828616-16608-1-git-send-email-ar...@linux.vnet.ibm.com

* Use Rob Herring's suggestions to improve the node availability check.

* More verbose commit log in the patch enabling CONFIG_MOVABLE_NODE.

* Add a patch to restore top-down allocation the way x86 does.

v2:
* 
http://lkml.kernel.org/r/1473883618-14998-1-git-send-email-ar...@linux.vnet.ibm.com

* Use the "status" property of standard dt memory nodes instead of
  introducing a new "ibm,hotplug-aperture" compatible id.

* Remove the patch which explicitly creates a memoryless node. This set
  no longer has any bearing on whether the pgdat is created at boot or
  at the time of memory addition.

v1:
* 
http://lkml.kernel.org/r/1470680843-28702-1-git-send-email-ar...@linux.vnet.ibm.com

Reza Arbab (3):
  powerpc/mm: allow memory hotplug into a memoryless node
  mm: make processing of movable_node arch-specific
  mm: enable CONFIG_MOVABLE_NODE on non-x86 arches

 arch/powerpc/mm/numa.c | 13 +
 arch/x86/mm/numa.c | 35 ++-
 mm/Kconfig |  2 +-
 mm/memory_hotplug.c| 31 ---
 4 files changed, 36 insertions(+), 45 deletions(-)

-- 
1.8.3.1



[PATCH v5 2/3] mm: make processing of movable_node arch-specific

2016-10-24 Thread Reza Arbab
Currently, CONFIG_MOVABLE_NODE depends on X86_64. In preparation to
enable it for other arches, we need to factor a detail which is unique
to x86 out of the generic mm code.

Specifically, as documented in kernel-parameters.txt, the use of
"movable_node" should remain restricted to x86:

movable_node[KNL,X86] Boot-time switch to enable the effects
of CONFIG_MOVABLE_NODE=y. See mm/Kconfig for details.

This option tells x86 to find movable nodes identified by the ACPI SRAT.
On other arches, it would have no benefit, only the undesired side
effect of setting bottom-up memblock allocation.

Since #ifdef CONFIG_MOVABLE_NODE will no longer be enough to restrict
this option to x86, move it to an arch-specific compilation unit
instead.

Signed-off-by: Reza Arbab 
Reviewed-by: Aneesh Kumar K.V 
Acked-by: Balbir Singh 
---
 arch/x86/mm/numa.c  | 35 ++-
 mm/memory_hotplug.c | 31 ---
 2 files changed, 34 insertions(+), 32 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 3f35b48..37584ba 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -886,6 +886,38 @@ const struct cpumask *cpumask_of_node(int node)
 #endif /* !CONFIG_DEBUG_PER_CPU_MAPS */
 
 #ifdef CONFIG_MEMORY_HOTPLUG
+
+static int __init cmdline_parse_movable_node(char *p)
+{
+#ifdef CONFIG_MOVABLE_NODE
+   /*
+* Memory used by the kernel cannot be hot-removed because Linux
+* cannot migrate the kernel pages. When memory hotplug is
+* enabled, we should prevent memblock from allocating memory
+* for the kernel.
+*
+* ACPI SRAT records all hotpluggable memory ranges. But before
+* SRAT is parsed, we don't know about it.
+*
+* The kernel image is loaded into memory at very early time. We
+* cannot prevent this anyway. So on NUMA system, we set any
+* node the kernel resides in as un-hotpluggable.
+*
+* Since on modern servers, one node could have double-digit
+* gigabytes memory, we can assume the memory around the kernel
+* image is also un-hotpluggable. So before SRAT is parsed, just
+* allocate memory near the kernel image to try the best to keep
+* the kernel away from hotpluggable memory.
+*/
+   memblock_set_bottom_up(true);
+   movable_node_enabled = true;
+#else
+   pr_warn("movable_node option not supported\n");
+#endif
+   return 0;
+}
+early_param("movable_node", cmdline_parse_movable_node);
+
 int memory_add_physaddr_to_nid(u64 start)
 {
 struct numa_meminfo *mi = &numa_meminfo;
@@ -898,4 +930,5 @@ int memory_add_physaddr_to_nid(u64 start)
return nid;
 }
 EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
-#endif
+
+#endif /* CONFIG_MEMORY_HOTPLUG */
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 9629273..9931e7e 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1738,37 +1738,6 @@ static bool can_offline_normal(struct zone *zone, 
unsigned long nr_pages)
 }
 #endif /* CONFIG_MOVABLE_NODE */
 
-static int __init cmdline_parse_movable_node(char *p)
-{
-#ifdef CONFIG_MOVABLE_NODE
-   /*
-* Memory used by the kernel cannot be hot-removed because Linux
-* cannot migrate the kernel pages. When memory hotplug is
-* enabled, we should prevent memblock from allocating memory
-* for the kernel.
-*
-* ACPI SRAT records all hotpluggable memory ranges. But before
-* SRAT is parsed, we don't know about it.
-*
-* The kernel image is loaded into memory at very early time. We
-* cannot prevent this anyway. So on NUMA system, we set any
-* node the kernel resides in as un-hotpluggable.
-*
-* Since on modern servers, one node could have double-digit
-* gigabytes memory, we can assume the memory around the kernel
-* image is also un-hotpluggable. So before SRAT is parsed, just
-* allocate memory near the kernel image to try the best to keep
-* the kernel away from hotpluggable memory.
-*/
-   memblock_set_bottom_up(true);
-   movable_node_enabled = true;
-#else
-   pr_warn("movable_node option not supported\n");
-#endif
-   return 0;
-}
-early_param("movable_node", cmdline_parse_movable_node);
-
 /* check which state of node_states will be changed when offline memory */
 static void node_states_check_changes_offline(unsigned long nr_pages,
struct zone *zone, struct memory_notify *arg)
-- 
1.8.3.1



Re: [PATCH V6 5/8] pseries/drc-info: Search new DRC properties for CPU indexes

2016-10-24 Thread Nathan Fontenot
On 09/21/2016 09:17 AM, Michael Bringmann wrote:
> pseries/drc-info: Provide parallel routines to convert between
> drc_index and CPU numbers at runtime, using the older device-tree
> properties ("ibm,drc-indexes", "ibm,drc-names", "ibm,drc-types"
> and "ibm,drc-power-domains"), or the new property "ibm,drc-info".
> 
> [V2: Revise contant names.]
> [V3: No change.]
> [V4: No change.]
> [V5: Resynchronize/resubmit]
> [V6: No change]
> 
> Signed-off-by: Michael Bringmann 
> ---
> diff --git a/arch/powerpc/platforms/pseries/pseries_energy.c 
> b/arch/powerpc/platforms/pseries/pseries_energy.c
> index 9276779..10c4200 100644
> --- a/arch/powerpc/platforms/pseries/pseries_energy.c
> +++ b/arch/powerpc/platforms/pseries/pseries_energy.c
> @@ -35,10 +35,68 @@ static int sysfs_entries;
> 
>  /* Helper Routines to convert between drc_index to cpu numbers */
> 
> +void read_one_drc_info(int **info, char **dtype, char **dname,
> + unsigned long int *fdi_p, unsigned long int *nsl_p,
> + unsigned long int *si_p, unsigned long int *ldi_p)
> +{
> + char *drc_type, *drc_name, *pc;
> + u32 fdi, nsl, si, ldi;
> +
> + fdi = nsl = si = ldi = 0;
> +
> + /* Get drc-type:encode-string */
> + pc = (char *)info;
> + drc_type = pc;
> + pc += (strlen(drc_type) + 1);
> +
> + /* Get drc-name-prefix:encode-string */
> + drc_name = (char *)pc;
> + pc += (strlen(drc_name) + 1);
> +
> + /* Get drc-index-start:encode-int */
> + memcpy(, pc, 4);

This may be a little nit, but could we get the shortened names for the
elements to match up with the comments, i.e. make the comment state
'Get first drc-index' to match fdi.

> + fdi = be32_to_cpu(fdi);
> + pc += 4;
> +
> + /* Get/skip drc-name-suffix-start:encode-int */
> + pc += 4;
> +
> + /* Get number-sequential-elements:encode-int */
> + memcpy(, pc, 4);
> + nsl = be32_to_cpu(nsl);
> + pc += 4;
> +
> + /* Get sequential-increment:encode-int */
> + memcpy(, pc, 4);
> + si = be32_to_cpu(si);
> + pc += 4;
> +
> + /* Get/skip drc-power-domain:encode-int */
> + pc += 4;
> +
> + /* Should now know end of current entry */
> + ldi = fdi + ((nsl-1)*si);
> +
> + (*info) = (int *)pc;
> +
> + if (dtype)
> + *dtype = drc_type;
> + if (dname)
> + *dname = drc_name;
> + if (fdi_p)
> + *fdi_p = fdi;
> + if (nsl_p)
> + *nsl_p = nsl;
> + if (si_p)
> + *si_p = si;
> + if (ldi_p)
> + *ldi_p = ldi;
> +}
> +EXPORT_SYMBOL(read_one_drc_info);
> +
>  static u32 cpu_to_drc_index(int cpu)
>  {
>   struct device_node *dn = NULL;
> - const int *indexes;
>   int i;
>   int rc = 1;
>   u32 ret = 0;
> @@ -46,18 +104,54 @@ static u32 cpu_to_drc_index(int cpu)
>   dn = of_find_node_by_path("/cpus");
>   if (dn == NULL)
>   goto err;
> - indexes = of_get_property(dn, "ibm,drc-indexes", NULL);
> - if (indexes == NULL)
> - goto err_of_node_put;
> +
>   /* Convert logical cpu number to core number */
>   i = cpu_core_index_of_thread(cpu);

Given the changes you've made to this routine, I think we should rename
'i' to core_index (or some such name). I had to read through a couple of
times to see where 'i' is used.

> - /*
> -  * The first element indexes[0] is the number of drc_indexes
> -  * returned in the list.  Hence i+1 will get the drc_index
> -  * corresponding to core number i.
> -  */
> - WARN_ON(i > indexes[0]);
> - ret = indexes[i + 1];
> +
> + if (firmware_has_feature(FW_FEATURE_DRC_INFO)) {
> + int *info = (int *)4;

Why set info to 4? This doesn't appear necessary.

> + unsigned long int num_set_entries, j, iw = i, fdi = 0;
> + unsigned long int ldi = 0, nsl = 0, si = 0;
> + char *dtype;
> + char *dname;
> +
> + info = (int *)of_get_property(dn, "ibm,drc-info", NULL);
> + if (info == NULL)
> + goto err_of_node_put;
> +
> + num_set_entries = be32_to_cpu(*info++);
> +
> + for (j = 0; j < num_set_entries; j++) {
> +
> + read_one_drc_info(&info, &dtype, &dname, &fdi,
> + &nsl, &si, &ldi);
> + if (strcmp(dtype, "CPU"))
> + goto err;
> +
> + if (iw < ldi)
> + break;
> +
> + WARN_ON(((iw-fdi)%si) != 0);
> + }
> + WARN_ON((nsl == 0) | (si == 0));
> +
> + ret = ldi + (iw*si);
> + } else {
> + const int *indexes;
> +
> + indexes = of_get_property(dn, "ibm,drc-indexes", NULL);
> + if (indexes == NULL)
> + goto err_of_node_put;
> +
> + /*
> +  * The first element indexes[0] is 

Applied "ASoC: constify snd_soc_ops structures" to the asoc tree

2016-10-24 Thread Mark Brown
The patch

   ASoC: constify snd_soc_ops structures

has been applied to the asoc tree at

   git://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound.git 

All being well this means that it will be integrated into the linux-next
tree (usually sometime in the next 24 hours) and sent to Linus during
the next merge window (or sooner if it is a bug fix), however if
problems are discovered then the patch may be dropped or reverted.  

You may get further e-mails resulting from automated or manual testing
and review of the tree, please engage with people reporting problems and
send followup patches addressing any issues that are reported if needed.

If any updates are required or you are submitting further changes they
should be sent as incremental updates against current git, existing
patches will not be replaced.

Please add any relevant lists and maintainers to the CCs when replying
to this mail.

Thanks,
Mark

From ddba7fa4cfd16b10adcbe24fa1d3b7de470af863 Mon Sep 17 00:00:00 2001
From: Julia Lawall 
Date: Sat, 15 Oct 2016 16:55:50 +0200
Subject: [PATCH] ASoC: constify snd_soc_ops structures

Check for snd_soc_ops structures that are only stored in the ops field of a
snd_soc_dai_link structure.  This field is declared const, so snd_soc_ops
structures that have this property can be declared as const also.

The semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)

// <smpl>
@r disable optional_qualifier@
identifier i;
position p;
@@
static struct snd_soc_ops i@p = { ... };

@ok1@
identifier r.i;
struct snd_soc_dai_link e;
position p;
@@
e.ops = &i@p;

@ok2@
identifier r.i, e;
position p;
@@
struct snd_soc_dai_link e[] = { ..., { .ops = &i@p, }, ..., };

@bad@
position p != {r.p,ok1.p,ok2.p};
identifier r.i;
struct snd_soc_ops e;
@@
e@i@p

@depends on !bad disable optional_qualifier@
identifier r.i;
@@
static
+const
 struct snd_soc_ops i = { ... };
// </smpl>

The effect on the layout of the .o files is shown by the following output of
the size command, first before then after the transformation:

   text    data     bss     dec     hex filename
   8748    1024       0    9772    262c sound/soc/fsl/fsl-asoc-card.o
   8812     952       0    9764    2624 sound/soc/fsl/fsl-asoc-card.o

   text    data     bss     dec     hex filename
   4165     264       8    4437    1155 sound/soc/fsl/imx-wm8962.o
   4229     200       8    4437    1155 sound/soc/fsl/imx-wm8962.o

Signed-off-by: Julia Lawall 
Signed-off-by: Mark Brown 
---
 sound/soc/fsl/fsl-asoc-card.c | 2 +-
 sound/soc/fsl/imx-wm8962.c| 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/sound/soc/fsl/fsl-asoc-card.c b/sound/soc/fsl/fsl-asoc-card.c
index dffd549a0e2a..9998aea23597 100644
--- a/sound/soc/fsl/fsl-asoc-card.c
+++ b/sound/soc/fsl/fsl-asoc-card.c
@@ -183,7 +183,7 @@ static int fsl_asoc_card_hw_params(struct snd_pcm_substream 
*substream,
return 0;
 }
 
-static struct snd_soc_ops fsl_asoc_card_ops = {
+static const struct snd_soc_ops fsl_asoc_card_ops = {
.hw_params = fsl_asoc_card_hw_params,
 };
 
diff --git a/sound/soc/fsl/imx-wm8962.c b/sound/soc/fsl/imx-wm8962.c
index 201a70d1027a..1b60958e2080 100644
--- a/sound/soc/fsl/imx-wm8962.c
+++ b/sound/soc/fsl/imx-wm8962.c
@@ -61,7 +61,7 @@ static int imx_hifi_hw_params(struct snd_pcm_substream 
*substream,
return 0;
 }
 
-static struct snd_soc_ops imx_hifi_ops = {
+static const struct snd_soc_ops imx_hifi_ops = {
.hw_params = imx_hifi_hw_params,
 };
 
-- 
2.8.1



Applied "ASoC: davinci-mcbsp: DT fix s/interrupts-names/interrupt-names/" to the asoc tree

2016-10-24 Thread Mark Brown
The patch

   ASoC: davinci-mcbsp: DT fix s/interrupts-names/interrupt-names/

has been applied to the asoc tree at

   git://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound.git 

All being well this means that it will be integrated into the linux-next
tree (usually sometime in the next 24 hours) and sent to Linus during
the next merge window (or sooner if it is a bug fix), however if
problems are discovered then the patch may be dropped or reverted.  

You may get further e-mails resulting from automated or manual testing
and review of the tree, please engage with people reporting problems and
send followup patches addressing any issues that are reported if needed.

If any updates are required or you are submitting further changes they
should be sent as incremental updates against current git, existing
patches will not be replaced.

Please add any relevant lists and maintainers to the CCs when replying
to this mail.

Thanks,
Mark

From a545f5d859c7988ab61850395a4565bfe507dc0a Mon Sep 17 00:00:00 2001
From: Geert Uytterhoeven 
Date: Fri, 21 Oct 2016 10:51:15 +0200
Subject: [PATCH] ASoC: davinci-mcbsp: DT fix
 s/interrupts-names/interrupt-names/

Signed-off-by: Geert Uytterhoeven 
Acked-by: Rob Herring 
Signed-off-by: Mark Brown 
---
 Documentation/devicetree/bindings/sound/davinci-mcbsp.txt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Documentation/devicetree/bindings/sound/davinci-mcbsp.txt 
b/Documentation/devicetree/bindings/sound/davinci-mcbsp.txt
index 55b53e1fd72c..e0b6165c9cfc 100644
--- a/Documentation/devicetree/bindings/sound/davinci-mcbsp.txt
+++ b/Documentation/devicetree/bindings/sound/davinci-mcbsp.txt
@@ -43,7 +43,7 @@ mcbsp0: mcbsp@1d1 {
<0x0031 0x1000>;
reg-names = "mpu", "dat";
interrupts = <97 98>;
-   interrupts-names = "rx", "tx";
+   interrupt-names = "rx", "tx";
dmas = < 3 1
 2 1>;
dma-names = "tx", "rx";
-- 
2.8.1



[PATCH v2] powerpc/fadump: Fix the race in crash_fadump().

2016-10-24 Thread Mahesh J Salgaonkar
From: Mahesh Salgaonkar 

There are chances that multiple CPUs can call crash_fadump() simultaneously
and would start duplicating same info to vmcoreinfo ELF note section. This
causes makedumpfile to fail during kdump capture. One example is,
triggering dumprestart from HMC which sends system reset to all the CPUs at
once.

makedumpfile --dump-dmesg /proc/vmcore
read_vmcoreinfo_basic_info: Invalid data in /tmp/vmcoreinfoyjgxlL: 
CRASHTIME=1475605971CRASHTIME=1475605971CRASHTIME=1475605971CRASHTIME=1475605971CRASHTIME=1475605971CRASHTIME=1475605971CRASHTIME=1475605971CRASHTIME=1475605971
makedumpfile Failed.
Running makedumpfile --dump-dmesg /proc/vmcore failed (1).

makedumpfile  -d 31 -l /proc/vmcore
read_vmcoreinfo_basic_info: Invalid data in /tmp/vmcoreinfo1mmVdO: 
CRASHTIME=1475605971CRASHTIME=1475605971CRASHTIME=1475605971CRASHTIME=1475605971CRASHTIME=1475605971CRASHTIME=1475605971CRASHTIME=1475605971CRASHTIME=1475605971
makedumpfile Failed.
Running makedumpfile  -d 31 -l /proc/vmcore failed (1).

Signed-off-by: Mahesh Salgaonkar 
---
Changes in V2:
- Use cmpxchg instead of mutex lock during panic path.
---
 arch/powerpc/kernel/fadump.c |   25 -
 1 file changed, 24 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index b3a6633..a795956 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -401,12 +401,35 @@ static void register_fw_dump(struct fadump_mem_struct 
*fdm)
 void crash_fadump(struct pt_regs *regs, const char *str)
 {
struct fadump_crash_info_header *fdh = NULL;
+   int old_cpu, this_cpu;
 
if (!fw_dump.dump_registered || !fw_dump.fadumphdr_addr)
return;
 
+   /*
+* old_cpu == -1 means this is the first CPU which has come here,
+* go ahead and trigger fadump.
+*
+* old_cpu != -1 means some other CPU has already on it's way
+* to trigger fadump, just keep looping here.
+*/
+   this_cpu = smp_processor_id();
+   old_cpu = cmpxchg(&crashing_cpu, -1, this_cpu);
+
+   if (old_cpu != -1) {
+   /*
+* We can't loop here indefinitely. Wait as long as fadump
+* is in force. If we race with fadump un-registration this
+* loop will break and then we go down to normal panic path
+* and reboot. If fadump is in force the first crashing
+* cpu will definitely trigger fadump.
+*/
+   while (fw_dump.dump_registered)
+   cpu_relax();
+   return;
+   }
+
fdh = __va(fw_dump.fadumphdr_addr);
-   crashing_cpu = smp_processor_id();
fdh->crashing_cpu = crashing_cpu;
crash_save_vmcoreinfo();
 



Re: [PATCH v4 2/5] drivers/of: do not add memory for unavailable nodes

2016-10-24 Thread Reza Arbab

On Mon, Oct 24, 2016 at 09:24:04PM +1100, Michael Ellerman wrote:
> The code already looks for "linux,usable-memory" in preference to
> "reg". Can you use that instead?

Yes, we could set the size of "linux,usable-memory" to zero instead of 
setting status to "disabled".

I'll send a v5 of this set which drops 1/5 and 2/5. That would be the 
only difference here.

> That would have the advantage that existing kernels already understand
> it.
>
> Another problem with using "status" is we could have device trees out
> there that have status = disabled and we don't know about it, and by
> changing the kernel to use that property we break people's systems.
> Though for memory nodes my guess is that's not true, but you never know ...


--
Reza Arbab



Re: [PATCH v4 5/5] x86, kvm: support vcpu preempted check

2016-10-24 Thread Paolo Bonzini


On 24/10/2016 17:14, Radim Krčmář wrote:
> 2016-10-24 16:39+0200, Paolo Bonzini:
>> On 19/10/2016 19:24, Radim Krčmář wrote:
> + if (vcpu->arch.st.msr_val & KVM_MSR_ENABLED)
> + if (kvm_read_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
> + &vcpu->arch.st.steal,
> + sizeof(struct kvm_steal_time)) == 0) {
> + vcpu->arch.st.steal.preempted = 1;
> + kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
> + &vcpu->arch.st.steal,
> + sizeof(struct kvm_steal_time));
> + }
>>> Please name this block of code.  Something like
>>>   kvm_steal_time_set_preempted(vcpu);
>>
>> While at it:
>>
>> 1) the kvm_read_guest_cached is not necessary.  You can rig the call to
>> kvm_write_guest_cached so that it only writes vcpu->arch.st.steal.preempted.
> 
> I agree.  kvm_write_guest_cached() always writes from offset 0, so we'd
> want a new function that allows to specify a starting offset.

Yeah, let's leave it for a follow-up then!

Thanks,

Paolo

> Using cached vcpu->arch.st.steal to avoid the read wouldn't be as good.
> 


Re: [PATCH v4 5/5] x86, kvm: support vcpu preempted check

2016-10-24 Thread Radim Krčmář
2016-10-24 16:39+0200, Paolo Bonzini:
> On 19/10/2016 19:24, Radim Krčmář wrote:
>>> > + if (vcpu->arch.st.msr_val & KVM_MSR_ENABLED)
>>> > + if (kvm_read_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
>>> > + &vcpu->arch.st.steal,
>>> > + sizeof(struct kvm_steal_time)) == 0) {
>>> > + vcpu->arch.st.steal.preempted = 1;
>>> > + kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
>>> > + &vcpu->arch.st.steal,
>>> > + sizeof(struct kvm_steal_time));
>>> > + }
>> Please name this block of code.  Something like
>>   kvm_steal_time_set_preempted(vcpu);
> 
> While at it:
> 
> 1) the kvm_read_guest_cached is not necessary.  You can rig the call to
> kvm_write_guest_cached so that it only writes vcpu->arch.st.steal.preempted.

I agree.  kvm_write_guest_cached() always writes from offset 0, so we'd
want a new function that allows to specify a starting offset.

Using cached vcpu->arch.st.steal to avoid the read wouldn't be as good.


Re: [PATCH v5 9/9] Documentation: virtual: kvm: Support vcpu preempted check

2016-10-24 Thread rkrc...@redhat.com
2016-10-24 16:42+0200, Paolo Bonzini:
> On 21/10/2016 20:39, rkrc...@redhat.com wrote:
>> 2016-10-21 11:27+, David Laight:
>>> From: Pan Xinhui
 Sent: 20 October 2016 22:28
 Commit ("x86, kvm: support vcpu preempted check") add one field "__u8
 preempted" into struct kvm_steal_time. This field tells if one vcpu is
 running or not.

 It is zero if 1) some old KVM deos not support this filed. 2) the vcpu is
 preempted. Other values means the vcpu has been preempted.

 Signed-off-by: Pan Xinhui 
 ---
  Documentation/virtual/kvm/msr.txt | 8 +++-
  1 file changed, 7 insertions(+), 1 deletion(-)

 diff --git a/Documentation/virtual/kvm/msr.txt 
 b/Documentation/virtual/kvm/msr.txt
 index 2a71c8f..3376f13 100644
 --- a/Documentation/virtual/kvm/msr.txt
 +++ b/Documentation/virtual/kvm/msr.txt
 @@ -208,7 +208,8 @@ MSR_KVM_STEAL_TIME: 0x4b564d03
__u64 steal;
__u32 version;
__u32 flags;
 -  __u32 pad[12];
 +  __u8  preempted;
 +  __u32 pad[11];
}
>>>
>>> I think I'd be explicit about the 3 pad bytes you've left.
>> 
>> Seconded.
>> 
>> With that change are all KVM bits
>> 
>> Acked-by: Radim Krčmář 
> 
> Saw this after replying to the previous message.  If you need to post v6
> of the full series, it would be nice if you removed the
> kvm_read_guest_cached.  But anyway it wasn't my intention to override Radim.

The patch was acceptable to me even now, so I definitely wouldn't mind
if it were even nicer. :)


Re: [PATCH v5 9/9] Documentation: virtual: kvm: Support vcpu preempted check

2016-10-24 Thread Paolo Bonzini


On 21/10/2016 20:39, rkrc...@redhat.com wrote:
> 2016-10-21 11:27+, David Laight:
>> From: Pan Xinhui
>>> Sent: 20 October 2016 22:28
>>> Commit ("x86, kvm: support vcpu preempted check") add one field "__u8
>>> preempted" into struct kvm_steal_time. This field tells if one vcpu is
>>> running or not.
>>>
>>> It is zero if 1) some old KVM deos not support this filed. 2) the vcpu is
>>> preempted. Other values means the vcpu has been preempted.
>>>
>>> Signed-off-by: Pan Xinhui 
>>> ---
>>>  Documentation/virtual/kvm/msr.txt | 8 +++-
>>>  1 file changed, 7 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/Documentation/virtual/kvm/msr.txt 
>>> b/Documentation/virtual/kvm/msr.txt
>>> index 2a71c8f..3376f13 100644
>>> --- a/Documentation/virtual/kvm/msr.txt
>>> +++ b/Documentation/virtual/kvm/msr.txt
>>> @@ -208,7 +208,8 @@ MSR_KVM_STEAL_TIME: 0x4b564d03
>>> __u64 steal;
>>> __u32 version;
>>> __u32 flags;
>>> -   __u32 pad[12];
>>> +   __u8  preempted;
>>> +   __u32 pad[11];
>>> }
>>
>> I think I'd be explicit about the 3 pad bytes you've left.
> 
> Seconded.
> 
> With that change are all KVM bits
> 
> Acked-by: Radim Krčmář 

Saw this after replying to the previous message.  If you need to post v6
of the full series, it would be nice if you removed the
kvm_read_guest_cached.  But anyway it wasn't my intention to override Radim.

Paolo


Re: [PATCH v4 5/5] x86, kvm: support vcpu preempted check

2016-10-24 Thread Paolo Bonzini


On 19/10/2016 19:24, Radim Krčmář wrote:
>> > +  if (vcpu->arch.st.msr_val & KVM_MSR_ENABLED)
>> > +  if (kvm_read_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
>> > +  &vcpu->arch.st.steal,
>> > +  sizeof(struct kvm_steal_time)) == 0) {
>> > +  vcpu->arch.st.steal.preempted = 1;
>> > +  kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
>> > +  &vcpu->arch.st.steal,
>> > +  sizeof(struct kvm_steal_time));
>> > +  }
> Please name this block of code.  Something like
>   kvm_steal_time_set_preempted(vcpu);

While at it:

1) the kvm_read_guest_cached is not necessary.  You can rig the call to
kvm_write_guest_cached so that it only writes vcpu->arch.st.steal.preempted.

2) split the patch in host and guest sides.

Paolo


Re: [PATCH v2 2/2] PCI: Don't disable PF's memory decoding when enabling SRIOV

2016-10-24 Thread Bjorn Helgaas
On Mon, Oct 24, 2016 at 10:28:02AM +1100, Gavin Shan wrote:
> On Fri, Oct 21, 2016 at 11:50:34AM -0500, Bjorn Helgaas wrote:
> >Hi Gavin,
> >
> >On Fri, Sep 30, 2016 at 09:47:50AM +1000, Gavin Shan wrote:
> >> pci_update_resource() might be called to update (shift) IOV BARs
> >> in PPC PowerNV specific pcibios_sriov_enable() when enabling PF's
> >> SRIOV capability. At that point, the PF have been functional if
> >> the SRIOV is enabled through sysfs entry "sriov_numvfs". The PF's
> >> memory decoding (0x2 in PCI_COMMAND) shouldn't be disabled when
> >> updating its IOV BARs with pci_update_resource(). Otherwise, we
> >> receives EEH error caused by MMIO access to PF's memory BARs during
> >> the window when PF's memory decoding is disabled.
> >
> >The fact that you get EEH errors is irrelevant.  We can't write code
> >simply to avoid errors -- we have to write code to make the system
> >work correctly.
> >
> >I do not want to add a "mmio_force_on" parameter to
> >pci_update_resource().  That puts the burden on the caller to
> >understand this subtle issue.  If the caller passes mmio_force_on=1
> >when it shouldn't, things will almost always work, but once in a blue
> >moon a half-updated BAR will conflict with some other device in the
> >system, and we'll have an unreproducible, undebuggable crash.
> >
> 
> Bjorn, thanks for your comments. Yes, the EEH error was caused by MMIO
> access to PF's normal BARs, not VF BARs. Yeah, I also had the conern
> that adding parameter to pci_update_resource() increases the visible
> complexity to the caller of the function.
> 
> >What you do need is an explanation of why it's safe to non-atomically
> >update a VF BARx in the SR-IOV capability.  I think this probably
> >involves the VF MSE bit, and you probably have to either disable VFs
> >completely or clear the MSE bit while you're updating the VF BARx.  We
> >should be able to do this inside pci_update_resource() without
> >changing the interface.
> >
> 
> Yes, It's what PATCH[1/2] does: (PCI_SRIOV_CTRL_VFE | PCI_SRIOV_CTRL_MSE)
> are set after VF BARs are updated with pci_update_resource() in this PPC
> specific scenario. There are other two situations where the IOV BARs are
> updated: PCI resource resizing and allocation during system booting or hot
> adding PF. The VF shouldn't be enabled in both cases when updating IOV BARs.
> 
> I think you suggest to add check in pci_update_resource(): Don't disable
> PF's memory decoding when updating VF BARs. I will send updated revision
> if it's what you want.
> 
>   /*
>* The PF might be functional when updating its IOV BARs. So PF's
>* memory decoding shouldn't be disabled when updating its IOV BARs.
>*/
>   disable = (res->flags & IORESOURCE_MEM_64) && !dev->mmio_always_on;
> #ifdef CONFIG_PCI_IOV
>   disable &= !(resno >= PCI_IOV_RESOURCES &&
>resno <= PCI_IOV_RESOURCE_END);
> #endif
>   if (disable) {
>   pci_read_config_word(dev, PCI_COMMAND, &cmd);
>   pci_write_config_word(dev, PCI_COMMAND,
> cmd & ~PCI_COMMAND_MEMORY);
>   }

Not exactly.  A 64-bit BAR cannot be updated atomically.  The whole
point of this exercise is to make sure that when we update such a BAR,
whether it is a normal PCI BAR or an SR-IOV BAR, the transient state
does not conflict with anything else in the system.

For example, consider two devices that do not conflict:

  device A BAR 0: 0x_2000
  device B BAR 0: 0x_4000

We want to update A's BAR 0 to 0x0001_4000.  We can't do the
update atomically, so we have this sequence:

  before update:device A BAR 0: 0x_2000
  after writing lower half: device A BAR 0: 0x_4000
  after writing upper half: device A BAR 0: 0x0001_4000

If the driver for device B accesses B between the writes, both A and B
may respond, which is a fatal error.

Probably the *best* thing would be to make pci_update_resource()
return an error if it's asked to update a BAR that is enabled, but I
haven't looked at all the callers to see whether that's practical.

The current strategy in pci_update_resource() is to clear
PCI_COMMAND_MEMORY when updating a 64-bit memory BAR.  This only
applies to the regular PCI BARs 0-5.

Extending that strategy to SR-IOV would mean clearing
PCI_SRIOV_CTRL_MSE when updating a 64-bit VF BAR.  Obviously you
wouldn't clear PCI_COMMAND_MEMORY in this case because
PCI_COMMAND_MEMORY doesn't affect the VF BARs.

Bjorn


Re: [PATCH] kernel: irq: fix build failure

2016-10-24 Thread Lee Jones
On Mon, 24 Oct 2016, Stephen Rothwell wrote:
> On Mon, 24 Oct 2016 11:22:15 +0100 Lee Jones  wrote:
> > On Fri, 21 Oct 2016, Thomas Gleixner wrote:
> > 
> > > On Fri, 21 Oct 2016, Stephen Rothwell wrote:  
> > > > On Thu, 20 Oct 2016 14:55:45 +0200 (CEST) Thomas Gleixner 
> > > >  wrote:  
> > > > > I know. This is under discussion with the driver folks as we are not 
> > > > > going
> > > > > to blindly export stuff just because someone slapped a 
> > > > > irq_set_parent()
> > > > > into the code w/o knowing why.  
> > > > 
> > > > Do we have any idea if a resolution is close.  This was first reported
> > > > in linux-next in September 14/15.  :-(  
> > > 
> > > Grr. Yes. As much as I hate it, I'll go and export it for now. Should be
> > > able to get it into rc2.  
> > 
> > Did this get in?  I still have people complaining about it.
> 
> It is in -rc2.  Commit 3118dac501bc.

Ah, I was searching for patches authored my Thomas.

I see it now, thanks.

-- 
Lee Jones
Linaro STMicroelectronics Landing Team Lead
Linaro.org │ Open source software for ARM SoCs
Follow Linaro: Facebook | Twitter | Blog


Re: [PATCH] kernel: irq: fix build failure

2016-10-24 Thread Stephen Rothwell
Hi Lee,

On Mon, 24 Oct 2016 11:22:15 +0100 Lee Jones  wrote:
>
> On Fri, 21 Oct 2016, Thomas Gleixner wrote:
> 
> > On Fri, 21 Oct 2016, Stephen Rothwell wrote:  
> > > On Thu, 20 Oct 2016 14:55:45 +0200 (CEST) Thomas Gleixner 
> > >  wrote:  
> > > > I know. This is under discussion with the driver folks as we are not 
> > > > going
> > > > to blindly export stuff just because someone slapped a irq_set_parent()
> > > > into the code w/o knowing why.  
> > > 
> > > Do we have any idea if a resolution is close.  This was first reported
> > > in linux-next in September 14/15.  :-(  
> > 
> > Grr. Yes. As much as I hate it, I'll go and export it for now. Should be
> > able to get it into rc2.  
> 
> Did this get in?  I still have people complaining about it.

It is in -rc2.  Commit 3118dac501bc.

-- 
Cheers,
Stephen Rothwell


Re: Canyonlands oops at Shutdown

2016-10-24 Thread Julian Margetson

On 21-Oct-16 6:53 AM, Julian Margetson wrote:

On 08-May-16 10:12 AM, Julian Margetson wrote:

CONFIG_PPC44x_SIMPLE=y

CONFIG_POWER_RESET=y
CONFIG_POWER_RESET_GPIO is not set
# CONFIG_POWER_RESET_GPIO_RESTART is not set
# CONFIG_POWER_RESET_LTC2952 is not set
# CONFIG_POWER_RESET_RESTART is not set
# CONFIG_POWER_RESET_SYSCON is not set
# CONFIG_POWER_RESET_SYSCON_POWEROFF is not set

CONFIG_WATCHDOG=y
CONFIG_WATCHDOG_CORE=y
# CONFIG_WATCHDOG_NOWAYOUT is not set
# CONFIG_WATCHDOG_SYSFS is not set
#
# Watchdog Device Drivers
#
# CONFIG_SOFT_WATCHDOG is not set
# CONFIG_GPIO_WATCHDOG is not set
# CONFIG_XILINX_WATCHDOG is not set
# CONFIG_ZIIRAVE_WATCHDOG is not set
# CONFIG_CADENCE_WATCHDOG is not set
CONFIG_DW_WATCHDOG=y
# CONFIG_MAX63XX_WATCHDOG is not set
# CONFIG_ALIM7101_WDT is not set
# CONFIG_I6300ESB_WDT is not set
# CONFIG_BOOKE_WDT is not set
# CONFIG_MEN_A21_WDT is not set


Issue appears again with kernel 4.9-rc1




Working again with 4.9-rc2



Re: Canyonlands oops with m41t80

2016-10-24 Thread Julian Margetson

On 21-Oct-16 6:56 AM, Julian Margetson wrote:
Been having an oops since kernel 4.6 with m41t80 on Sam460ex 
canyonlands board .


Regards

Julian Margetson



Can't complete a bisect.

too many different issues up to 4.6.0




Re: [PATCH v4 2/5] drivers/of: do not add memory for unavailable nodes

2016-10-24 Thread Michael Ellerman
Alistair Popple  writes:

> Hi Reza,
>
> On Thu, 6 Oct 2016 01:36:32 PM Reza Arbab wrote:
>> Respect the standard dt "status" property when scanning memory nodes in
>> early_init_dt_scan_memory(), so that if the node is unavailable, no
>> memory will be added.
>
> What happens if a kernel without this patch is booted on a system with some 
> status="disabled" device-nodes? Do older kernels just ignore this memory or 
> do 
> they try to use it?
>
> From what I can tell it seems that kernels without this patch will try and 
> use 
> this memory even if it is marked in the device-tree as status="disabled" 
> which 
> could lead to problems for older kernels when we start exporting this 
> property 
> from firmware.

The code already looks for "linux,usable-memory" in preference to "reg".
Can you use that instead?

That would have the advantage that existing kernels already understand
it.

Another problem with using "status" is we could have device trees out
there that have status = disabled and we don't know about it, and by
changing the kernel to use that property we break people's systems.
Though for memory nodes my guess is that's not true, but you never know ...

cheers


Re: [PATCH] kernel: irq: fix build failure

2016-10-24 Thread Lee Jones
On Fri, 21 Oct 2016, Thomas Gleixner wrote:

> On Fri, 21 Oct 2016, Stephen Rothwell wrote:
> > On Thu, 20 Oct 2016 14:55:45 +0200 (CEST) Thomas Gleixner 
> >  wrote:
> > > I know. This is under discussion with the driver folks as we are not going
> > > to blindly export stuff just because someone slapped a irq_set_parent()
> > > into the code w/o knowing why.
> > 
> > Do we have any idea if a resolution is close.  This was first reported
> > in linux-next in September 14/15.  :-(
> 
> Grr. Yes. As much as I hate it, I'll go and export it for now. Should be
> able to get it into rc2.

Did this get in?  I still have people complaining about it.

-- 
Lee Jones
Linaro STMicroelectronics Landing Team Lead
Linaro.org │ Open source software for ARM SoCs
Follow Linaro: Facebook | Twitter | Blog


Re: [PATCH V2 3/5] arm:dt:ls1021a: Add TMU device tree support for LS1021A

2016-10-24 Thread Shawn Guo
On Sun, Oct 09, 2016 at 02:47:04PM +0800, Jia Hongtao wrote:
> From: Hongtao Jia 
> 
> Also add nodes and properties for thermal management support.
> 
> Signed-off-by: Jia Hongtao 

For patch #3 ~ #5, I updated the subject prefix a bit and applied.

Shawn


Re: [PATCH 2/5] stop_machine: yield CPU during stop machine

2016-10-24 Thread Peter Zijlstra
On Mon, Oct 24, 2016 at 09:52:31AM +0200, Christian Borntraeger wrote:
> Peter, I will fixup the patch set (I forgot to remove the lowlatency
> in 2 places) and push it on my tree for linux-next. Lets see what happens.
> Would the tip tree be the right place if things work out ok?

I think so, you're touching a fair bit of kernel/locking/ and there's
bound to be some conflicts with work there. So carrying it in the
locking tree might be best.


Re: [PATCH V2 1/5] powerpc/mpc85xx: Update TMU device tree node for T1040/T1042

2016-10-24 Thread Shawn Guo
On Sun, Oct 09, 2016 at 02:47:02PM +0800, Jia Hongtao wrote:
> From: Hongtao Jia 
> 
> SoC compatible string and endianness property are added according to the
> new bindings.

The commit log doesn't seem to match the actual changes.  Same for patch
2/5.

Shawn

> 
> Signed-off-by: Jia Hongtao 
> ---
> Changes for V2:
> * Rebase on latest linux-next tree (next-20161006).
> 
>  arch/powerpc/boot/dts/fsl/t1040si-post.dtsi | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/boot/dts/fsl/t1040si-post.dtsi 
> b/arch/powerpc/boot/dts/fsl/t1040si-post.dtsi
> index 44e399b..145c7f4 100644
> --- a/arch/powerpc/boot/dts/fsl/t1040si-post.dtsi
> +++ b/arch/powerpc/boot/dts/fsl/t1040si-post.dtsi
> @@ -526,7 +526,7 @@
>  
>  0x0003 0x0012
>  0x00030001 0x001d>;
> - #thermal-sensor-cells = <0>;
> + #thermal-sensor-cells = <1>;
>   };
>  
>   thermal-zones {
> @@ -534,7 +534,7 @@
>   polling-delay-passive = <1000>;
>   polling-delay = <5000>;
>  
> - thermal-sensors = <&tmu>;
> + thermal-sensors = <&tmu 2>;
>  
>   trips {
>   cpu_alert: cpu-alert {
> -- 
> 2.1.0.27.g96db324
> 
> 
> ___
> linux-arm-kernel mailing list
> linux-arm-ker...@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel


Re: [PATCH v5 9/9] Documentation: virtual: kvm: Support vcpu preempted check

2016-10-24 Thread xinhui

This is new version for [PATCH v6 9/9] Documentation: virtual: kvm: Support 
vcpu preempted check
change:
an explicit pad[3] after __u8 preempted.
a typo fix in the commit log.

From defac64d7c6a50d5f18ef64a7c776af3e21e8b68 Mon Sep 17 00:00:00 2001
From: Pan Xinhui 
Date: Thu, 20 Oct 2016 09:33:36 -0400
Subject: [PATCH v6 9/9] Documentation: virtual: kvm: Support vcpu preempted 
check

Commit ("x86, kvm: support vcpu preempted check") adds one field "__u8
preempted" into struct kvm_steal_time. This field tells if one vcpu is
running or not.

It is zero if 1) some old KVM does not support this field. 2) the vcpu is
not preempted. Other values mean the vcpu has been preempted.

Signed-off-by: Pan Xinhui 
Acked-by: Radim Krčmář 
---
 Documentation/virtual/kvm/msr.txt | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/Documentation/virtual/kvm/msr.txt 
b/Documentation/virtual/kvm/msr.txt
index 2a71c8f..ab2ab76 100644
--- a/Documentation/virtual/kvm/msr.txt
+++ b/Documentation/virtual/kvm/msr.txt
@@ -208,7 +208,9 @@ MSR_KVM_STEAL_TIME: 0x4b564d03
__u64 steal;
__u32 version;
__u32 flags;
-   __u32 pad[12];
+   __u8  preempted;
+   __u8  u8_pad[3];
+   __u32 pad[11];
}
 
 	whose data will be filled in by the hypervisor periodically. Only one

@@ -232,6 +234,11 @@ MSR_KVM_STEAL_TIME: 0x4b564d03
nanoseconds. Time during which the vcpu is idle, will not be
reported as steal time.
 
+		preempted: indicate the VCPU who owns this struct is running or
+   not. Non-zero values mean the VCPU has been preempted. Zero
+   means the VCPU is not preempted. NOTE, it is always zero if the
+   the hypervisor doesn't support this field.
+
 MSR_KVM_EOI_EN: 0x4b564d04
data: Bit 0 is 1 when PV end of interrupt is enabled on the vcpu; 0
when disabled.  Bit 1 is reserved and must be zero.  When PV end of
--
2.4.11



Re: [PATCH v5 6/9] x86, kvm: support vcpu preempted check

2016-10-24 Thread xinhui

This is new version for [PATCH v6 6/9] x86, kvm: support vcpu preempted check
change:
an explicit pad[3] after __u8 preempted.

From b876ea1a2a724c004b543b2c103a1f8faa5f106e Mon Sep 17 00:00:00 2001
From: Pan Xinhui 
Date: Thu, 20 Oct 2016 08:14:41 -0400
Subject: [PATCH v6 6/9] x86, kvm: support vcpu preempted check

Support the vcpu_is_preempted() functionality under KVM. This will
enhance lock performance on overcommitted hosts (more runnable vcpus
than physical cpus in the system) as doing busy waits for preempted
vcpus will hurt system performance far worse than early yielding.

Use one field of struct kvm_steal_time to indicate that if one vcpu
is running or not.

unix benchmark result:
host:  kernel 4.8.1, i5-4570, 4 cpus
guest: kernel 4.8.1, 8 vcpus

test-case   after-patch   before-patch
Execl Throughput   |18307.9 lps  |11701.6 lps
File Copy 1024 bufsize 2000 maxblocks  |  1352407.3 KBps |   790418.9 KBps
File Copy 256 bufsize 500 maxblocks|   367555.6 KBps |   222867.7 KBps
File Copy 4096 bufsize 8000 maxblocks  |  3675649.7 KBps |  1780614.4 KBps
Pipe Throughput| 11872208.7 lps  | 11855628.9 lps
Pipe-based Context Switching   |  1495126.5 lps  |  1490533.9 lps
Process Creation   |29881.2 lps  |28572.8 lps
Shell Scripts (1 concurrent)   |23224.3 lpm  |22607.4 lpm
Shell Scripts (8 concurrent)   | 3531.4 lpm  | 3211.9 lpm
System Call Overhead   | 10385653.0 lps  | 10419979.0 lps

Signed-off-by: Pan Xinhui 
---
 arch/x86/include/uapi/asm/kvm_para.h |  4 +++-
 arch/x86/kernel/kvm.c| 12 
 arch/x86/kvm/x86.c   | 18 ++
 3 files changed, 33 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/uapi/asm/kvm_para.h 
b/arch/x86/include/uapi/asm/kvm_para.h
index 94dc8ca..1421a65 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -45,7 +45,9 @@ struct kvm_steal_time {
__u64 steal;
__u32 version;
__u32 flags;
-   __u32 pad[12];
+   __u8  preempted;
+   __u8  u8_pad[3];
+   __u32 pad[11];
 };
 
 #define KVM_STEAL_ALIGNMENT_BITS 5

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index edbbfc8..0b48dd2 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -415,6 +415,15 @@ void kvm_disable_steal_time(void)
wrmsr(MSR_KVM_STEAL_TIME, 0, 0);
 }
 
+static bool kvm_vcpu_is_preempted(int cpu)
+{
+   struct kvm_steal_time *src;
+
+   src = &per_cpu(steal_time, cpu);
+
+   return !!src->preempted;
+}
+
 #ifdef CONFIG_SMP
 static void __init kvm_smp_prepare_boot_cpu(void)
 {
@@ -471,6 +480,9 @@ void __init kvm_guest_init(void)
if (kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) {
has_steal_clock = 1;
pv_time_ops.steal_clock = kvm_steal_clock;
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+   pv_lock_ops.vcpu_is_preempted = kvm_vcpu_is_preempted;
+#endif
}
 
 	if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6c633de..a627537 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2057,6 +2057,8 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
	if (unlikely(kvm_read_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
		&vcpu->arch.st.steal, sizeof(struct kvm_steal_time))))
		return;
 
+	vcpu->arch.st.steal.preempted = 0;
+
if (vcpu->arch.st.steal.version & 1)
		vcpu->arch.st.steal.version += 1;  /* first time write, random junk */
 
@@ -2810,8 +2812,24 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)

kvm_make_request(KVM_REQ_STEAL_UPDATE, vcpu);
 }
 
+static void kvm_steal_time_set_preempted(struct kvm_vcpu *vcpu)
+{
+   if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED))
+   return;
+
+   if (unlikely(kvm_read_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
+   &vcpu->arch.st.steal, sizeof(struct kvm_steal_time))))
+   return;
+
+   vcpu->arch.st.steal.preempted = 1;
+
+   kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
+   &vcpu->arch.st.steal, sizeof(struct kvm_steal_time));
+}
+
 void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
 {
+   kvm_steal_time_set_preempted(vcpu);
kvm_x86_ops->vcpu_put(vcpu);
kvm_put_guest_fpu(vcpu);
vcpu->arch.last_host_tsc = rdtsc();
--
2.4.11



Re: [PATCH 2/5] stop_machine: yield CPU during stop machine

2016-10-24 Thread Christian Borntraeger
On 10/22/2016 02:06 AM, Nicholas Piggin wrote:
> On Fri, 21 Oct 2016 14:05:36 +0200
> Peter Zijlstra  wrote:
> 
>> On Fri, Oct 21, 2016 at 01:58:55PM +0200, Christian Borntraeger wrote:
>>> stop_machine can take a very long time if the hypervisor does
>>> overcommitment for guest CPUs. When waiting for "the one", let's
>>> give up our CPU by using the new cpu_relax_yield.  
>>
>> This seems something that would apply to most other virt stuff. Lets Cc
>> a few more lists for that.
>>
>>> Signed-off-by: Christian Borntraeger 
>>> ---
>>>  kernel/stop_machine.c | 2 +-
>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
>>> index ec9ab2f..1eb8266 100644
>>> --- a/kernel/stop_machine.c
>>> +++ b/kernel/stop_machine.c
>>> @@ -194,7 +194,7 @@ static int multi_cpu_stop(void *data)
>>> /* Simple state machine */
>>> do {
>>> /* Chill out and ensure we re-read multi_stop_state. */
>>> -   cpu_relax();
>>> +   cpu_relax_yield();
>>> if (msdata->state != curstate) {
>>> curstate = msdata->state;
>>> switch (curstate) {
>>> -- 
>>> 2.5.5
>>>   
> 
> This is the only caller of cpu_relax_yield()?

As of today, yes. Right now the yielding (a call to the hypervisor) in
cpu_relax() is only done for s390. Some time ago Heiko removed
that from s390 as well with commit 57f2ffe14fd125c2 ("s390: remove
diag 44 calls from cpu_relax()").

As it turns out, this made stop_machine run really slowly on virtualized
systems. For example, with large guests the kprobes test during bootup took
several seconds instead of running unnoticed. Therefore, we
reintroduced the yield with commit 4d92f50249eb ("s390: reintroduce diag 44
calls for cpu_relax()"), but the only place where we noticed the missing
yield was in the stop_machine code.

I would assume that we might find some other places where this makes
sense in the future, but I expect that we have far fewer places that
need yield than need lowlatency.

PS: We do something similar in our arch implementation for spinlocks,
but there we use the directed yield as we know which CPU holds the lock.


> 
> As a step to removing cpu_yield_lowlatency this series is nice so I
> have no objection. But "general" kernel coders still have basically
> no chance of using this properly.
> 
> I wonder what can be done about that. I've got that spin_do/while
> series I'll rebase on top of this, but a spin_yield variant of them
> is of no more help to the caller.
> 
> What makes this unique? Long latency and not performance critical?

I think what makes this unique is that ALL cpus spin and wait for one.
It was really the only place where I noticed a regression with Heiko's
first patch.

> Most places where we spin and maybe yield have been moved to arch
> code, but I wonder whether we can make an easier to use architecture
> independent API?


Peter, I will fix up the patch set (I forgot to remove the lowlatency
variant in 2 places) and push it on my tree for linux-next. Let's see what
happens. Would the tip tree be the right place if things work out ok?



Re: [PATCH kernel v4 3/4] vfio/spapr: Reference mm in tce_container

2016-10-24 Thread Nicholas Piggin
On Mon, 24 Oct 2016 17:53:09 +1100
Alexey Kardashevskiy  wrote:

> In some situations the userspace memory context may live longer than
> the userspace process itself so if we need to do proper memory context
> cleanup, we better have tce_container take a reference to mm_struct and
> use it later when the process is gone (@current or @current->mm is NULL).
> 
> This references mm and stores the pointer in the container; this is done
> when a container is just created so checking for !current->mm in other
> places becomes pointless.
> 
> This replaces current->mm with container->mm everywhere except debug
> prints.
> 
> This adds a check that current->mm is the same as the one stored in
> the container to prevent userspace from making changes to a memory
> context of other processes; in order to add this check,
> VFIO_CHECK_EXTENSION is moved out of the switch(cmd) as it is
> quite special anyway - it is the only ioctl() called when neither
> container nor container->mm is initialized.
> 
> Signed-off-by: Alexey Kardashevskiy 
> ---
> Changes:
> v4:
> * added check for container->mm!=current->mm in tce_iommu_ioctl()
> for all ioctls and removed other redundant checks

[...]

> @@ -326,13 +314,18 @@ static void *tce_iommu_open(unsigned long arg)
>  
>   container->v2 = arg == VFIO_SPAPR_TCE_v2_IOMMU;
>  
> + /* current->mm cannot be NULL in this context */
> + container->mm = current->mm;
> + atomic_inc(&container->mm->mm_count);

[...]

> @@ -733,7 +728,13 @@ static long tce_iommu_ioctl(void *iommu_data,
>   }
>  
>   return (ret < 0) ? 0 : ret;
> + }
>  
> + /* tce_iommu_open() initializes container->mm so it can't be NULL here */
> + if (container->mm != current->mm)
> + return -ESRCH;
> +
> + switch (cmd) {
>   case VFIO_IOMMU_SPAPR_TCE_GET_INFO: {
>   struct vfio_iommu_spapr_tce_info info;
>   struct tce_iommu_group *tcegrp;

I think doing the mm checks like this is a great improvement.

Reviewed-by: Nicholas Piggin 

Thanks,
Nick


[PATCH kernel v4 3/4] vfio/spapr: Reference mm in tce_container

2016-10-24 Thread Alexey Kardashevskiy
In some situations the userspace memory context may live longer than
the userspace process itself so if we need to do proper memory context
cleanup, we better have tce_container take a reference to mm_struct and
use it later when the process is gone (@current or @current->mm is NULL).

This references mm and stores the pointer in the container; this is done
when a container is just created so checking for !current->mm in other
places becomes pointless.

This replaces current->mm with container->mm everywhere except debug
prints.

This adds a check that current->mm is the same as the one stored in
the container to prevent userspace from making changes to a memory
context of other processes; in order to add this check,
VFIO_CHECK_EXTENSION is moved out of the switch(cmd) as it is
quite special anyway - it is the only ioctl() called when neither
container nor container->mm is initialized.

Signed-off-by: Alexey Kardashevskiy 
---
Changes:
v4:
* added check for container->mm!=current->mm in tce_iommu_ioctl()
for all ioctls and removed other redundant checks
---
 drivers/vfio/vfio_iommu_spapr_tce.c | 131 ++--
 1 file changed, 66 insertions(+), 65 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index d0c38b2..81ab93f 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -31,49 +31,46 @@
 static void tce_iommu_detach_group(void *iommu_data,
struct iommu_group *iommu_group);
 
-static long try_increment_locked_vm(long npages)
+static long try_increment_locked_vm(struct mm_struct *mm, long npages)
 {
long ret = 0, locked, lock_limit;
 
-   if (!current || !current->mm)
-   return -ESRCH; /* process exited */
-
if (!npages)
return 0;
 
-   down_write(&current->mm->mmap_sem);
-   locked = current->mm->locked_vm + npages;
+   down_write(&mm->mmap_sem);
+   locked = mm->locked_vm + npages;
lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
if (locked > lock_limit && !capable(CAP_IPC_LOCK))
ret = -ENOMEM;
else
-   current->mm->locked_vm += npages;
+   mm->locked_vm += npages;
 
pr_debug("[%d] RLIMIT_MEMLOCK +%ld %ld/%ld%s\n", current->pid,
npages << PAGE_SHIFT,
-   current->mm->locked_vm << PAGE_SHIFT,
+   mm->locked_vm << PAGE_SHIFT,
rlimit(RLIMIT_MEMLOCK),
ret ? " - exceeded" : "");
 
-   up_write(&current->mm->mmap_sem);
+   up_write(&mm->mmap_sem);
 
return ret;
 }
 
-static void decrement_locked_vm(long npages)
+static void decrement_locked_vm(struct mm_struct *mm, long npages)
 {
-   if (!current || !current->mm || !npages)
-   return; /* process exited */
+   if (!npages)
+   return;
 
-   down_write(&current->mm->mmap_sem);
-   if (WARN_ON_ONCE(npages > current->mm->locked_vm))
-   npages = current->mm->locked_vm;
-   current->mm->locked_vm -= npages;
+   down_write(&mm->mmap_sem);
+   if (WARN_ON_ONCE(npages > mm->locked_vm))
+   npages = mm->locked_vm;
+   mm->locked_vm -= npages;
pr_debug("[%d] RLIMIT_MEMLOCK -%ld %ld/%ld\n", current->pid,
npages << PAGE_SHIFT,
-   current->mm->locked_vm << PAGE_SHIFT,
+   mm->locked_vm << PAGE_SHIFT,
rlimit(RLIMIT_MEMLOCK));
-   up_write(&current->mm->mmap_sem);
+   up_write(&mm->mmap_sem);
 }
 
 /*
@@ -98,6 +95,7 @@ struct tce_container {
bool enabled;
bool v2;
unsigned long locked_pages;
+   struct mm_struct *mm;
struct iommu_table *tables[IOMMU_TABLE_GROUP_MAX_TABLES];
struct list_head group_list;
 };
@@ -107,17 +105,14 @@ static long tce_iommu_unregister_pages(struct tce_container *container,
 {
struct mm_iommu_table_group_mem_t *mem;
 
-   if (!current || !current->mm)
-   return -ESRCH; /* process exited */
-
if ((vaddr & ~PAGE_MASK) || (size & ~PAGE_MASK))
return -EINVAL;
 
-   mem = mm_iommu_find(current->mm, vaddr, size >> PAGE_SHIFT);
+   mem = mm_iommu_find(container->mm, vaddr, size >> PAGE_SHIFT);
if (!mem)
return -ENOENT;
 
-   return mm_iommu_put(current->mm, mem);
+   return mm_iommu_put(container->mm, mem);
 }
 
 static long tce_iommu_register_pages(struct tce_container *container,
@@ -127,14 +122,11 @@ static long tce_iommu_register_pages(struct tce_container *container,
struct mm_iommu_table_group_mem_t *mem = NULL;
unsigned long entries = size >> PAGE_SHIFT;
 
-   if (!current || !current->mm)
-   return -ESRCH; /* process exited */
-
if ((vaddr & ~PAGE_MASK) || (size & ~PAGE_MASK) ||
((vaddr + size) < vaddr))
   

[PATCH kernel v4 2/4] powerpc/iommu: Stop using @current in mm_iommu_xxx

2016-10-24 Thread Alexey Kardashevskiy
This changes mm_iommu_xxx helpers to take mm_struct as a parameter
instead of getting it from @current which in some situations may
not have a valid reference to mm.

This changes helpers to receive @mm and moves all references to @current
to the caller, including checks for !current and !current->mm;
checks in mm_iommu_preregistered() are removed as there is no caller
yet.

This moves the mm_iommu_adjust_locked_vm() call to the caller as
it receives mm_iommu_table_group_mem_t but it needs mm.

This should cause no behavioral change.

Signed-off-by: Alexey Kardashevskiy 
Reviewed-by: David Gibson 
---
 arch/powerpc/include/asm/mmu_context.h | 16 ++--
 arch/powerpc/mm/mmu_context_iommu.c| 46 +-
 drivers/vfio/vfio_iommu_spapr_tce.c| 14 ---
 3 files changed, 36 insertions(+), 40 deletions(-)

diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
index 424844b..b9e3f0a 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -19,16 +19,18 @@ extern void destroy_context(struct mm_struct *mm);
 struct mm_iommu_table_group_mem_t;
 
 extern int isolate_lru_page(struct page *page);/* from internal.h */
-extern bool mm_iommu_preregistered(void);
-extern long mm_iommu_get(unsigned long ua, unsigned long entries,
+extern bool mm_iommu_preregistered(struct mm_struct *mm);
+extern long mm_iommu_get(struct mm_struct *mm,
+   unsigned long ua, unsigned long entries,
struct mm_iommu_table_group_mem_t **pmem);
-extern long mm_iommu_put(struct mm_iommu_table_group_mem_t *mem);
+extern long mm_iommu_put(struct mm_struct *mm,
+   struct mm_iommu_table_group_mem_t *mem);
 extern void mm_iommu_init(struct mm_struct *mm);
 extern void mm_iommu_cleanup(struct mm_struct *mm);
-extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup(unsigned long ua,
-   unsigned long size);
-extern struct mm_iommu_table_group_mem_t *mm_iommu_find(unsigned long ua,
-   unsigned long entries);
+extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup(struct mm_struct *mm,
+   unsigned long ua, unsigned long size);
+extern struct mm_iommu_table_group_mem_t *mm_iommu_find(struct mm_struct *mm,
+   unsigned long ua, unsigned long entries);
 extern long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
unsigned long ua, unsigned long *hpa);
 extern long mm_iommu_mapped_inc(struct mm_iommu_table_group_mem_t *mem);
diff --git a/arch/powerpc/mm/mmu_context_iommu.c b/arch/powerpc/mm/mmu_context_iommu.c
index ad2e575..4c6db09 100644
--- a/arch/powerpc/mm/mmu_context_iommu.c
+++ b/arch/powerpc/mm/mmu_context_iommu.c
@@ -56,7 +56,7 @@ static long mm_iommu_adjust_locked_vm(struct mm_struct *mm,
}
 
pr_debug("[%d] RLIMIT_MEMLOCK HASH64 %c%ld %ld/%ld\n",
-   current->pid,
+   current ? current->pid : 0,
incr ? '+' : '-',
npages << PAGE_SHIFT,
mm->locked_vm << PAGE_SHIFT,
@@ -66,12 +66,9 @@ static long mm_iommu_adjust_locked_vm(struct mm_struct *mm,
return ret;
 }
 
-bool mm_iommu_preregistered(void)
+bool mm_iommu_preregistered(struct mm_struct *mm)
 {
-   if (!current || !current->mm)
-   return false;
-
-   return !list_empty(&current->mm->context.iommu_group_mem_list);
+   return !list_empty(&mm->context.iommu_group_mem_list);
 }
 EXPORT_SYMBOL_GPL(mm_iommu_preregistered);
 
@@ -124,19 +121,16 @@ static int mm_iommu_move_page_from_cma(struct page *page)
return 0;
 }
 
-long mm_iommu_get(unsigned long ua, unsigned long entries,
+long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries,
struct mm_iommu_table_group_mem_t **pmem)
 {
struct mm_iommu_table_group_mem_t *mem;
long i, j, ret = 0, locked_entries = 0;
struct page *page = NULL;
 
-   if (!current || !current->mm)
-   return -ESRCH; /* process exited */
-
mutex_lock(_list_mutex);
 
-   list_for_each_entry_rcu(mem, &current->mm->context.iommu_group_mem_list,
+   list_for_each_entry_rcu(mem, &mm->context.iommu_group_mem_list,
next) {
if ((mem->ua == ua) && (mem->entries == entries)) {
++mem->used;
@@ -154,7 +148,7 @@ long mm_iommu_get(unsigned long ua, unsigned long entries,
 
}
 
-   ret = mm_iommu_adjust_locked_vm(current->mm, entries, true);
+   ret = mm_iommu_adjust_locked_vm(mm, entries, true);
if (ret)
goto unlock_exit;
 
@@ -215,11 +209,11 @@ long mm_iommu_get(unsigned long ua, unsigned long entries,
mem->entries = entries;
*pmem = mem;
 
-   list_add_rcu(&mem->next, &current->mm->context.iommu_group_mem_list);
+   list_add_rcu(&mem->next, &mm->context.iommu_group_mem_list);

[PATCH kernel v4 4/4] powerpc/mm/iommu, vfio/spapr: Put pages on VFIO container shutdown

2016-10-24 Thread Alexey Kardashevskiy
At the moment the userspace tool is expected to request pinning of
the entire guest RAM when VFIO IOMMU SPAPR v2 driver is present.
When the userspace process finishes, all the pinned pages need to
be put; this is done as a part of the userspace memory context (MM)
destruction which happens on the very last mmdrop().

This approach has a problem: the MM of a userspace process
may live longer than the userspace process itself, as kernel threads
borrow the MM of whatever userspace process was running on the CPU
where the kernel thread was scheduled. If this happens, the MM remains
referenced until that exact kernel thread wakes up again
and releases the very last reference to the MM; on an idle system this
can take hours.

This moves preregistered region tracking from the MM to VFIO; instead of
using mm_iommu_table_group_mem_t::used, tce_container::prereg_list is
added so each container releases the regions which it has pre-registered.

This changes the userspace interface to return EBUSY if a memory
region is already registered in a container. However, it should not
have any practical effect as the only userspace tool available now
registers a memory region only once per container anyway.

As tce_iommu_register_pages/tce_iommu_unregister_pages are called
under container->lock, this does not need additional locking.

Signed-off-by: Alexey Kardashevskiy 
Reviewed-by: Nicholas Piggin 
---
Changes:
v4:
* changed tce_iommu_register_pages() to call mm_iommu_find() first and
avoid calling mm_iommu_put() if memory is preregistered already

v3:
* moved tce_iommu_prereg_free() call out of list_for_each_entry()

v2:
* updated commit log
---
 arch/powerpc/mm/mmu_context_book3s64.c |  4 ---
 arch/powerpc/mm/mmu_context_iommu.c| 11 ---
 drivers/vfio/vfio_iommu_spapr_tce.c| 58 +-
 3 files changed, 57 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/mm/mmu_context_book3s64.c b/arch/powerpc/mm/mmu_context_book3s64.c
index ad82735..1a07969 100644
--- a/arch/powerpc/mm/mmu_context_book3s64.c
+++ b/arch/powerpc/mm/mmu_context_book3s64.c
@@ -159,10 +159,6 @@ static inline void destroy_pagetable_page(struct mm_struct *mm)
 
 void destroy_context(struct mm_struct *mm)
 {
-#ifdef CONFIG_SPAPR_TCE_IOMMU
-   mm_iommu_cleanup(mm);
-#endif
-
 #ifdef CONFIG_PPC_ICSWX
drop_cop(mm->context.acop, mm);
kfree(mm->context.cop_lockp);
diff --git a/arch/powerpc/mm/mmu_context_iommu.c b/arch/powerpc/mm/mmu_context_iommu.c
index 4c6db09..104bad0 100644
--- a/arch/powerpc/mm/mmu_context_iommu.c
+++ b/arch/powerpc/mm/mmu_context_iommu.c
@@ -365,14 +365,3 @@ void mm_iommu_init(struct mm_struct *mm)
 {
	INIT_LIST_HEAD_RCU(&mm->context.iommu_group_mem_list);
 }
-
-void mm_iommu_cleanup(struct mm_struct *mm)
-{
-   struct mm_iommu_table_group_mem_t *mem, *tmp;
-
-   list_for_each_entry_safe(mem, tmp, &mm->context.iommu_group_mem_list,
-   next) {
-   list_del_rcu(&mem->next);
-   mm_iommu_do_free(mem);
-   }
-}
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 81ab93f..001a488 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -86,6 +86,15 @@ struct tce_iommu_group {
 };
 
 /*
+ * A container needs to remember which preregistered region it has
+ * referenced to do proper cleanup at the userspace process exit.
+ */
+struct tce_iommu_prereg {
+   struct list_head next;
+   struct mm_iommu_table_group_mem_t *mem;
+};
+
+/*
  * The container descriptor supports only a single group per container.
  * Required by the API as the container is not supplied with the IOMMU group
  * at the moment of initialization.
@@ -98,12 +107,27 @@ struct tce_container {
struct mm_struct *mm;
struct iommu_table *tables[IOMMU_TABLE_GROUP_MAX_TABLES];
struct list_head group_list;
+   struct list_head prereg_list;
 };
 
+static long tce_iommu_prereg_free(struct tce_container *container,
+   struct tce_iommu_prereg *tcemem)
+{
+   long ret;
+
+   list_del(&tcemem->next);
+   ret = mm_iommu_put(container->mm, tcemem->mem);
+   kfree(tcemem);
+
+   return ret;
+}
+
 static long tce_iommu_unregister_pages(struct tce_container *container,
__u64 vaddr, __u64 size)
 {
struct mm_iommu_table_group_mem_t *mem;
+   struct tce_iommu_prereg *tcemem;
+   bool found = false;
 
if ((vaddr & ~PAGE_MASK) || (size & ~PAGE_MASK))
return -EINVAL;
@@ -112,7 +136,17 @@ static long tce_iommu_unregister_pages(struct tce_container *container,
if (!mem)
return -ENOENT;
 
-   return mm_iommu_put(container->mm, mem);
+   list_for_each_entry(tcemem, &container->prereg_list, next) {
+   if (tcemem->mem == mem) {
+   found = true;
+   break;
+   }
+   }
+
+   if 

[PATCH kernel v4 1/4] powerpc/iommu: Pass mm_struct to init/cleanup helpers

2016-10-24 Thread Alexey Kardashevskiy
We are going to get rid of @current references in mmu_context_book3s64.c
and cache mm_struct in the VFIO container. Since mm_context_t does not
have reference counting, we will be using mm_struct which does have
the reference counter.

This changes mm_iommu_init/mm_iommu_cleanup to receive mm_struct rather
than mm_context_t (which is embedded into mm).

This should not cause any behavioral change.

Signed-off-by: Alexey Kardashevskiy 
Reviewed-by: David Gibson 
---
 arch/powerpc/include/asm/mmu_context.h | 4 ++--
 arch/powerpc/kernel/setup-common.c | 2 +-
 arch/powerpc/mm/mmu_context_book3s64.c | 4 ++--
 arch/powerpc/mm/mmu_context_iommu.c| 9 +
 4 files changed, 10 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
index 5c45114..424844b 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -23,8 +23,8 @@ extern bool mm_iommu_preregistered(void);
 extern long mm_iommu_get(unsigned long ua, unsigned long entries,
struct mm_iommu_table_group_mem_t **pmem);
 extern long mm_iommu_put(struct mm_iommu_table_group_mem_t *mem);
-extern void mm_iommu_init(mm_context_t *ctx);
-extern void mm_iommu_cleanup(mm_context_t *ctx);
+extern void mm_iommu_init(struct mm_struct *mm);
+extern void mm_iommu_cleanup(struct mm_struct *mm);
 extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup(unsigned long ua,
unsigned long size);
 extern struct mm_iommu_table_group_mem_t *mm_iommu_find(unsigned long ua,
diff --git a/arch/powerpc/kernel/setup-common.c b/arch/powerpc/kernel/setup-common.c
index 270ee30..f516ac5 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -915,7 +915,7 @@ void __init setup_arch(char **cmdline_p)
init_mm.context.pte_frag = NULL;
 #endif
 #ifdef CONFIG_SPAPR_TCE_IOMMU
-   mm_iommu_init(&init_mm.context);
+   mm_iommu_init(&init_mm);
 #endif
irqstack_early_init();
exc_lvl_early_init();
diff --git a/arch/powerpc/mm/mmu_context_book3s64.c b/arch/powerpc/mm/mmu_context_book3s64.c
index b114f8b..ad82735 100644
--- a/arch/powerpc/mm/mmu_context_book3s64.c
+++ b/arch/powerpc/mm/mmu_context_book3s64.c
@@ -115,7 +115,7 @@ int init_new_context(struct task_struct *tsk, struct mm_struct *mm)
mm->context.pte_frag = NULL;
 #endif
 #ifdef CONFIG_SPAPR_TCE_IOMMU
-   mm_iommu_init(&mm->context);
+   mm_iommu_init(mm);
 #endif
return 0;
 }
@@ -160,7 +160,7 @@ static inline void destroy_pagetable_page(struct mm_struct *mm)
 void destroy_context(struct mm_struct *mm)
 {
 #ifdef CONFIG_SPAPR_TCE_IOMMU
-   mm_iommu_cleanup(&mm->context);
+   mm_iommu_cleanup(mm);
 #endif
 
 #ifdef CONFIG_PPC_ICSWX
diff --git a/arch/powerpc/mm/mmu_context_iommu.c b/arch/powerpc/mm/mmu_context_iommu.c
index e0f1c33..ad2e575 100644
--- a/arch/powerpc/mm/mmu_context_iommu.c
+++ b/arch/powerpc/mm/mmu_context_iommu.c
@@ -373,16 +373,17 @@ void mm_iommu_mapped_dec(struct mm_iommu_table_group_mem_t *mem)
 }
 EXPORT_SYMBOL_GPL(mm_iommu_mapped_dec);
 
-void mm_iommu_init(mm_context_t *ctx)
+void mm_iommu_init(struct mm_struct *mm)
 {
-   INIT_LIST_HEAD_RCU(&ctx->iommu_group_mem_list);
+   INIT_LIST_HEAD_RCU(&mm->context.iommu_group_mem_list);
 }
 
-void mm_iommu_cleanup(mm_context_t *ctx)
+void mm_iommu_cleanup(struct mm_struct *mm)
 {
struct mm_iommu_table_group_mem_t *mem, *tmp;
 
-   list_for_each_entry_safe(mem, tmp, &ctx->iommu_group_mem_list, next) {
+   list_for_each_entry_safe(mem, tmp, &mm->context.iommu_group_mem_list,
+   next) {
	list_del_rcu(&mem->next);
mm_iommu_do_free(mem);
}
-- 
2.5.0.rc3



[PATCH kernel v4 0/4] powerpc/spapr/vfio: Put pages on VFIO container shutdown

2016-10-24 Thread Alexey Kardashevskiy
These patches are to fix a bug when pages stay pinned hours
after QEMU which requested pinning exited.

Please comment. Thanks.

Alexey Kardashevskiy (4):
  powerpc/iommu: Pass mm_struct to init/cleanup helpers
  powerpc/iommu: Stop using @current in mm_iommu_xxx
  vfio/spapr: Reference mm in tce_container
  powerpc/mm/iommu, vfio/spapr: Put pages on VFIO container shutdown

 arch/powerpc/include/asm/mmu_context.h |  20 ++--
 arch/powerpc/kernel/setup-common.c |   2 +-
 arch/powerpc/mm/mmu_context_book3s64.c |   6 +-
 arch/powerpc/mm/mmu_context_iommu.c|  60 ---
 drivers/vfio/vfio_iommu_spapr_tce.c| 181 ++---
 5 files changed, 154 insertions(+), 115 deletions(-)

-- 
2.5.0.rc3



Re: [PATCH] powerpc/mm: Use tlbiel only if we ever ran on the current cpu

2016-10-24 Thread Aneesh Kumar K.V
Balbir Singh  writes:

> On 24/10/16 14:20, Aneesh Kumar K.V wrote:
>> Before this patch, we used tlbiel, if we ever ran only on this core.
>> That was mostly derived from the nohash usage of the same. But the
>> ISA 3.0 clarifies tlbiel such that
>> 
>> "All TLB entries that have all of the following properties are made
>> invalid on the thread executing the tlbiel instruction"
>> 
>> Hence use tlbiel, if we only ever ran on just the current cpu.
>> 
>
> Could you clarify the impact? The impact I see is that it could
> lead to us thinking we invalidated the TLB across the core whereas
> we did it only on the current thread. This could leave other threads
> in the same core with stale TLBs, if the cpumask reported we ran on
> other threads in the same core?

Correct.

-aneesh