Re: powerpc: Sort the selects under CONFIG_PPC

2017-03-07 Thread Michael Ellerman
On Mon, 2017-03-06 at 12:05:17 UTC, Michael Ellerman wrote:
> We have a big list of selects under CONFIG_PPC, and currently they're
> completely unsorted. This means people tend to add new selects at the
> bottom of the list, and so two commits which both add a new select will
> often conflict.
> 
> Instead sort it alphabetically. This is nicer in and of itself, but also
> means two commits that add a new select will have a greater chance of
> not conflicting.
> 
> Add a note at the top and bottom asking people to keep it sorted.
> 
> And while we're here, pad out the 'if' expressions to make them stand
> out.
> 
> Suggested-by: Stephen Rothwell 
> Signed-off-by: Michael Ellerman 

Applied to powerpc fixes.

https://git.kernel.org/powerpc/c/a7d2475af7aedcb9b5c6343989a8bf

cheers


Re: powerpc/64: Fix L1D cache shape vector reporting L1I values

2017-03-07 Thread Michael Ellerman
On Mon, 2017-03-06 at 11:15:29 UTC, Michael Ellerman wrote:
> It seems we didn't pay quite enough attention when testing the new cache
> shape vectors, which means we didn't notice the bug where the vector for
> the L1D was using the L1I values. Fix it, resulting in eg:
> 
>   L1I  cache size: 0x8000   32768B  32K
>   L1I  line size:  0x80     8-way associative
>   L1D  cache size: 0x10000  65536B  64K
>   L1D  line size:  0x80     8-way associative
> 
> Fixes: 98a5f361b862 ("powerpc: Add new cache geometry aux vectors")
> Cut-and-paste-bug-by: Benjamin Herrenschmidt 
> Badly-reviewed-by: Michael Ellerman 
> Signed-off-by: Michael Ellerman 

Applied to powerpc fixes.

https://git.kernel.org/powerpc/c/9c7a00868c3a77c86ab07a2c51f3bb

cheers


Re: powerpc: Avoid panic during boot due to divide by zero in init_cache_info()

2017-03-07 Thread Michael Ellerman
On Sat, 2017-03-04 at 23:54:34 UTC, Anton Blanchard wrote:
> From: Anton Blanchard 
> 
> I see a panic in early boot when building with a recent gcc toolchain.
> The issue is a divide by zero, which is undefined. Older toolchains
> let us get away with it:
> 
> int foo(int a) { return a / 0; }
> 
> foo:
>   li 9,0
>   divw 3,3,9
>   extsw 3,3
>   blr
> 
> But newer ones catch it:
> 
> foo:
>   trap
> 
> Add a check to avoid the divide by zero.
> 
> Fixes: bd067f83b084 ("powerpc/64: Fix naming of cache block vs. cache line")
> Signed-off-by: Anton Blanchard 
> Acked-by: Benjamin Herrenschmidt 

Applied to powerpc fixes, thanks.

https://git.kernel.org/powerpc/c/6ba422c75facb1b1e0e206c464ee12

cheers


Re: [v2] powerpc/xics,icp-opal: Fix CPPR setting for icp-opal

2017-03-07 Thread Michael Ellerman
On Fri, 2017-03-03 at 00:58:44 UTC, Balbir Singh wrote:
> CPPR (Current Processor Priority Register) emulation on icp-opal
> uses a single priority in the backend and that can cause CPU
> hotplug to be affected when we try to send an IPI to it.
> 
> The fix is in migrate_irqs_away() and does the following:
> 
> 1. It moves the setting of the CPPR to after all IRQ migration
>    is complete.
> 2. In icp-opal, we ignore masking via the CPPR when it is set to
>    the default priority; anything else is passed through.
> 
> Right now this fix is designed with backporting to stable
> in mind.
> 
> We'll need a newer version of this later when we merge the
> DEFAULT and IPI priorities to the same value in the kernel.
> 
> Fixes: d74361881f0d ("powerpc/xics: Add ICP OPAL backend")
> Cc: sta...@vger.kernel.org (v4.8+)
> 
> Testing
> 
> 1. I've tested this fix on a virtual partition running under
>kvm
> 2. Gautham and Vaidy have done some testing on a system that
>    uses the OPAL backend
> 3. I've also tested this on a system with a native XICS controller
> 
> Suggested-by: Michael Ellerman 
> Reported-by: Vaidyanathan Srinivasan 
> Tested-by: Vaidyanathan Srinivasan 
> Signed-off-by: Balbir Singh 

Applied to powerpc fixes, thanks.

https://git.kernel.org/powerpc/c/a69e2fb70350a66f91175cd2625f1e

cheers


Re: [kernel] powerpc/powernv: Fix clobbered MSR

2017-03-07 Thread Michael Ellerman
On Thu, 2017-03-02 at 06:41:24 UTC, Alexey Kardashevskiy wrote:
> If CONFIG_DEBUG_INFO_SPLIT is not set but CONFIG_DEBUG_INFO is,
> the kernel makefile just adds "-g" and the scripts/gcc-goto.sh test for
> the "asm goto (""  entry)" support succeeds and adds
> -DCC_HAVE_ASM_GOTO to KBUILD_CFLAGS/KBUILD_AFLAGS. This effectively
> makes OPAL_BRANCH() a noop (or something similar).
> 
> With CONFIG_DEBUG_INFO_SPLIT=y, the makefile adds "-gsplit-dwarf" which
> somehow makes the scripts/gcc-goto.sh test fail and not define
> CC_HAVE_ASM_GOTO, so the alternative OPAL_BRANCH() is used
> and this particular chunk clobbers r12 where the parent code -
> OPAL_CALL() - stores the MSR; as a result, the kernel oopses right after
> early_setup() because of a broken MSR.
> 
> This replaces r12 with r11 which is overwritten right after
> OPAL_BRANCH(opal_tracepoint_entry) anyway.
> 
> I used gcc 5.4.1 20161205 built from sha1 ffadbf3ae29.
> 
> Fixes: ab9bad0ead9a ("powerpc/powernv: Remove separate entry for OPAL real 
> mode calls")
> Suggested-by: Paul Mackerras 
> Signed-off-by: Alexey Kardashevskiy 

Applied to powerpc fixes, thanks.

https://git.kernel.org/powerpc/c/2a9c4f40ab2c281f95d41577ff0f7f

cheers


Re: [V3,1/2] powerpc: Parse the command line before calling CAS

2017-03-07 Thread Michael Ellerman
On Tue, 2017-02-28 at 06:03:47 UTC, Suraj Jitindar Singh wrote:
> On POWER9 the hypervisor requires the guest to decide whether it would
> like to use a hash or radix mmu model at the time it calls
> ibm,client-architecture-support (CAS) based on what the hypervisor has
> said it's allowed to do. It is possible to disable radix by passing
> "disable_radix" on the command line. The next patch will add support for
> the new CAS format, thus we need to parse the command line before calling
> CAS so we can correctly select which mmu we would like to use.
> 
> Signed-off-by: Suraj Jitindar Singh 
> Reviewed-by: Paul Mackerras 
> Acked-by: Balbir Singh 

Series applied to powerpc fixes, thanks.

https://git.kernel.org/powerpc/c/12cc9fd6b2d8ee307a735b3b9faed0

cheers


Re: [v2] powernv:idle: Fix bug due to labeling ambiguity in power_enter_stop

2017-03-07 Thread Michael Ellerman
On Mon, 2017-02-27 at 05:40:07 UTC, "Gautham R. Shenoy" wrote:
> From: "Gautham R. Shenoy" 
> 
> Commit 09206b600c76 ("powernv: Pass PSSCR value and mask to
> power9_idle_stop") added additional code in power_enter_stop() to
> distinguish between stop requests whose PSSCR had ESL=EC=1 from those
> which did not. When ESL=EC=1, we do a forward-jump to a location
> labelled by "1", which had the code to handle the ESL=EC=1 case.
> 
> Unfortunately, just a couple of instructions before this label is the
> macro IDLE_STATE_ENTER_SEQ(), which also has a label "1" in its
> expansion.
> 
> As a result, the current code can end up directly executing the stop
> instruction for deep stop requests with PSSCR ESL=EC=1, without saving
> the hypervisor state.
> 
> Fix this BUG by labeling the location that handles ESL=EC=1 case with
> a more descriptive label ".Lhandle_esl_ec_set" (local label suggestion
> a la .Lxx from Anton Blanchard).
> 
> While at it, rename the label "2", which marks the code handling entry
> into deep stop states, to ".Lhandle_deep_stop".
> 
> For good measure, change the label in the IDLE_STATE_ENTER_SEQ() macro
> to a not-so-commonly-used value in order to avoid similar mishaps in
> the future.
> 
> Fixes: 09206b600c76 ("powernv: Pass PSSCR value and mask to
> power9_idle_stop")
> 
> Cc: Michael Neuling 
> Cc: Vaidyanathan Srinivasan 
> Cc: Michael Ellerman 
> Signed-off-by: Gautham R. Shenoy 
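
The ambiguity is easier to see in a stripped-down sketch (illustrative
only, not the kernel source): GNU as resolves a "1f" reference to the
nearest following "1:", which here is the one hidden inside the macro
expansion rather than the intended handler:

```asm
.macro IDLE_STATE_ENTER_SEQ_SKETCH	/* hypothetical, simplified */
1:	stop				/* macro's own local label "1" */
	b	1b
.endm

	bne	1f			/* means to reach the handler below... */
	IDLE_STATE_ENTER_SEQ_SKETCH	/* ...but this expands another "1:" first,
					 * so "1f" branches into the macro */
1:	/* ESL=EC=1 handler: save hypervisor state before stopping */
```

Using a named local label such as .Lhandle_esl_ec_set removes the
possibility of this kind of capture entirely.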

Applied to powerpc fixes, thanks.

https://git.kernel.org/powerpc/c/424f8acd328a111319ae30bf384e5d

cheers


Re: powerpc/64: Invalidate process table caching after setting process table

2017-03-07 Thread Michael Ellerman
On Mon, 2017-02-27 at 03:32:41 UTC, Paul Mackerras wrote:
> The POWER9 MMU reads and caches entries from the process table.
> When we kexec from one kernel to another, the second kernel sets
> its process table pointer but doesn't currently do anything to
> make the CPU invalidate any cached entries from the old process table.
> This adds a tlbie (TLB invalidate entry) instruction with parameters
> to invalidate caching of the process table after the new process
> table is installed.
> 
> Signed-off-by: Paul Mackerras 

Applied to powerpc fixes, thanks.

https://git.kernel.org/powerpc/c/7a70d7288c926ae88e0c773fbb506a

cheers


Re: [RESEND] selftest/powerpc/alignment: Fix false failures for skipped tests

2017-03-07 Thread Michael Ellerman
On Sun, 2017-02-26 at 06:08:39 UTC, Sachin Sant wrote:
> Tests under the alignment subdirectory are skipped when executed on
> previous generation hardware, but the harness still marks them as failed.
> 
> test: test_copy_unaligned
> tags: git_version:unknown
> [SKIP] Test skipped on line 26
> skip: test_copy_unaligned
> selftests: copy_unaligned [FAIL]
> 
> The MAGIC_SKIP_RETURN_VALUE assigned to the rc variable is retained until
> the program exits, which causes the test to be marked as failed.
> 
> This patch resets the value before returning to the main() routine.
> With this patch the test output is as follows:
> 
> test: test_copy_unaligned
> tags: git_version:unknown
> [SKIP] Test skipped on line 26
> skip: test_copy_unaligned
> selftests: copy_unaligned [PASS]
> 
> Signed-off-by: Sachin Sant 

Applied to powerpc fixes, thanks.

https://git.kernel.org/powerpc/c/a6d8a21596df041f36f4c2ccc260c4

cheers


Re: powerpc: booke: fix boot crash due to null hugepd

2017-03-07 Thread Michael Ellerman
On Thu, 2017-02-16 at 15:11:29 UTC, laurentiu.tu...@nxp.com wrote:
> From: Laurentiu Tudor 
> 
> On 32-bit book-e machines, hugepd_ok() does not take
> into account null hugepd values, causing this crash at boot:
> 
> Unable to handle kernel paging request for data at address 0x8000
> Faulting instruction address: 0xc00182a8
> Oops: Kernel access of bad area, sig: 11 [#1]
> SMP NR_CPUS=24
> CoreNet Generic
> Modules linked in:
> CPU: 1 PID: 1 Comm: swapper/0 Tainted: GW   
> 4.10.0-rc8-00016-g69b1f87 #11
> task: e505 task.stack: e5058000
> NIP: c00182a8 LR: c001829c CTR: 7ffe
> REGS: e5059c50 TRAP: 0300   Tainted: GW
> (4.10.0-rc8-00016-g69b1f87)
> MSR: 00021002 
>   CR: 88428e82  XER: 
> DEAR: 8000 ESR: 
> GPR00: c0107510 e5059d00 e505 8000 bff1 e5059d0c e5059d08 2017
> GPR08:     28428e82  c00027d0 
> GPR16:   88a28e82 2000 48422e82  88a28e84 dd004000
> GPR24: e5059e38   bff1 dd004000 0001 00029002 bff1
> NIP [c00182a8] follow_huge_addr+0x38/0xf0
> LR [c001829c] follow_huge_addr+0x2c/0xf0
> Call Trace:
> [e5059d00] [e5059d00] 0xe5059d00 (unreliable)
> [e5059d20] [c0107510] follow_page_mask+0x40/0x3c0
> [e5059d80] [c0107958] __get_user_pages+0xc8/0x420
> [e5059de0] [c010817c] get_user_pages_remote+0x8c/0x230
> [e5059e30] [c013f170] copy_strings+0x110/0x3a0
> [e5059ea0] [c013f42c] copy_strings_kernel+0x2c/0x50
> [e5059ec0] [c0141324] do_execveat_common+0x474/0x620
> [e5059f10] [c01414fc] do_execve+0x2c/0x40
> [e5059f20] [c0001f68] try_to_run_init_process+0x18/0x60
> [e5059f30] [c000289c] kernel_init+0xcc/0x120
> [e5059f40] [c000f1e8] ret_from_kernel_thread+0x5c/0x64
> Instruction dump:
> bfc10018 7c9f2378 90010024 7fc000a6 7c000146 80630020 38a1000c 38c10008
> 4bfff869 2c03 41c20090 81210008 <8143> 81630004 3860ffea 2f89
> ---[ end trace 4bf94e15fd9fa824 ]---
> 
> This impacts all NXP (ex-Freescale) 32-bit booke platforms.
> 
> Fixes: 20717e1ff526 ("powerpc/mm: Fix little-endian 4K hugetlb")
> 
> Reported-by: Madalin-Cristian Bucur 
> Signed-off-by: Laurentiu Tudor 

Applied to powerpc fixes, thanks.

https://git.kernel.org/powerpc/c/3fb66a70a4ae886445743354e4b60e

cheers


Re: [v3, 1/2] powerpc: Emulation support for load/store instructions on LE

2017-03-07 Thread Michael Ellerman
On Tue, 2017-02-14 at 09:16:42 UTC, Ravi Bangoria wrote:
> emulate_step() uses a number of underlying kernel functions that were
> initially not enabled for LE. This has been rectified since. So, fix
> emulate_step() for LE for the corresponding instructions.
> 
> Reported-by: Anton Blanchard 
> Signed-off-by: Ravi Bangoria 

Series applied to powerpc fixes, thanks.

https://git.kernel.org/powerpc/c/e148bd17f48bd17fca2f4f089ec879

cheers


Re: powerpc/64: Fix checksum folding in csum_add

2017-03-07 Thread Michael Ellerman
On Sat, 2017-02-04 at 09:03:40 UTC, Shile Zhang wrote:
> Fix the point missed in Paul's patch:
> "powerpc/64: Fix checksum folding in csum_tcpudp_nofold and
> ip_fast_csum_nofold"
> 
> Signed-off-by: Shile Zhang 
> Acked-by: Paul Mackerras 

Applied to powerpc fixes, thanks.

https://git.kernel.org/powerpc/c/6ad966d7303b70165228dba1ee8da1

cheers


Re: [2/3] powerpc: allow compilation on cross-endian toolchain

2017-03-07 Thread Michael Ellerman
On Sun, 2016-11-27 at 02:46:20 UTC, Nicholas Piggin wrote:
> Subject: [PATCH] powerpc: allow compilation on cross-endian toolchain
> 
> GCC can compile with either endian, but the ABI version always
> defaults to the default endian. Alan Modra says:
> 
>   you need both -mbig and -mabi=elfv1 to make a powerpc64le gcc
>   generate powerpc64 code
> 
> The opposite is true for powerpc64: when generating -mlittle it
> requires -mabi=elfv2 to generate the ELFv2 ABI. This change adds ABI
> annotations together with endianness. The kernel with ELFv2 ABI
> also uses -mcall-aixdesc, but boot/ does not.
> 
> Signed-off-by: Nicholas Piggin 

Applied to powerpc fixes, thanks.

https://git.kernel.org/powerpc/c/4dc831aa88132f835cefe876aa0206

cheers


Re: [PATCH v5 05/15] livepatch/powerpc: add TIF_PATCH_PENDING thread flag

2017-03-07 Thread Michael Ellerman
Josh Poimboeuf  writes:

> Add the TIF_PATCH_PENDING thread flag to enable the new livepatch
> per-task consistency model for powerpc.  The bit getting set indicates
> the thread has a pending patch which needs to be applied when the thread
> exits the kernel.
>
> The bit is included in the _TIF_USER_WORK_MASK macro so that
> do_notify_resume() and klp_update_patch_state() get called when the bit
> is set.
>
> Signed-off-by: Josh Poimboeuf 
> Reviewed-by: Petr Mladek 
> Reviewed-by: Miroslav Benes 
> Reviewed-by: Kamalesh Babulal 
> ---
>  arch/powerpc/include/asm/thread_info.h | 4 +++-
>  arch/powerpc/kernel/signal.c   | 4 
>  2 files changed, 7 insertions(+), 1 deletion(-)

The arch changes here seem fine. I haven't tested the whole series
though.

Acked-by: Michael Ellerman  (powerpc)

cheers


[PATCH kernel v7 03/10] powerpc/iommu/vfio_spapr_tce: Cleanup iommu_table disposal

2017-03-07 Thread Alexey Kardashevskiy
At the moment iommu_table can be disposed by either calling
iommu_table_free() directly or it_ops::free(); the only implementation
of free() is in IODA2 - pnv_ioda2_table_free() - and it calls
iommu_table_free() anyway.

As we are going to have reference counting on tables, we need a unified
way of disposing of tables.

This moves it_ops::free() call into iommu_free_table() and makes use
of the latter. The free() callback now handles only platform-specific
data.

Since iommu_free_table() now calls it_ops->free(), we need to have
it_ops initialized before calling iommu_free_table(), so this moves
that initialization into pnv_pci_ioda2_create_table().

This should cause no behavioral change.

Signed-off-by: Alexey Kardashevskiy 
Reviewed-by: David Gibson 
---
Changes:
v5:
* moved "tbl->it_ops = &pnv_ioda2_iommu_ops" earlier and updated
the commit log
---
 arch/powerpc/kernel/iommu.c   |  4 
 arch/powerpc/platforms/powernv/pci-ioda.c | 10 --
 drivers/vfio/vfio_iommu_spapr_tce.c   |  2 +-
 3 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 9bace5df05d5..bc142d87130f 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -719,6 +719,9 @@ void iommu_free_table(struct iommu_table *tbl, const char 
*node_name)
if (!tbl)
return;
 
+   if (tbl->it_ops->free)
+   tbl->it_ops->free(tbl);
+
if (!tbl->it_map) {
kfree(tbl);
return;
@@ -745,6 +748,7 @@ void iommu_free_table(struct iommu_table *tbl, const char 
*node_name)
/* free table */
kfree(tbl);
 }
+EXPORT_SYMBOL_GPL(iommu_free_table);
 
 /* Creates TCEs for a user provided buffer.  The user buffer must be
  * contiguous real kernel storage (not vmalloc).  The address passed here
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index 69c40b43daa3..7916d0cb05fe 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1425,7 +1425,6 @@ static void pnv_pci_ioda2_release_dma_pe(struct pci_dev 
*dev, struct pnv_ioda_pe
iommu_group_put(pe->table_group.group);
BUG_ON(pe->table_group.group);
}
-   pnv_pci_ioda2_table_free_pages(tbl);
iommu_free_table(tbl, of_node_full_name(dev->dev.of_node));
 }
 
@@ -2041,7 +2040,6 @@ static void pnv_ioda2_tce_free(struct iommu_table *tbl, 
long index,
 static void pnv_ioda2_table_free(struct iommu_table *tbl)
 {
pnv_pci_ioda2_table_free_pages(tbl);
-   iommu_free_table(tbl, "pnv");
 }
 
 static struct iommu_table_ops pnv_ioda2_iommu_ops = {
@@ -2318,6 +2316,8 @@ static long pnv_pci_ioda2_create_table(struct 
iommu_table_group *table_group,
if (!tbl)
return -ENOMEM;
 
+   tbl->it_ops = &pnv_ioda2_iommu_ops;
+
ret = pnv_pci_ioda2_table_alloc_pages(nid,
bus_offset, page_shift, window_size,
levels, tbl);
@@ -2326,8 +2326,6 @@ static long pnv_pci_ioda2_create_table(struct 
iommu_table_group *table_group,
return ret;
}
 
-   tbl->it_ops = &pnv_ioda2_iommu_ops;
-
*ptbl = tbl;
 
return 0;
@@ -2368,7 +2366,7 @@ static long pnv_pci_ioda2_setup_default_config(struct 
pnv_ioda_pe *pe)
if (rc) {
pe_err(pe, "Failed to configure 32-bit TCE table, err %ld\n",
rc);
-   pnv_ioda2_table_free(tbl);
+   iommu_free_table(tbl, "");
return rc;
}
 
@@ -2456,7 +2454,7 @@ static void pnv_ioda2_take_ownership(struct 
iommu_table_group *table_group)
pnv_pci_ioda2_unset_window(&pe->table_group, 0);
if (pe->pbus)
pnv_ioda_setup_bus_dma(pe, pe->pbus, false);
-   pnv_ioda2_table_free(tbl);
+   iommu_free_table(tbl, "pnv");
 }
 
 static void pnv_ioda2_release_ownership(struct iommu_table_group *table_group)
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
b/drivers/vfio/vfio_iommu_spapr_tce.c
index cf3de91fbfe7..fbec7348a7e5 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -680,7 +680,7 @@ static void tce_iommu_free_table(struct tce_container 
*container,
unsigned long pages = tbl->it_allocated_size >> PAGE_SHIFT;
 
tce_iommu_userspace_view_free(tbl, container->mm);
-   tbl->it_ops->free(tbl);
+   iommu_free_table(tbl, "");
decrement_locked_vm(container->mm, pages);
 }
 
-- 
2.11.0



[PATCH kernel v7 02/10] powerpc/powernv/iommu: Add real mode version of iommu_table_ops::exchange()

2017-03-07 Thread Alexey Kardashevskiy
In real mode, TCE tables are invalidated using special
cache-inhibited store instructions which are not available in
virtual mode.

This defines and implements exchange_rm() callback. This does not
define set_rm/clear_rm/flush_rm callbacks as there is no user for those -
exchange/exchange_rm are only to be used by KVM for VFIO.

The exchange_rm callback is defined for IODA1/IODA2 powernv platforms.

This replaces list_for_each_entry_rcu with its lockless version as
from now on pnv_pci_ioda2_tce_invalidate() can be called in
the real mode too.

Signed-off-by: Alexey Kardashevskiy 
Reviewed-by: David Gibson 
---
 arch/powerpc/include/asm/iommu.h  |  7 +++
 arch/powerpc/kernel/iommu.c   | 23 +++
 arch/powerpc/platforms/powernv/pci-ioda.c | 26 +-
 3 files changed, 55 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 2c1d50792944..4554699aec02 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -64,6 +64,11 @@ struct iommu_table_ops {
long index,
unsigned long *hpa,
enum dma_data_direction *direction);
+   /* Real mode */
+   int (*exchange_rm)(struct iommu_table *tbl,
+   long index,
+   unsigned long *hpa,
+   enum dma_data_direction *direction);
 #endif
void (*clear)(struct iommu_table *tbl,
long index, long npages);
@@ -208,6 +213,8 @@ extern void iommu_del_device(struct device *dev);
 extern int __init tce_iommu_bus_notifier_init(void);
 extern long iommu_tce_xchg(struct iommu_table *tbl, unsigned long entry,
unsigned long *hpa, enum dma_data_direction *direction);
+extern long iommu_tce_xchg_rm(struct iommu_table *tbl, unsigned long entry,
+   unsigned long *hpa, enum dma_data_direction *direction);
 #else
 static inline void iommu_register_group(struct iommu_table_group *table_group,
int pci_domain_number,
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 5f202a566ec5..9bace5df05d5 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1004,6 +1004,29 @@ long iommu_tce_xchg(struct iommu_table *tbl, unsigned 
long entry,
 }
 EXPORT_SYMBOL_GPL(iommu_tce_xchg);
 
+long iommu_tce_xchg_rm(struct iommu_table *tbl, unsigned long entry,
+   unsigned long *hpa, enum dma_data_direction *direction)
+{
+   long ret;
+
+   ret = tbl->it_ops->exchange_rm(tbl, entry, hpa, direction);
+
+   if (!ret && ((*direction == DMA_FROM_DEVICE) ||
+   (*direction == DMA_BIDIRECTIONAL))) {
+   struct page *pg = realmode_pfn_to_page(*hpa >> PAGE_SHIFT);
+
+   if (likely(pg)) {
+   SetPageDirty(pg);
+   } else {
+   tbl->it_ops->exchange_rm(tbl, entry, hpa, direction);
+   ret = -EFAULT;
+   }
+   }
+
+   return ret;
+}
+EXPORT_SYMBOL_GPL(iommu_tce_xchg_rm);
+
 int iommu_take_ownership(struct iommu_table *tbl)
 {
unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index ec58b7f6b6cf..69c40b43daa3 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1861,6 +1861,17 @@ static int pnv_ioda1_tce_xchg(struct iommu_table *tbl, 
long index,
 
return ret;
 }
+
+static int pnv_ioda1_tce_xchg_rm(struct iommu_table *tbl, long index,
+   unsigned long *hpa, enum dma_data_direction *direction)
+{
+   long ret = pnv_tce_xchg(tbl, index, hpa, direction);
+
+   if (!ret)
+   pnv_pci_p7ioc_tce_invalidate(tbl, index, 1, true);
+
+   return ret;
+}
 #endif
 
 static void pnv_ioda1_tce_free(struct iommu_table *tbl, long index,
@@ -1875,6 +1886,7 @@ static struct iommu_table_ops pnv_ioda1_iommu_ops = {
.set = pnv_ioda1_tce_build,
 #ifdef CONFIG_IOMMU_API
.exchange = pnv_ioda1_tce_xchg,
+   .exchange_rm = pnv_ioda1_tce_xchg_rm,
 #endif
.clear = pnv_ioda1_tce_free,
.get = pnv_tce_get,
@@ -1949,7 +1961,7 @@ static void pnv_pci_ioda2_tce_invalidate(struct 
iommu_table *tbl,
 {
struct iommu_table_group_link *tgl;
 
-   list_for_each_entry_rcu(tgl, &tbl->it_group_list, next) {
+   list_for_each_entry_lockless(tgl, &tbl->it_group_list, next) {
struct pnv_ioda_pe *pe = container_of(tgl->table_group,
struct pnv_ioda_pe, table_group);
struct pnv_phb *phb = pe->phb;
@@ -2005,6 +2017,17 @@ static int pnv_ioda2_tce_xchg(struct iommu_table *tbl, 
long index,
 
return ret;
 }
+
+static int 

[PATCH kernel v7 00/10] powerpc/kvm/vfio: Enable in-kernel acceleration

2017-03-07 Thread Alexey Kardashevskiy
This is my current queue of patches to add acceleration of TCE
updates in KVM.

This is based on Linus'es tree sha1 c1ae3cfa0e89 v4.11-rc1.

Please comment. Thanks.

Changes:
v7:
* added realmode's WARN_ON_ONCE_RM in arch/powerpc/kvm/book3s_64_vio_hv.c

v6:
* reworked the last patch in terms of error handling and parameters checking

v5:
* replaced "KVM: PPC: Separate TCE validation from update" with
"KVM: PPC: iommu: Unify TCE checking"
* changed already reviewed "powerpc/iommu/vfio_spapr_tce: Cleanup iommu_table 
disposal"
* reworked "KVM: PPC: VFIO: Add in-kernel acceleration for VFIO"
* more details in individual commit logs

v4:
* addressed comments from v3
* updated subject lines with correct component names
* regrouped the patchset in order:
- powerpc fixes;
- vfio_spapr_tce driver fixes;
- KVM/PPC fixes;
- KVM+PPC+VFIO;
* everything except last 2 patches have "Reviewed-By: David"

v3:
* there was no full repost, only last patch was posted

v2:
* 11/11 reworked to use new notifiers, it is rather RFC as it still has
a issue;
* got 09/11, 10/11 to use notifiers in 11/11;
* added rb: David to most of patches and added a comment in 05/11.


Alexey Kardashevskiy (10):
  powerpc/mmu: Add real mode support for IOMMU preregistered memory
  powerpc/powernv/iommu: Add real mode version of
iommu_table_ops::exchange()
  powerpc/iommu/vfio_spapr_tce: Cleanup iommu_table disposal
  powerpc/vfio_spapr_tce: Add reference counting to iommu_table
  KVM: PPC: Reserve KVM_CAP_SPAPR_TCE_VFIO capability number
  KVM: PPC: Enable IOMMU_API for KVM_BOOK3S_64 permanently
  KVM: PPC: Pass kvm* to kvmppc_find_table()
  KVM: PPC: Use preregistered memory API to access TCE list
  KVM: PPC: iommu: Unify TCE checking
  KVM: PPC: VFIO: Add in-kernel acceleration for VFIO

 Documentation/virtual/kvm/devices/vfio.txt |  22 +-
 arch/powerpc/include/asm/iommu.h   |  32 ++-
 arch/powerpc/include/asm/kvm_host.h|   8 +
 arch/powerpc/include/asm/kvm_ppc.h |  12 +-
 arch/powerpc/include/asm/mmu_context.h |   4 +
 include/uapi/linux/kvm.h   |   9 +
 arch/powerpc/kernel/iommu.c|  86 +---
 arch/powerpc/kvm/book3s_64_vio.c   | 329 -
 arch/powerpc/kvm/book3s_64_vio_hv.c| 292 -
 arch/powerpc/kvm/powerpc.c |   2 +
 arch/powerpc/mm/mmu_context_iommu.c|  39 
 arch/powerpc/platforms/powernv/pci-ioda.c  |  46 ++--
 arch/powerpc/platforms/powernv/pci.c   |   1 +
 arch/powerpc/platforms/pseries/iommu.c |   3 +-
 arch/powerpc/platforms/pseries/vio.c   |   2 +-
 drivers/vfio/vfio_iommu_spapr_tce.c|   2 +-
 virt/kvm/vfio.c|  60 ++
 arch/powerpc/kvm/Kconfig   |   1 +
 18 files changed, 843 insertions(+), 107 deletions(-)

-- 
2.11.0



[PATCH kernel v7 10/10] KVM: PPC: VFIO: Add in-kernel acceleration for VFIO

2017-03-07 Thread Alexey Kardashevskiy
This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
and H_STUFF_TCE requests targeted at an IOMMU TCE table used for VFIO
without passing them to user space, which saves time on switching
to user space and back.

This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
KVM tries to handle a TCE request in real mode; if that fails,
it passes the request to virtual mode to complete the operation.
If the virtual mode handler fails too, the request is passed to
user space; this is not expected to happen though.

To avoid dealing with page use counters (which is tricky in real mode),
this only accelerates SPAPR TCE IOMMU v2 clients which are required
to pre-register the userspace memory. The very first TCE request will
be handled in the VFIO SPAPR TCE driver anyway as the userspace view
of the TCE table (iommu_table::it_userspace) is not allocated till
the very first mapping happens and we cannot call vmalloc in real mode.

If we fail to update a hardware IOMMU table for an unexpected reason, we just
clear it and move on as there is nothing really we can do about it -
for example, if we hot plug a VFIO device to a guest, existing TCE tables
will be mirrored automatically to the hardware and there is no interface
to report to the guest about possible failures.

This adds new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to
the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd
and associates a physical IOMMU table with the SPAPR TCE table (which
is a guest view of the hardware IOMMU table). The iommu_table object
is cached and referenced so we do not have to look up for it in real mode.

This does not implement the UNSET counterpart as there is no use for it -
once the acceleration is enabled, the existing userspace won't
disable it unless a VFIO container is destroyed; this adds necessary
cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.

As this creates a descriptor per IOMMU table-LIOBN couple (called
kvmppc_spapr_tce_iommu_table), it is possible to have several
descriptors with the same iommu_table (hardware IOMMU table) attached
to the same LIOBN; we do not remove duplicates though, as
iommu_table_ops::exchange does not just update a TCE entry (which is
shared among IOMMU groups) but also invalidates the TCE cache
(one per IOMMU group).

This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
space.

This finally makes use of vfio_external_user_iommu_id() which was
introduced quite some time ago and was considered for removal.

Tests show that this patch increases transmission speed from 220MB/s
to 750..1020MB/s on a 10Gb network (Chelsio CXGB3 10Gb Ethernet card).

Signed-off-by: Alexey Kardashevskiy 
---
Changes:
v7:
* added realmode-friendly WARN_ON_ONCE_RM

v6:
* changed handling of errors returned by kvmppc_(rm_)tce_iommu_(un)map()
* moved kvmppc_gpa_to_ua() to TCE validation

v5:
* changed error codes in multiple places
* added a bunch of WARN_ON() in places which should not really happen
* added a check that an iommu table is not already attached to the LIOBN
* dropped explicit calls to iommu_tce_clear_param_check/
iommu_tce_put_param_check as kvmppc_tce_validate/kvmppc_ioba_validate
call them anyway (since the previous patch)
* if we fail to update a hardware IOMMU table for unexpected reason,
this just clears the entry

v4:
* added note to the commit log about allowing multiple updates of
the same IOMMU table;
* instead of checking for if any memory was preregistered, this
returns H_TOO_HARD if a specific page was not;
* fixed comments from v3 about error handling in many places;
* simplified TCE handlers and merged IOMMU parts inline - for example,
there used to be kvmppc_h_put_tce_iommu(), now it is merged into
kvmppc_h_put_tce(); this allows to check IOBA boundaries against
the first attached table only (makes the code simpler);

v3:
* simplified not to use VFIO group notifiers
* reworked cleanup, should be cleaner/simpler now

v2:
* reworked to use new VFIO notifiers
* now same iommu_table may appear in the list several times, to be fixed later
---
 Documentation/virtual/kvm/devices/vfio.txt |  22 +-
 arch/powerpc/include/asm/kvm_host.h|   8 +
 arch/powerpc/include/asm/kvm_ppc.h |   4 +
 include/uapi/linux/kvm.h   |   8 +
 arch/powerpc/kvm/book3s_64_vio.c   | 322 -
 arch/powerpc/kvm/book3s_64_vio_hv.c| 190 -
 arch/powerpc/kvm/powerpc.c |   2 +
 virt/kvm/vfio.c|  60 ++
 8 files changed, 611 insertions(+), 5 deletions(-)

diff --git a/Documentation/virtual/kvm/devices/vfio.txt 
b/Documentation/virtual/kvm/devices/vfio.txt
index ef51740c67ca..f95d867168ea 100644
--- a/Documentation/virtual/kvm/devices/vfio.txt
+++ b/Documentation/virtual/kvm/devices/vfio.txt
@@ -16,7 +16,25 @@ Groups:
 
 KVM_DEV_VFIO_GROUP attributes:
   KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
+   

[PATCH kernel v7 09/10] KVM: PPC: iommu: Unify TCE checking

2017-03-07 Thread Alexey Kardashevskiy
This reworks the helpers for checking TCE update parameters so that they
can be used in KVM.

This should cause no behavioral change.

Signed-off-by: Alexey Kardashevskiy 
---
Changes:
v6:
* s/tce/gpa/ as TCE without permission bits is a GPA and this is what is
passed everywhere
---
 arch/powerpc/include/asm/iommu.h| 20 +++-
 arch/powerpc/include/asm/kvm_ppc.h  |  6 --
 arch/powerpc/kernel/iommu.c | 37 +
 arch/powerpc/kvm/book3s_64_vio_hv.c | 31 +++
 4 files changed, 39 insertions(+), 55 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 82e77ebf85f4..1e6b03339a68 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -296,11 +296,21 @@ static inline void iommu_restore(void)
 #endif
 
 /* The API to support IOMMU operations for VFIO */
-extern int iommu_tce_clear_param_check(struct iommu_table *tbl,
-   unsigned long ioba, unsigned long tce_value,
-   unsigned long npages);
-extern int iommu_tce_put_param_check(struct iommu_table *tbl,
-   unsigned long ioba, unsigned long tce);
+extern int iommu_tce_check_ioba(unsigned long page_shift,
+   unsigned long offset, unsigned long size,
+   unsigned long ioba, unsigned long npages);
+extern int iommu_tce_check_gpa(unsigned long page_shift,
+   unsigned long gpa);
+
+#define iommu_tce_clear_param_check(tbl, ioba, tce_value, npages) \
+   (iommu_tce_check_ioba((tbl)->it_page_shift,   \
+   (tbl)->it_offset, (tbl)->it_size, \
+   (ioba), (npages)) || (tce_value))
+#define iommu_tce_put_param_check(tbl, ioba, gpa) \
+   (iommu_tce_check_ioba((tbl)->it_page_shift,   \
+   (tbl)->it_offset, (tbl)->it_size, \
+   (ioba), 1) || \
+   iommu_tce_check_gpa((tbl)->it_page_shift, (gpa)))
 
 extern void iommu_flush_tce(struct iommu_table *tbl);
 extern int iommu_take_ownership(struct iommu_table *tbl);
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index eba8988d8443..72c2a155641f 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -169,8 +169,10 @@ extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
struct kvm_create_spapr_tce_64 *args);
 extern struct kvmppc_spapr_tce_table *kvmppc_find_table(
struct kvm *kvm, unsigned long liobn);
-extern long kvmppc_ioba_validate(struct kvmppc_spapr_tce_table *stt,
-   unsigned long ioba, unsigned long npages);
+#define kvmppc_ioba_validate(stt, ioba, npages) \
+   (iommu_tce_check_ioba((stt)->page_shift, (stt)->offset, \
+   (stt)->size, (ioba), (npages)) ?\
+   H_PARAMETER : H_SUCCESS)
 extern long kvmppc_tce_validate(struct kvmppc_spapr_tce_table *tt,
unsigned long tce);
 extern long kvmppc_gpa_to_ua(struct kvm *kvm, unsigned long gpa,
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index d02b8d22fb50..4269f9f1623b 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -960,47 +960,36 @@ void iommu_flush_tce(struct iommu_table *tbl)
 }
 EXPORT_SYMBOL_GPL(iommu_flush_tce);
 
-int iommu_tce_clear_param_check(struct iommu_table *tbl,
-   unsigned long ioba, unsigned long tce_value,
-   unsigned long npages)
+int iommu_tce_check_ioba(unsigned long page_shift,
+   unsigned long offset, unsigned long size,
+   unsigned long ioba, unsigned long npages)
 {
-   /* tbl->it_ops->clear() does not support any value but 0 */
-   if (tce_value)
-   return -EINVAL;
+   unsigned long mask = (1UL << page_shift) - 1;
 
-   if (ioba & ~IOMMU_PAGE_MASK(tbl))
+   if (ioba & mask)
return -EINVAL;
 
-   ioba >>= tbl->it_page_shift;
-   if (ioba < tbl->it_offset)
+   ioba >>= page_shift;
+   if (ioba < offset)
return -EINVAL;
 
-   if ((ioba + npages) > (tbl->it_offset + tbl->it_size))
+   if ((ioba + 1) > (offset + size))
return -EINVAL;
 
return 0;
 }
-EXPORT_SYMBOL_GPL(iommu_tce_clear_param_check);
+EXPORT_SYMBOL_GPL(iommu_tce_check_ioba);
 
-int iommu_tce_put_param_check(struct iommu_table *tbl,
-   unsigned long ioba, unsigned long tce)
+int iommu_tce_check_gpa(unsigned long page_shift, unsigned long gpa)
 {
-   if (tce & ~IOMMU_PAGE_MASK(tbl))
-   return -EINVAL;
-
-   if (ioba & ~IOMMU_PAGE_MASK(tbl))
-   return -EINVAL;
-
-   ioba >>= tbl->it_page_shift;
-   if (ioba < tbl->it_offset)
-  

[PATCH kernel v7 08/10] KVM: PPC: Use preregistered memory API to access TCE list

2017-03-07 Thread Alexey Kardashevskiy
VFIO on sPAPR already implements guest memory pre-registration
when the entire guest RAM gets pinned. This can be used to translate
the physical address of a guest page containing the TCE list
from H_PUT_TCE_INDIRECT.

This makes use of the pre-registered memory API to access TCE list
pages in order to avoid unnecessary locking on the KVM memory
reverse map as we know that all of guest memory is pinned and
we have a flat array mapping GPA to HPA which makes it simpler and
quicker to index into that array (even with looking up the
kernel page tables in vmalloc_to_phys) than it is to find the memslot,
lock the rmap entry, look up the user page tables, and unlock the rmap
entry. Note that the rmap pointer is initialized to NULL
where declared (not in this patch).

If a requested chunk of memory has not been preregistered, this will
fall back to non-preregistered case and lock rmap.

Signed-off-by: Alexey Kardashevskiy 
Reviewed-by: David Gibson 
---
Changes:
v4:
* removed oneline inlines
* now falls back to locking rmap if TCE list is not in preregistered memory

v2:
* updated the commit log with David's comment
---
 arch/powerpc/kvm/book3s_64_vio_hv.c | 58 +++--
 1 file changed, 42 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index 918af76ab2b6..0f145fc7a3a5 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -239,6 +239,7 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
long i, ret = H_SUCCESS;
unsigned long tces, entry, ua = 0;
unsigned long *rmap = NULL;
+   bool prereg = false;
 
stt = kvmppc_find_table(vcpu->kvm, liobn);
if (!stt)
@@ -259,23 +260,47 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
if (ret != H_SUCCESS)
return ret;
 
-   if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, &rmap))
-   return H_TOO_HARD;
+   if (mm_iommu_preregistered(vcpu->kvm->mm)) {
+   /*
+* We get here if guest memory was pre-registered which
+* is normally VFIO case and gpa->hpa translation does not
+* depend on hpt.
+*/
+   struct mm_iommu_table_group_mem_t *mem;
 
-   rmap = (void *) vmalloc_to_phys(rmap);
+   if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, NULL))
+   return H_TOO_HARD;
 
-   /*
-* Synchronize with the MMU notifier callbacks in
-* book3s_64_mmu_hv.c (kvm_unmap_hva_hv etc.).
-* While we have the rmap lock, code running on other CPUs
-* cannot finish unmapping the host real page that backs
-* this guest real page, so we are OK to access the host
-* real page.
-*/
-   lock_rmap(rmap);
-   if (kvmppc_rm_ua_to_hpa(vcpu, ua, &tces)) {
-   ret = H_TOO_HARD;
-   goto unlock_exit;
+   mem = mm_iommu_lookup_rm(vcpu->kvm->mm, ua, IOMMU_PAGE_SIZE_4K);
+   if (mem)
+   prereg = mm_iommu_ua_to_hpa_rm(mem, ua, &tces) == 0;
+   }
+
+   if (!prereg) {
+   /*
+* This is usually a case of a guest with emulated devices only
+* when TCE list is not in preregistered memory.
+* We do not require memory to be preregistered in this case
+* so lock rmap and do __find_linux_pte_or_hugepte().
+*/
+   if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, &rmap))
+   return H_TOO_HARD;
+
+   rmap = (void *) vmalloc_to_phys(rmap);
+
+   /*
+* Synchronize with the MMU notifier callbacks in
+* book3s_64_mmu_hv.c (kvm_unmap_hva_hv etc.).
+* While we have the rmap lock, code running on other CPUs
+* cannot finish unmapping the host real page that backs
+* this guest real page, so we are OK to access the host
+* real page.
+*/
+   lock_rmap(rmap);
+   if (kvmppc_rm_ua_to_hpa(vcpu, ua, &tces)) {
+   ret = H_TOO_HARD;
+   goto unlock_exit;
+   }
}
 
for (i = 0; i < npages; ++i) {
@@ -289,7 +314,8 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
}
 
 unlock_exit:
-   unlock_rmap(rmap);
+   if (rmap)
+   unlock_rmap(rmap);
 
return ret;
 }
-- 
2.11.0



[PATCH kernel v7 07/10] KVM: PPC: Pass kvm* to kvmppc_find_table()

2017-03-07 Thread Alexey Kardashevskiy
The guest view TCE tables are per KVM anyway (not per VCPU) so pass kvm*
there. This will be used in the following patches where we will be
attaching VFIO containers to LIOBNs via ioctl() to KVM (rather than
to VCPU).

Signed-off-by: Alexey Kardashevskiy 
Reviewed-by: David Gibson 
---
 arch/powerpc/include/asm/kvm_ppc.h  |  2 +-
 arch/powerpc/kvm/book3s_64_vio.c|  7 ---
 arch/powerpc/kvm/book3s_64_vio_hv.c | 13 +++--
 3 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index dd11c4c8c56a..eba8988d8443 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -168,7 +168,7 @@ extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
 extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
struct kvm_create_spapr_tce_64 *args);
 extern struct kvmppc_spapr_tce_table *kvmppc_find_table(
-   struct kvm_vcpu *vcpu, unsigned long liobn);
+   struct kvm *kvm, unsigned long liobn);
 extern long kvmppc_ioba_validate(struct kvmppc_spapr_tce_table *stt,
unsigned long ioba, unsigned long npages);
 extern long kvmppc_tce_validate(struct kvmppc_spapr_tce_table *tt,
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index 3e26cd4979f9..e96a4590464c 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -214,12 +214,13 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
  unsigned long ioba, unsigned long tce)
 {
-   struct kvmppc_spapr_tce_table *stt = kvmppc_find_table(vcpu, liobn);
+   struct kvmppc_spapr_tce_table *stt;
long ret;
 
/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
/*  liobn, ioba, tce); */
 
+   stt = kvmppc_find_table(vcpu->kvm, liobn);
if (!stt)
return H_TOO_HARD;
 
@@ -247,7 +248,7 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
u64 __user *tces;
u64 tce;
 
-   stt = kvmppc_find_table(vcpu, liobn);
+   stt = kvmppc_find_table(vcpu->kvm, liobn);
if (!stt)
return H_TOO_HARD;
 
@@ -301,7 +302,7 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
struct kvmppc_spapr_tce_table *stt;
long i, ret;
 
-   stt = kvmppc_find_table(vcpu, liobn);
+   stt = kvmppc_find_table(vcpu->kvm, liobn);
if (!stt)
return H_TOO_HARD;
 
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index e4c4ea973e57..918af76ab2b6 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -48,10 +48,9 @@
  * WARNING: This will be called in real or virtual mode on HV KVM and virtual
  *  mode on PR KVM
  */
-struct kvmppc_spapr_tce_table *kvmppc_find_table(struct kvm_vcpu *vcpu,
+struct kvmppc_spapr_tce_table *kvmppc_find_table(struct kvm *kvm,
unsigned long liobn)
 {
-   struct kvm *kvm = vcpu->kvm;
struct kvmppc_spapr_tce_table *stt;
 
list_for_each_entry_lockless(stt, &kvm->arch.spapr_tce_tables, list)
@@ -182,12 +181,13 @@ EXPORT_SYMBOL_GPL(kvmppc_gpa_to_ua);
 long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
unsigned long ioba, unsigned long tce)
 {
-   struct kvmppc_spapr_tce_table *stt = kvmppc_find_table(vcpu, liobn);
+   struct kvmppc_spapr_tce_table *stt;
long ret;
 
/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
/*  liobn, ioba, tce); */
 
+   stt = kvmppc_find_table(vcpu->kvm, liobn);
if (!stt)
return H_TOO_HARD;
 
@@ -240,7 +240,7 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
unsigned long tces, entry, ua = 0;
unsigned long *rmap = NULL;
 
-   stt = kvmppc_find_table(vcpu, liobn);
+   stt = kvmppc_find_table(vcpu->kvm, liobn);
if (!stt)
return H_TOO_HARD;
 
@@ -301,7 +301,7 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
struct kvmppc_spapr_tce_table *stt;
long i, ret;
 
-   stt = kvmppc_find_table(vcpu, liobn);
+   stt = kvmppc_find_table(vcpu->kvm, liobn);
if (!stt)
return H_TOO_HARD;
 
@@ -322,12 +322,13 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
 long kvmppc_h_get_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
  unsigned long ioba)
 {
-   struct kvmppc_spapr_tce_table *stt = kvmppc_find_table(vcpu, liobn);
+   struct kvmppc_spapr_tce_table *stt;
long ret;
unsigned long idx;
struct page *page;
u64 *tbl;
 
+   stt = kvmppc_find_table(vcpu->kvm, liobn);
if (!stt)
return H_TOO_HARD;
 
-- 

[PATCH kernel v7 06/10] KVM: PPC: Enable IOMMU_API for KVM_BOOK3S_64 permanently

2017-03-07 Thread Alexey Kardashevskiy
It does not make much sense to have KVM on book3s-64 without the IOMMU
bits for PCI passthrough support, as they cost little and allow VFIO to
function on book3s KVM.

Having IOMMU_API always enabled makes it unnecessary to have a lot of
"#ifdef IOMMU_API" in arch/powerpc/kvm/book3s_64_vio*. With those
ifdefs, only user-space emulated devices could be accelerated (but not
VFIO), which does not seem very useful.

Signed-off-by: Alexey Kardashevskiy 
Reviewed-by: David Gibson 
---
 arch/powerpc/kvm/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
index 029be26b5a17..65a471de96de 100644
--- a/arch/powerpc/kvm/Kconfig
+++ b/arch/powerpc/kvm/Kconfig
@@ -67,6 +67,7 @@ config KVM_BOOK3S_64
select KVM_BOOK3S_64_HANDLER
select KVM
select KVM_BOOK3S_PR_POSSIBLE if !KVM_BOOK3S_HV_POSSIBLE
+   select SPAPR_TCE_IOMMU if IOMMU_SUPPORT
---help---
  Support running unmodified book3s_64 and book3s_32 guest kernels
  in virtual machines on book3s_64 host processors.
-- 
2.11.0



[PATCH kernel v7 05/10] KVM: PPC: Reserve KVM_CAP_SPAPR_TCE_VFIO capability number

2017-03-07 Thread Alexey Kardashevskiy
This adds a capability number for in-kernel support for VFIO on
SPAPR platform.

The capability will tell the user space whether in-kernel handlers of
H_PUT_TCE can handle VFIO-targeted requests or not. If not, the user space
must not attempt allocating a TCE table in the host kernel via
the KVM_CREATE_SPAPR_TCE KVM ioctl because in that case TCE requests
will not be passed to the user space, which is the desired behaviour
in that situation.

Signed-off-by: Alexey Kardashevskiy 
Reviewed-by: David Gibson 
---
 include/uapi/linux/kvm.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index f51d5082a377..f5a52ffb6b58 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -883,6 +883,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_PPC_MMU_RADIX 134
 #define KVM_CAP_PPC_MMU_HASH_V3 135
 #define KVM_CAP_IMMEDIATE_EXIT 136
+#define KVM_CAP_SPAPR_TCE_VFIO 137
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
-- 
2.11.0



[PATCH kernel v7 04/10] powerpc/vfio_spapr_tce: Add reference counting to iommu_table

2017-03-07 Thread Alexey Kardashevskiy
So far iommu_table objects were only used in virtual mode and had
a single owner. We are going to change this by implementing in-kernel
acceleration of DMA mapping requests. The proposed acceleration
will handle requests in real mode and KVM will keep references to tables.

This adds a kref to iommu_table and defines new helpers to update it.
This replaces iommu_free_table() with iommu_table_put() and makes
iommu_free_table() static. iommu_table_get() is not used in this patch
but it will be in the following patch.

Since this touches prototypes, this also removes @node_name parameter as
it has never been really useful on powernv and carrying it for
the pseries platform code to iommu_free_table() seems to be quite
useless as well.

This should cause no behavioral change.

Signed-off-by: Alexey Kardashevskiy 
Reviewed-by: David Gibson 
---
 arch/powerpc/include/asm/iommu.h  |  5 +++--
 arch/powerpc/kernel/iommu.c   | 24 +++-
 arch/powerpc/platforms/powernv/pci-ioda.c | 14 +++---
 arch/powerpc/platforms/powernv/pci.c  |  1 +
 arch/powerpc/platforms/pseries/iommu.c|  3 ++-
 arch/powerpc/platforms/pseries/vio.c  |  2 +-
 drivers/vfio/vfio_iommu_spapr_tce.c   |  2 +-
 7 files changed, 34 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 4554699aec02..82e77ebf85f4 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -119,6 +119,7 @@ struct iommu_table {
struct list_head it_group_list;/* List of iommu_table_group_link */
unsigned long *it_userspace; /* userspace view of the table */
struct iommu_table_ops *it_ops;
+   struct kref it_kref;
 };
 
 #define IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry) \
@@ -151,8 +152,8 @@ static inline void *get_iommu_table_base(struct device *dev)
 
 extern int dma_iommu_dma_supported(struct device *dev, u64 mask);
 
-/* Frees table for an individual device node */
-extern void iommu_free_table(struct iommu_table *tbl, const char *node_name);
+extern void iommu_table_get(struct iommu_table *tbl);
+extern void iommu_table_put(struct iommu_table *tbl);
 
 /* Initializes an iommu_table based in values set in the passed-in
  * structure
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index bc142d87130f..d02b8d22fb50 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -711,13 +711,13 @@ struct iommu_table *iommu_init_table(struct iommu_table *tbl, int nid)
return tbl;
 }
 
-void iommu_free_table(struct iommu_table *tbl, const char *node_name)
+static void iommu_table_free(struct kref *kref)
 {
unsigned long bitmap_sz;
unsigned int order;
+   struct iommu_table *tbl;
 
-   if (!tbl)
-   return;
+   tbl = container_of(kref, struct iommu_table, it_kref);
 
if (tbl->it_ops->free)
tbl->it_ops->free(tbl);
@@ -736,7 +736,7 @@ void iommu_free_table(struct iommu_table *tbl, const char *node_name)
 
/* verify that table contains no entries */
if (!bitmap_empty(tbl->it_map, tbl->it_size))
-   pr_warn("%s: Unexpected TCEs for %s\n", __func__, node_name);
+   pr_warn("%s: Unexpected TCEs\n", __func__);
 
/* calculate bitmap size in bytes */
bitmap_sz = BITS_TO_LONGS(tbl->it_size) * sizeof(unsigned long);
@@ -748,7 +748,21 @@ void iommu_free_table(struct iommu_table *tbl, const char *node_name)
/* free table */
kfree(tbl);
 }
-EXPORT_SYMBOL_GPL(iommu_free_table);
+
+void iommu_table_get(struct iommu_table *tbl)
+{
+   kref_get(&tbl->it_kref);
+}
+EXPORT_SYMBOL_GPL(iommu_table_get);
+
+void iommu_table_put(struct iommu_table *tbl)
+{
+   if (!tbl)
+   return;
+
+   kref_put(&tbl->it_kref, iommu_table_free);
+}
+EXPORT_SYMBOL_GPL(iommu_table_put);
 
 /* Creates TCEs for a user provided buffer.  The user buffer must be
  * contiguous real kernel storage (not vmalloc).  The address passed here
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 7916d0cb05fe..ec3e565de511 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1425,7 +1425,7 @@ static void pnv_pci_ioda2_release_dma_pe(struct pci_dev *dev, struct pnv_ioda_pe
iommu_group_put(pe->table_group.group);
BUG_ON(pe->table_group.group);
}
-   iommu_free_table(tbl, of_node_full_name(dev->dev.of_node));
+   iommu_table_put(tbl);
 }
 
 static void pnv_ioda_release_vf_PE(struct pci_dev *pdev)
@@ -2226,7 +2226,7 @@ static void pnv_pci_ioda1_setup_dma_pe(struct pnv_phb *phb,
__free_pages(tce_mem, get_order(tce32_segsz * segs));
if (tbl) {
pnv_pci_unlink_table_and_group(tbl, &pe->table_group);
-   

[PATCH kernel v7 01/10] powerpc/mmu: Add real mode support for IOMMU preregistered memory

2017-03-07 Thread Alexey Kardashevskiy
This makes mm_iommu_lookup() able to work in realmode by replacing
list_for_each_entry_rcu() (which can do debug stuff which can fail in
real mode) with list_for_each_entry_lockless().

This adds realmode version of mm_iommu_ua_to_hpa() which adds
explicit vmalloc'd-to-linear address conversion.
Unlike mm_iommu_ua_to_hpa(), mm_iommu_ua_to_hpa_rm() can fail.

This changes mm_iommu_preregistered() to receive @mm because in real
mode @current does not always hold a valid pointer.

This adds realmode version of mm_iommu_lookup() which receives @mm
(for the same reason as for mm_iommu_preregistered()) and uses
lockless version of list_for_each_entry_rcu().

Signed-off-by: Alexey Kardashevskiy 
Reviewed-by: David Gibson 
---
 arch/powerpc/include/asm/mmu_context.h |  4 
 arch/powerpc/mm/mmu_context_iommu.c| 39 ++
 2 files changed, 43 insertions(+)

diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
index b9e3f0aca261..c70c8272523d 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -29,10 +29,14 @@ extern void mm_iommu_init(struct mm_struct *mm);
 extern void mm_iommu_cleanup(struct mm_struct *mm);
 extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup(struct mm_struct *mm,
unsigned long ua, unsigned long size);
+extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup_rm(
+   struct mm_struct *mm, unsigned long ua, unsigned long size);
 extern struct mm_iommu_table_group_mem_t *mm_iommu_find(struct mm_struct *mm,
unsigned long ua, unsigned long entries);
 extern long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
unsigned long ua, unsigned long *hpa);
+extern long mm_iommu_ua_to_hpa_rm(struct mm_iommu_table_group_mem_t *mem,
+   unsigned long ua, unsigned long *hpa);
 extern long mm_iommu_mapped_inc(struct mm_iommu_table_group_mem_t *mem);
 extern void mm_iommu_mapped_dec(struct mm_iommu_table_group_mem_t *mem);
 #endif
diff --git a/arch/powerpc/mm/mmu_context_iommu.c b/arch/powerpc/mm/mmu_context_iommu.c
index 497130c5c742..fc67bd766eaf 100644
--- a/arch/powerpc/mm/mmu_context_iommu.c
+++ b/arch/powerpc/mm/mmu_context_iommu.c
@@ -314,6 +314,25 @@ struct mm_iommu_table_group_mem_t *mm_iommu_lookup(struct mm_struct *mm,
 }
 EXPORT_SYMBOL_GPL(mm_iommu_lookup);
 
+struct mm_iommu_table_group_mem_t *mm_iommu_lookup_rm(struct mm_struct *mm,
+   unsigned long ua, unsigned long size)
+{
+   struct mm_iommu_table_group_mem_t *mem, *ret = NULL;
+
list_for_each_entry_lockless(mem, &mm->context.iommu_group_mem_list,
+   next) {
+   if ((mem->ua <= ua) &&
+   (ua + size <= mem->ua +
+(mem->entries << PAGE_SHIFT))) {
+   ret = mem;
+   break;
+   }
+   }
+
+   return ret;
+}
+EXPORT_SYMBOL_GPL(mm_iommu_lookup_rm);
+
 struct mm_iommu_table_group_mem_t *mm_iommu_find(struct mm_struct *mm,
unsigned long ua, unsigned long entries)
 {
@@ -345,6 +364,26 @@ long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
 }
 EXPORT_SYMBOL_GPL(mm_iommu_ua_to_hpa);
 
+long mm_iommu_ua_to_hpa_rm(struct mm_iommu_table_group_mem_t *mem,
+   unsigned long ua, unsigned long *hpa)
+{
+   const long entry = (ua - mem->ua) >> PAGE_SHIFT;
+   void *va = &mem->hpas[entry];
+   unsigned long *pa;
+
+   if (entry >= mem->entries)
+   return -EFAULT;
+
+   pa = (void *) vmalloc_to_phys(va);
+   if (!pa)
+   return -EFAULT;
+
+   *hpa = *pa | (ua & ~PAGE_MASK);
+
+   return 0;
+}
+EXPORT_SYMBOL_GPL(mm_iommu_ua_to_hpa_rm);
+
 long mm_iommu_mapped_inc(struct mm_iommu_table_group_mem_t *mem)
 {
if (atomic64_inc_not_zero(&mem->mapped))
-- 
2.11.0



Re: [RESEND PATCH] powerpc/pseries: move struct hcall_stats to c file

2017-03-07 Thread Michael Ellerman
"Tobin C. Harding"  writes:

> struct hcall_stats is only used in hvCall_inst.c.
>
> Move struct hcall_stats to hvCall_inst.c
>
> Resolves: #54
> Signed-off-by: Tobin C. Harding 
> ---
>
> Is this correct, adding 'Resolves: #XX' when fixing
> github.com/linuxppc/linux issues?

Not in the change log (the part above ---). That gets committed to the
history, and we don't want to clutter that with github issue numbers.
The kernel history will probably outlive the github issues, at which
point the issue numbers become meaningless.

You can put Resolves: in this section (below ---), if you like. That is
just informational and doesn't get committed.

But I'm also happy to just close the github issues manually. When I do
that I make a link from the issue to the commit, but not the other way
around.

cheers


linux-next: manual merge of the rcu tree with the powerpc-fixes tree

2017-03-07 Thread Stephen Rothwell
Hi Paul,

Today's linux-next merge of the rcu tree got a conflict in:

  arch/powerpc/Kconfig

between commit:

  a7d2475af7ae ("powerpc: Sort the selects under CONFIG_PPC")

from the powerpc-fixes tree and commit:

  9252dd3a96a7 ("rcu: Make arch select smp_mb__after_unlock_lock() strength")

from the rcu tree.

I fixed it up (see below) and can carry the fix as necessary. This
is now fixed as far as linux-next is concerned, but any non trivial
conflicts should be mentioned to your upstream maintainer when your tree
is submitted for merging.  You may also want to consider cooperating
with the maintainer of the conflicting tree to minimise any particularly
complex conflicts.

-- 
Cheers,
Stephen Rothwell

diff --cc arch/powerpc/Kconfig
index 97a8bc8a095c,9fecd004fee8..
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@@ -80,102 -80,95 +80,103 @@@ config ARCH_HAS_DMA_SET_COHERENT_MASK
  config PPC
bool
default y
 -  select BUILDTIME_EXTABLE_SORT
 +  #
 +  # Please keep this list sorted alphabetically.
 +  #
 +  select ARCH_HAS_DEVMEM_IS_ALLOWED
 +  select ARCH_HAS_DMA_SET_COHERENT_MASK
 +  select ARCH_HAS_ELF_RANDOMIZE
 +  select ARCH_HAS_GCOV_PROFILE_ALL
 +  select ARCH_HAS_SCALED_CPUTIME  if VIRT_CPU_ACCOUNTING_NATIVE
 +  select ARCH_HAS_SG_CHAIN
 +  select ARCH_HAS_TICK_BROADCAST  if GENERIC_CLOCKEVENTS_BROADCAST
 +  select ARCH_HAS_UBSAN_SANITIZE_ALL
 +  select ARCH_HAVE_NMI_SAFE_CMPXCHG
select ARCH_MIGHT_HAVE_PC_PARPORT
select ARCH_MIGHT_HAVE_PC_SERIO
 +  select ARCH_SUPPORTS_ATOMIC_RMW
 +  select ARCH_SUPPORTS_DEFERRED_STRUCT_PAGE_INIT
 +  select ARCH_USE_BUILTIN_BSWAP
 +  select ARCH_USE_CMPXCHG_LOCKREF if PPC64
 +  select ARCH_WANT_IPC_PARSE_VERSION
++  select ARCH_WEAK_RELEASE_ACQUIRE
select BINFMT_ELF
 -  select ARCH_HAS_ELF_RANDOMIZE
 -  select OF
 -  select OF_EARLY_FLATTREE
 -  select OF_RESERVED_MEM
 -  select HAVE_FTRACE_MCOUNT_RECORD
 +  select BUILDTIME_EXTABLE_SORT
 +  select CLONE_BACKWARDS
 +  select DCACHE_WORD_ACCESS   if PPC64 && CPU_LITTLE_ENDIAN
 +  select EDAC_ATOMIC_SCRUB
 +  select EDAC_SUPPORT
 +  select GENERIC_ATOMIC64 if PPC32
 +  select GENERIC_CLOCKEVENTS
 +  select GENERIC_CLOCKEVENTS_BROADCASTif SMP
 +  select GENERIC_CMOS_UPDATE
 +  select GENERIC_CPU_AUTOPROBE
 +  select GENERIC_IRQ_SHOW
 +  select GENERIC_IRQ_SHOW_LEVEL
 +  select GENERIC_SMP_IDLE_THREAD
 +  select GENERIC_STRNCPY_FROM_USER
 +  select GENERIC_STRNLEN_USER
 +  select GENERIC_TIME_VSYSCALL_OLD
 +  select HAVE_ARCH_AUDITSYSCALL
 +  select HAVE_ARCH_HARDENED_USERCOPY
 +  select HAVE_ARCH_JUMP_LABEL
 +  select HAVE_ARCH_KGDB
 +  select HAVE_ARCH_SECCOMP_FILTER
 +  select HAVE_ARCH_TRACEHOOK
 +  select HAVE_CBPF_JITif !PPC64
 +  select HAVE_CONTEXT_TRACKINGif PPC64
 +  select HAVE_DEBUG_KMEMLEAK
 +  select HAVE_DEBUG_STACKOVERFLOW
 +  select HAVE_DMA_API_DEBUG
select HAVE_DYNAMIC_FTRACE
 -  select HAVE_DYNAMIC_FTRACE_WITH_REGS if MPROFILE_KERNEL
 -  select HAVE_FUNCTION_TRACER
 +  select HAVE_DYNAMIC_FTRACE_WITH_REGSif MPROFILE_KERNEL
 +  select HAVE_EBPF_JITif PPC64
  +  select HAVE_EFFICIENT_UNALIGNED_ACCESS  if !(CPU_LITTLE_ENDIAN && POWER7_CPU)
 +  select HAVE_FTRACE_MCOUNT_RECORD
select HAVE_FUNCTION_GRAPH_TRACER
 -  select SYSCTL_EXCEPTION_TRACE
 -  select VIRT_TO_BUS if !PPC64
 +  select HAVE_FUNCTION_TRACER
 +  select HAVE_GCC_PLUGINS
 +  select HAVE_GENERIC_RCU_GUP
  +  select HAVE_HW_BREAKPOINT   if PERF_EVENTS && (PPC_BOOK3S || PPC_8xx)
select HAVE_IDE
select HAVE_IOREMAP_PROT
  -  select HAVE_EFFICIENT_UNALIGNED_ACCESS if !(CPU_LITTLE_ENDIAN && POWER7_CPU)
 +  select HAVE_IRQ_EXIT_ON_IRQ_STACK
 +  select HAVE_KERNEL_GZIP
select HAVE_KPROBES
 -  select HAVE_ARCH_KGDB
select HAVE_KRETPROBES
 -  select HAVE_ARCH_TRACEHOOK
 +  select HAVE_LIVEPATCH   if HAVE_DYNAMIC_FTRACE_WITH_REGS
select HAVE_MEMBLOCK
select HAVE_MEMBLOCK_NODE_MAP
 -  select HAVE_DMA_API_DEBUG
 +  select HAVE_MOD_ARCH_SPECIFIC
 +  select HAVE_NMI if PERF_EVENTS
select HAVE_OPROFILE
 -  select HAVE_DEBUG_KMEMLEAK
 -  select ARCH_HAS_SG_CHAIN
 -  select GENERIC_ATOMIC64 if PPC32
 +  select HAVE_OPTPROBES   if PPC64
select HAVE_PERF_EVENTS
 +  select HAVE_PERF_EVENTS_NMI if PPC64
select HAVE_PERF_REGS
select HAVE_PERF_USER_STACK_DUMP
 +  select HAVE_RCU_TABLE_FREE  if SMP
select HAVE_REGS_AND_STACK_ACCESS_API
 -  select HAVE_HW_BREAKPOINT if PERF_EVENTS && 

[PATCH] powerpc/boot: Fix zImage TOC alignment

2017-03-07 Thread Michael Ellerman
Recent toolchains force the TOC to be 256 byte aligned. We need to
enforce this alignment in the zImage linker script, otherwise pointers
to our TOC variables (__toc_start) could be incorrect. If the actual
start of the TOC and __toc_start don't have the same value we crash
early in the zImage wrapper.

Cc: sta...@vger.kernel.org
Suggested-by: Alan Modra 
Signed-off-by: Michael Ellerman 
---
 arch/powerpc/boot/zImage.lds.S | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/boot/zImage.lds.S b/arch/powerpc/boot/zImage.lds.S
index 861e72109df2..f080abfc2f83 100644
--- a/arch/powerpc/boot/zImage.lds.S
+++ b/arch/powerpc/boot/zImage.lds.S
@@ -68,6 +68,7 @@ SECTIONS
   }
 
 #ifdef CONFIG_PPC64_BOOT_WRAPPER
+  . = ALIGN(256);
   .got :
   {
 __toc_start = .;
-- 
2.7.4



Re: open list?

2017-03-07 Thread Stephen Rothwell
Hi Tobin,

On Wed, 8 Mar 2017 08:00:29 +1100 "Tobin C. Harding"  wrote:
>
> On Tue, Mar 07, 2017 at 01:16:34PM +0100, Christophe LEROY wrote:
> > 
> > 
> > Le 07/03/2017 à 11:02, Tobin C. Harding a écrit :  
> > >scripts/get_maintainers.pl says this is an open list;
> > >
> > >linuxppc-dev@lists.ozlabs.org (open list:LINUX FOR POWERPC (32-BIT AND
> > >64-BIT))
> > >
> > >Patches I've sent with this list cc'd have not been getting through. I
> > >resent one to check if it was a user error at my end.
> > >
> > >This email will obviously serve as another test.
> > >
> > >Is there something I am doing wrong?  
> > 
> > I got both the one you sent yesterday and today's resend.
> > 
> > Both can be seen at 
> > https://patchwork.ozlabs.org/project/linuxppc-dev/list/?submitter=68963
> >   
> 
> thanks Christophe and Daniel for the response. Must be something to do
> with my email sorting setup.

By default, you will not get a copy of your own postings returned to
you by the mailing list ...

The easiest way to check is to look in the mailing list archives.  Or
you can change your options at
https://lists.ozlabs.org/options/linuxppc-dev

-- 
Cheers,
Stephen Rothwell


Re: open list?

2017-03-07 Thread Tobin C. Harding
On Tue, Mar 07, 2017 at 01:16:34PM +0100, Christophe LEROY wrote:
> 
> 
> Le 07/03/2017 à 11:02, Tobin C. Harding a écrit :
> >scripts/get_maintainers.pl says this is an open list;
> >
> >linuxppc-dev@lists.ozlabs.org (open list:LINUX FOR POWERPC (32-BIT AND
> >64-BIT))
> >
> >Patches I've sent with this list cc'd have not been getting through. I
> >resent one to check if it was a user error at my end.
> >
> >This email will obviously serve as another test.
> >
> >Is there something I am doing wrong?
> 
> I got both the one you sent yesterday and today's resend.
> 
> Both can be seen at 
> https://patchwork.ozlabs.org/project/linuxppc-dev/list/?submitter=68963
> 

thanks Christophe and Daniel for the response. Must be something to do
with my email sorting setup.

thanks,
Tobin.


Re: [PowerPC] XFS : Metadata corruption detected at 0x60000000382100b0, xfs_agf block

2017-03-07 Thread Eric Sandeen
On 3/3/17 3:04 AM, Abdul Haleem wrote:
> Hi,
> 
> Reboot fails for PowerPC machine running RHEL7.3 (3.10.0-514.el7)
> following these messages:

Generally it's best to report RHEL bugs to Red Hat.
 
> SGI XFS with ACLs, security attributes, no debug enabled
> XFS (dm-0): Mounting V5 Filesystem
> XFS (dm-0): Starting recovery (logdev: internal)
> XFS (dm-0): Metadata corruption detected at 0x60000000382100b0, xfs_agf
> block 0x4b1
> XFS (dm-0): Unmount and run xfs_repair
> XFS (dm-0): First 64 bytes of corrupted metadata buffer:
> c000f06b7200: 58 41 47 46 00 00 00 01 00 00 00 03 00 32 00 00  XAGF.........2..
> c000f06b7210: 00 00 00 01 00 00 00 02 00 00 00 00 00 00 00 01  ................
> c000f06b7220: 00 00 00 01 00 00 00 00 00 00 00 76 00 00 00 02  ...........v....
> c000f06b7230: 00 00 00 04 00 2d 0c c7 00 2a 14 62 00 00 00 00  .....-...*.b....
> XFS (dm-0): metadata I/O error: block 0x4b1 ("xfs_trans_read_buf_map") error 117 numblks 1
> Failed to mount /sysroot.
> 
> Steps to recreate:
> Run some file system test (ltp or xfs/086) on a 4.10.0 upstream kernel
> built on the PowerVM LPAR. After the test completes, reboot the machine
> back to the base kernel; boot falls to dracut mode with the above
> messages, and every time it requires an xfs_repair to recover the
> file system.

... though Red Hat might tell you that running upstream kernels is
out of the supported realm.

However, this probably has to do with AGF packing changes.

> c000f06b7200: 58 41 47 46 00 00 00 01 00 00 00 03 00 32 00 00 
> XAGF.2..
magic   version sequencelength
> c000f06b7210: 00 00 00 01 00 00 00 02 00 00 00 00 00 00 00 01  
> 
agf roots   agf levels
> c000f06b7220: 00 00 00 01 00 00 00 00 00 00 00 76 00 00 00 02  
> ...v
flfirst fllast  
> c000f06b7230: 00 00 00 04 00 2d 0c c7 00 2a 14 62 00 00 00 00  
> .-...*.b
flcount freeblocks  longest btreeblks

The AGF verifier does:

if (!(agf->agf_magicnum == cpu_to_be32(XFS_AGF_MAGIC) &&
  XFS_AGF_GOOD_VERSION(be32_to_cpu(agf->agf_versionnum)) &&
  be32_to_cpu(agf->agf_freeblks) <= be32_to_cpu(agf->agf_length) &&
  be32_to_cpu(agf->agf_flfirst) < XFS_AGFL_SIZE(mp) &&
  be32_to_cpu(agf->agf_fllast) < XFS_AGFL_SIZE(mp) &&
  be32_to_cpu(agf->agf_flcount) <= XFS_AGFL_SIZE(mp)))
return false;

where:

#define XFS_AGFL_SIZE(mp) \
(((mp)->m_sb.sb_sectsize - \
 (xfs_sb_version_hascrc(&((mp)->m_sb)) ? \
sizeof(struct xfs_agfl) : 0)) / \
  sizeof(xfs_agblock_t))

and "struct xfs_agfl" packing has changed between rhel7 and upstream,
due to an early bug related to nailing down that on-disk format :/

Presumably this is what xfs_repair found and fixed?

-Eric


Re: [next 20170227] CPU remove DLPAR operation WARN @ lib/refcount.c:128

2017-03-07 Thread Kees Cook
This is likely a legitimate bug: something took the kref object
negative. (Which was noticed due to the recent migration of kref from
atomic_t to refcount_t which will refuse to perform dangerous
refcounting actions.)

If I had to guess, I think it's dlpar_cpu_exists(), which is calling
of_node_put() on the child. I don't think that should be happening,
but I'm not actually familiar with this code. :)

-Kees

On Mon, Feb 27, 2017 at 1:35 AM, Sachin Sant  wrote:
> With Feb 27 next tree I am seeing inconsistent results on a CPU remove
> DLPAR operation on a POWER8 LPAR.
>
> After the cpu remove operation the SMT capability of the LPAR is disabled.
>
> # uname -r
> 4.10.0-next-20170227
> # ppc64_cpu --smt
> SMT=8
> # lscpu
> Architecture:  ppc64le
> Byte Order:Little Endian
> CPU(s):16
> On-line CPU(s) list:   0-15
> Thread(s) per core:8
> Core(s) per socket:1
> Socket(s): 2
> NUMA node(s):  4
> Model: 2.1 (pvr 004b 0201)
> Model name:POWER8 (architected), altivec supported
> L1d cache: 64K
> L1i cache: 32K
> L2 cache:  512K
> L3 cache:  8192K
> NUMA node0 CPU(s):
> NUMA node1 CPU(s): 0-7
> NUMA node3 CPU(s):
> NUMA node4 CPU(s): 8-15
>
> After a DLPAR operation (CPU remove : 2 to 1) all the cpu seems to be
> removed. at the end of it I also see a warning @lib/refcount.c:128
> SMT capability is show as disabled. It should have remained at 8.
>
> # ppc64_cpu --smt
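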
> Machine is not SMT capable
> lscpu o/p shows 8  online cpus, with threads per core as 8.
>
> [root@alp12 ~]# lscpu
> Architecture:  ppc64le
> Byte Order:Little Endian
> CPU(s):8
> On-line CPU(s) list:   8-15
> Thread(s) per core:8
> Core(s) per socket:1
> Socket(s): 1
> NUMA node(s):  4
> Model: 2.1 (pvr 004b 0201)
> Model name:POWER8 (architected), altivec supported
> L1d cache: 64K
> L1i cache: 32K
> NUMA node0 CPU(s):
> NUMA node1 CPU(s):
> NUMA node3 CPU(s):
> NUMA node4 CPU(s): 8-15
> [root@alp12 ~]
>
> [  196.910677] cpu 8 (hwid 8) Ready to die...
> [  197.120324] cpu 9 (hwid 9) Ready to die...
> [  197.290265] cpu 10 (hwid 10) Ready to die...
> [  197.490234] cpu 11 (hwid 11) Ready to die...
> [  197.630110] cpu 12 (hwid 12) Ready to die...
> [  197.790094] cpu 13 (hwid 13) Ready to die...
> [  197.980016] cpu 14 (hwid 14) Ready to die...
> [  198.098137] cpu 15 (hwid 15) Ready to die...
> [  198.210074] pseries-hotplug-cpu: Failed to release drc (1008) for CPU 
> PowerPC,POWER8, rc: -17
> [  199.050648] cpu 0 (hwid 0) Ready to die...
> [  199.220530] cpu 1 (hwid 1) Ready to die...
> [  199.370459] cpu 2 (hwid 2) Ready to die...
> [  199.600322] cpu 3 (hwid 3) Ready to die...
> [  199.770259] cpu 4 (hwid 4) Ready to die...
> [  199.960189] cpu 5 (hwid 5) Ready to die...
> [  200.140145] cpu 6 (hwid 6) Ready to die...
> [  200.258067] cpu 7 (hwid 7) Ready to die...
> [  200.360320] refcount_t: underflow; use-after-free.
> [  200.360371] [ cut here ]
> [  200.360385] WARNING: CPU: 10 PID: 7194 at lib/refcount.c:128 
> refcount_sub_and_test+0xb8/0xf0
> [  200.360398] Modules linked in: iptable_mangle ipt_MASQUERADE 
> nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 
> nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp 
> rpadlpar_io rpaphp tun bridge stp llc kvm iptable_filter vmx_crypto 
> pseries_rng rng_core binfmt_misc nfsd ip_tables x_tables autofs4
> [  200.360472] CPU: 10 PID: 7194 Comm: drmgr Tainted: GW   
> 4.10.0-next-20170227 #3
> [  200.360478] task: c008b7222b00 task.stack: c008b72dc000
> [  200.360483] NIP: c1b6b4b8 LR: c1b6b4b4 CTR: 
> c1cefb50
> [  200.360488] REGS: c008b72df860 TRAP: 0700   Tainted: GW
> (4.10.0-next-20170227)
> [  200.360494] MSR: 80029033 
> [  200.360506]   CR: 22000422  XER: 0007
> [  200.360511] CFAR: c1faf738 SOFTE: 1
> [  200.360511] GPR00: c1b6b4b4 c008b72dfae0 c266c300 
> 0026
> [  200.360511] GPR04: c0050fd8adb0 c0050fda1660 00419000 
> ff00
> [  200.360511] GPR08:  c235143c 00050da4 
> 01d7
> [  200.360511] GPR12:  cea82800  
> 
> [  200.360511] GPR16:    
> 
> [  200.360511] GPR20:    
> 
> [  200.360511] GPR24:  10018430 c005dd05f520 
> c008b72dfe00
> [  200.360511] GPR28:  0016  
> c008b71ffa18
> [  200.360570] NIP [c1b6b4b8] refcount_sub_and_test+0xb8/0xf0

Re: [PATCH 06/18] pstore: Extract common arguments into structure

2017-03-07 Thread Namhyung Kim
On Tue, Mar 7, 2017 at 6:55 AM, Kees Cook  wrote:
> The read/mkfile pair pass the same arguments and should be cleared
> between calls. Move to a structure and wipe it after every loop.
>
> Signed-off-by: Kees Cook 
> ---
>  fs/pstore/platform.c   | 55 
> +++---
>  include/linux/pstore.h | 28 -
>  2 files changed, 57 insertions(+), 26 deletions(-)
>
> diff --git a/fs/pstore/platform.c b/fs/pstore/platform.c
> index 320a673ecb5b..3fa1575a6e36 100644
> --- a/fs/pstore/platform.c
> +++ b/fs/pstore/platform.c
> @@ -766,16 +766,9 @@ EXPORT_SYMBOL_GPL(pstore_unregister);
>  void pstore_get_records(int quiet)
>  {
> struct pstore_info *psi = psinfo;
> -   char *buf = NULL;
> -   ssize_t size;
> -   u64 id;
> -   int count;
> -   enum pstore_type_id type;
> -   struct timespec time;
> +   struct pstore_record record = { .psi = psi, };
> int failed = 0, rc;
> -   bool compressed;
> int unzipped_len = -1;
> -   ssize_t ecc_notice_size = 0;
>
> if (!psi)
> return;
> @@ -784,39 +777,51 @@ void pstore_get_records(int quiet)
> if (psi->open && psi->open(psi))
> goto out;
>
> -   while ((size = psi->read(&id, &type, &count, &time, &buf, &compressed,
> -&ecc_notice_size, psi)) > 0) {
> -   if (compressed && (type == PSTORE_TYPE_DMESG)) {
> +   while ((record.size = psi->read(&record.id, &record.type,
> +&record.count, &record.time,
> +&record.buf, &record.compressed,
> +&record.ecc_notice_size,
> +record.psi)) > 0) {
> +   if (record.compressed &&
> +   record.type == PSTORE_TYPE_DMESG) {
> if (big_oops_buf)
> -   unzipped_len = pstore_decompress(buf,
> -   big_oops_buf, size,
> +   unzipped_len = pstore_decompress(
> +   record.buf,
> +   big_oops_buf,
> +   record.size,
> big_oops_buf_sz);
>
> if (unzipped_len > 0) {
> -   if (ecc_notice_size)
> +   if (record.ecc_notice_size)
> memcpy(big_oops_buf + unzipped_len,
> -  buf + size, ecc_notice_size);
> -   kfree(buf);
> -   buf = big_oops_buf;
> -   size = unzipped_len;
> -   compressed = false;
> +  record.buf + recorrecord.size,

A typo on record.size.

Thanks,
Namhyung


Re: [PATCH 03/18] pstore: Avoid race in module unloading

2017-03-07 Thread Namhyung Kim
Hi Kees,

On Tue, Mar 7, 2017 at 6:55 AM, Kees Cook  wrote:
> Technically, it might be possible for struct pstore_info to go out of
> scope after the module_put(), so report the backend name first.

But in that case, using pstore will crash the kernel anyway, right?
If so, why pstore doesn't keep a reference until unregister?
Do I miss something?

Thanks,
Namhyung


>
> Signed-off-by: Kees Cook 
> ---
>  fs/pstore/platform.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/fs/pstore/platform.c b/fs/pstore/platform.c
> index 074fe85a2078..d69ef8a840b9 100644
> --- a/fs/pstore/platform.c
> +++ b/fs/pstore/platform.c
> @@ -722,10 +722,10 @@ int pstore_register(struct pstore_info *psi)
>  */
> backend = psi->name;
>
> -   module_put(owner);
> -
> pr_info("Registered %s as persistent store backend\n", psi->name);
>
> +   module_put(owner);
> +
> return 0;
>  }
>  EXPORT_SYMBOL_GPL(pstore_register);
> --
> 2.7.4
>


Re: [RESEND PATCH 1/6] trace/kprobes: fix check for kretprobe offset within function entry

2017-03-07 Thread Steven Rostedt

Please start a new thread. When sending patches as replies to other
patch threads, especially this deep into the thread, they will most
likely get ignored.

-- Steve



Re: [PATCH v4 1/3] perf: probe: factor out the ftrace README scanning

2017-03-07 Thread Steven Rostedt

FYI,

When creating new patch series, please start a new thread, and don't
post a patch as a reply to another patch. It gets easily lost that way.

-- Steve


On Thu,  2 Mar 2017 23:25:05 +0530
"Naveen N. Rao"  wrote:

> Simplify and separate out the ftrace README scanning logic into a
> separate helper. This is used subsequently to scan for all patterns of
> interest and to cache the result.
> 
> Since we are only interested in availability of probe argument type x,
> we will only scan for that.
> 
> Signed-off-by: Naveen N. Rao 
> ---


[PATCH] powerpc: kprobes: convert __kprobes to NOKPROBE_SYMBOL()

2017-03-07 Thread Naveen N. Rao
Along similar lines as commit 9326638cbee2 ("kprobes, x86: Use
NOKPROBE_SYMBOL() instead of __kprobes annotation"), convert __kprobes
annotation to either NOKPROBE_SYMBOL() or nokprobe_inline. The latter
forces inlining, in which case the caller needs to be added to
NOKPROBE_SYMBOL().

Also:
- blacklist kretprobe_trampoline
- blacklist arch_deref_entry_point, and
- convert a few regular inlines to nokprobe_inline in lib/sstep.c

A key benefit is the ability to detect such symbols as being
blacklisted. Before this patch:

  naveen@ubuntu:~/linux/tools/perf$ sudo cat 
/sys/kernel/debug/kprobes/blacklist | grep read_mem
  naveen@ubuntu:~/linux/tools/perf$ sudo ./perf probe read_mem
  Failed to write event: Invalid argument
Error: Failed to add events.
  naveen@ubuntu:~/linux/tools/perf$ dmesg | tail -1
  [ 3736.112815] Could not insert probe at _text+10014968: -22

After patch:
  naveen@ubuntu:~/linux/tools/perf$ sudo cat 
/sys/kernel/debug/kprobes/blacklist | grep read_mem
  0xc0072b50-0xc0072d20 read_mem
  naveen@ubuntu:~/linux/tools/perf$ sudo ./perf probe read_mem
  read_mem is blacklisted function, skip it.
  Added new events:
(null):(null)(on read_mem)
probe:read_mem   (on read_mem)

  You can now use it in all perf tools, such as:

  perf record -e probe:read_mem -aR sleep 1

  naveen@ubuntu:~/linux/tools/perf$ sudo grep " read_mem" /proc/kallsyms
  c0072b50 t read_mem
  c05f3b40 t read_mem
  naveen@ubuntu:~/linux/tools/perf$ sudo cat /sys/kernel/debug/kprobes/list
  c05f3b48  k  read_mem+0x8[DISABLED]

Signed-off-by: Naveen N. Rao 
---
 arch/powerpc/kernel/kprobes.c| 56 +--
 arch/powerpc/lib/code-patching.c |  4 +-
 arch/powerpc/lib/sstep.c | 82 +---
 3 files changed, 82 insertions(+), 60 deletions(-)

diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c
index fce05a38851c..6d2d464900c4 100644
--- a/arch/powerpc/kernel/kprobes.c
+++ b/arch/powerpc/kernel/kprobes.c
@@ -42,7 +42,7 @@ DEFINE_PER_CPU(struct kprobe_ctlblk, kprobe_ctlblk);
 
 struct kretprobe_blackpoint kretprobe_blacklist[] = {{NULL, NULL}};
 
-int __kprobes arch_prepare_kprobe(struct kprobe *p)
+int arch_prepare_kprobe(struct kprobe *p)
 {
int ret = 0;
kprobe_opcode_t insn = *p->addr;
@@ -74,30 +74,34 @@ int __kprobes arch_prepare_kprobe(struct kprobe *p)
p->ainsn.boostable = 0;
return ret;
 }
+NOKPROBE_SYMBOL(arch_prepare_kprobe);
 
-void __kprobes arch_arm_kprobe(struct kprobe *p)
+void arch_arm_kprobe(struct kprobe *p)
 {
*p->addr = BREAKPOINT_INSTRUCTION;
flush_icache_range((unsigned long) p->addr,
   (unsigned long) p->addr + sizeof(kprobe_opcode_t));
 }
+NOKPROBE_SYMBOL(arch_arm_kprobe);
 
-void __kprobes arch_disarm_kprobe(struct kprobe *p)
+void arch_disarm_kprobe(struct kprobe *p)
 {
*p->addr = p->opcode;
flush_icache_range((unsigned long) p->addr,
   (unsigned long) p->addr + sizeof(kprobe_opcode_t));
 }
+NOKPROBE_SYMBOL(arch_disarm_kprobe);
 
-void __kprobes arch_remove_kprobe(struct kprobe *p)
+void arch_remove_kprobe(struct kprobe *p)
 {
if (p->ainsn.insn) {
free_insn_slot(p->ainsn.insn, 0);
p->ainsn.insn = NULL;
}
 }
+NOKPROBE_SYMBOL(arch_remove_kprobe);
 
-static void __kprobes prepare_singlestep(struct kprobe *p, struct pt_regs 
*regs)
+static nokprobe_inline void prepare_singlestep(struct kprobe *p, struct 
pt_regs *regs)
 {
enable_single_step(regs);
 
@@ -110,37 +114,37 @@ static void __kprobes prepare_singlestep(struct kprobe 
*p, struct pt_regs *regs)
regs->nip = (unsigned long)p->ainsn.insn;
 }
 
-static void __kprobes save_previous_kprobe(struct kprobe_ctlblk *kcb)
+static nokprobe_inline void save_previous_kprobe(struct kprobe_ctlblk *kcb)
 {
kcb->prev_kprobe.kp = kprobe_running();
kcb->prev_kprobe.status = kcb->kprobe_status;
kcb->prev_kprobe.saved_msr = kcb->kprobe_saved_msr;
 }
 
-static void __kprobes restore_previous_kprobe(struct kprobe_ctlblk *kcb)
+static nokprobe_inline void restore_previous_kprobe(struct kprobe_ctlblk *kcb)
 {
__this_cpu_write(current_kprobe, kcb->prev_kprobe.kp);
kcb->kprobe_status = kcb->prev_kprobe.status;
kcb->kprobe_saved_msr = kcb->prev_kprobe.saved_msr;
 }
 
-static void __kprobes set_current_kprobe(struct kprobe *p, struct pt_regs 
*regs,
+static nokprobe_inline void set_current_kprobe(struct kprobe *p, struct 
pt_regs *regs,
struct kprobe_ctlblk *kcb)
 {
__this_cpu_write(current_kprobe, p);
kcb->kprobe_saved_msr = regs->msr;
 }
 
-void __kprobes arch_prepare_kretprobe(struct kretprobe_instance *ri,
- struct pt_regs *regs)
+void arch_prepare_kretprobe(struct kretprobe_instance 

Re: [PATCH] net: toshiba: ps3_genic_net: use new api ethtool_{get|set}_link_ksettings

2017-03-07 Thread Geoff Levand

On 03/05/2017 02:21 PM, Philippe Reynes wrote:

The ethtool api {get|set}_settings is deprecated.
We move this driver to new api {get|set}_link_ksettings.

As I don't have the hardware, I'd be very pleased if
someone may test this patch.


I tested this applied to v4.11-rc1 and it seems to work OK.
Thanks for your contribution.

Tested-by: Geoff Levand 


Re: [PATCH 06/18] pstore: Extract common arguments into structure

2017-03-07 Thread Kees Cook
On Tue, Mar 7, 2017 at 8:22 AM, Namhyung Kim  wrote:
> On Tue, Mar 7, 2017 at 6:55 AM, Kees Cook  wrote:
>> The read/mkfile pair pass the same arguments and should be cleared
>> between calls. Move to a structure and wipe it after every loop.
>>
>> Signed-off-by: Kees Cook 
>> ---
>>  fs/pstore/platform.c   | 55 
>> +++---
>>  include/linux/pstore.h | 28 -
>>  2 files changed, 57 insertions(+), 26 deletions(-)
>>
>> diff --git a/fs/pstore/platform.c b/fs/pstore/platform.c
>> index 320a673ecb5b..3fa1575a6e36 100644
>> --- a/fs/pstore/platform.c
>> +++ b/fs/pstore/platform.c
>> @@ -766,16 +766,9 @@ EXPORT_SYMBOL_GPL(pstore_unregister);
>>  void pstore_get_records(int quiet)
>>  {
>> struct pstore_info *psi = psinfo;
>> -   char *buf = NULL;
>> -   ssize_t size;
>> -   u64 id;
>> -   int count;
>> -   enum pstore_type_id type;
>> -   struct timespec time;
>> +   struct pstore_record record = { .psi = psi, };
>> int failed = 0, rc;
>> -   bool compressed;
>> int unzipped_len = -1;
>> -   ssize_t ecc_notice_size = 0;
>>
>> if (!psi)
>> return;
>> @@ -784,39 +777,51 @@ void pstore_get_records(int quiet)
>> if (psi->open && psi->open(psi))
>> goto out;
>>
>> -   while ((size = psi->read(&id, &type, &count, &time, &buf, &compressed,
>> -&ecc_notice_size, psi)) > 0) {
>> -   if (compressed && (type == PSTORE_TYPE_DMESG)) {
>> +   while ((record.size = psi->read(&record.id, &record.type,
>> +&record.count, &record.time,
>> +&record.buf, &record.compressed,
>> +&record.ecc_notice_size,
>> +record.psi)) > 0) {
>> +   if (record.compressed &&
>> +   record.type == PSTORE_TYPE_DMESG) {
>> if (big_oops_buf)
>> -   unzipped_len = pstore_decompress(buf,
>> -   big_oops_buf, size,
>> +   unzipped_len = pstore_decompress(
>> +   record.buf,
>> +   big_oops_buf,
>> +   record.size,
>> big_oops_buf_sz);
>>
>> if (unzipped_len > 0) {
>> -   if (ecc_notice_size)
>> +   if (record.ecc_notice_size)
>> memcpy(big_oops_buf + unzipped_len,
>> -  buf + size, ecc_notice_size);
>> -   kfree(buf);
>> -   buf = big_oops_buf;
>> -   size = unzipped_len;
>> -   compressed = false;
>> +  record.buf + recorrecord.size,
>
> A typo on record.size.

Thanks! Yeah, 0-day noticed this too. I've refreshed the patches in my
tree with the correction now.

-Kees

-- 
Kees Cook
Pixel Security


[PATCH v2 6/6] perf: powerpc: choose local entry point with kretprobes

2017-03-07 Thread Naveen N. Rao
perf now uses an offset from _text/_stext for kretprobes if the kernel
supports it, rather than the actual function name. As such, let's choose
the LEP for powerpc ABIv2 so as to ensure the probe gets hit. Do it only
if the kernel supports specifying offsets with kretprobes.

Signed-off-by: Naveen N. Rao 
---
Changes:
- updated to address build issues due to dropping patch 5/6.

 tools/perf/arch/powerpc/util/sym-handling.c | 14 ++
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/tools/perf/arch/powerpc/util/sym-handling.c 
b/tools/perf/arch/powerpc/util/sym-handling.c
index 1030a6e504bb..b8cbefc49aca 100644
--- a/tools/perf/arch/powerpc/util/sym-handling.c
+++ b/tools/perf/arch/powerpc/util/sym-handling.c
@@ -10,6 +10,7 @@
 #include "symbol.h"
 #include "map.h"
 #include "probe-event.h"
+#include "probe-file.h"
 
 #ifdef HAVE_LIBELF_SUPPORT
 bool elf__needs_adjust_symbols(GElf_Ehdr ehdr)
@@ -79,11 +80,16 @@ void arch__fix_tev_from_maps(struct perf_probe_event *pev,
 * However, if the user specifies an offset, we fall back to using the
 * GEP since all userspace applications (objdump/readelf) show function
 * disassembly with offsets from the GEP.
-*
-* In addition, we shouldn't specify an offset for kretprobes.
 */
-   if (pev->point.offset || (!pev->uprobes && pev->point.retprobe) ||
-   !map || !sym)
+   if (pev->point.offset || !map || !sym)
+   return;
+
+   /* For kretprobes, add an offset only if the kernel supports it */
+   if (!pev->uprobes && pev->point.retprobe
+#ifdef HAVE_LIBELF_SUPPORT
+   && !kretprobe_offset_is_supported()
+#endif
+   )
return;
 
lep_offset = PPC64_LOCAL_ENTRY_OFFSET(sym->arch_sym);
-- 
2.11.1



Re: [PATCH 5/6] perf: probes: move ftrace README parsing logic into trace-event-parse.c

2017-03-07 Thread Naveen N. Rao
On 2017/03/07 04:51PM, Masami Hiramatsu wrote:
> On Tue,  7 Mar 2017 16:17:40 +0530
> "Naveen N. Rao"  wrote:
> 
> > probe-file.c needs libelf, but scanning ftrace README does not require
> > that. As such, move the ftrace README scanning logic out of probe-file.c
> > and into trace-event-parse.c.
> 
> As far as I can see, there is no reason to push this out from probe-file.c
> because anyway this API using code requires libelf. Without this, I can
> still build perf with NO_LIBELF=1. So I wouldn't like to pick this.
> (I think we can drop this from this series)

Ok. We can drop this. I'll rework patch 6/6.

Thanks,
Naveen



4.11.0-rc1 boot resulted in WARNING: CPU: 14 PID: 1722 at fs/sysfs/dir.c:31 .sysfs_warn_dup+0x78/0xb0

2017-03-07 Thread Abdul Haleem

Hi,

Today's mainline (4.11.0-rc1) booted with warnings on Power7 LPAR.

Issue is not reproducible all the time.

traces:

Found device VDASD 5.
Mounting /home...
Reached target Swap.
Found device VDASD 2.

Mounting /boot...

sysfs: cannot create duplicate filename '/fs/xfs/sda'
[ cut here ]
WARNING: CPU: 14 PID: 1722 at fs/sysfs/dir.c:31 .sysfs_warn_dup
+0x78/0xb0
Modules linked in: sg(E) nfsd(E) auth_rpcgss(E) nfs_acl(E) lockd(E)
grace(E) sunrpc(E) binfmt_misc(E) ip_tables(E) ext4(E) mbcache(E)
jbd2(E) sd_mod(E) ibmvscsi(E) ibmveth(E) scsi_transport_srp(E)
CPU: 14 PID: 1722 Comm: mount Tainted: GW   E   4.11.0-rc1-autotest #1
task: c009ed3f9c80 task.stack: c009ed43
NIP: c03a6c68 LR: c03a6c64 CTR: 01764c5c
REGS: c009ed4333c0 TRAP: 0700   Tainted: GW   E
(4.11.0-rc1-autotest)
MSR: 8282b032 
  CR: 22022822  XER: 0006
CFAR: c0994958 SOFTE: 1
GPR00: c03a6c64 c009ed433640 c138a500 0035
GPR04: c009ff88ada0 c009ff8a1658 014cfc2c 
GPR08:  c0dd146c 0009feac 3fef
GPR12: 22022844 ce9f7e00 37409c40 37409c30
GPR16: 37409c28   3741f1c8
GPR20: 0100147d1270  c0ed 3fffa0f53384
GPR24: c394d178 c0c054c0 c1742e68 c000ff3ea640
GPR28: c394d640 c013eec28c28 c009f2ab0148 c3833000
NIP [c03a6c68] .sysfs_warn_dup+0x78/0xb0
LR [c03a6c64] .sysfs_warn_dup+0x74/0xb0
Call Trace:
[c009ed433640] [c03a6c64] .sysfs_warn_dup+0x74/0xb0 (unreliable)
[c009ed4336d0] [c03a6de4] .sysfs_create_dir_ns+0xc4/0xd0
[c009ed433760] [c0551048] .kobject_add_internal+0xd8/0x450
[c009ed433800] [c055141c] .kobject_init_and_add+0x5c/0x90
[c009ed433890] [c044ac54] .xfs_mountfs+0x224/0xa30
[c009ed433960] [c0452a90] .xfs_fs_fill_super+0x490/0x620
[c009ed433a10] [c02fefc0] .mount_bdev+0x220/0x260
[c009ed433ac0] [c04509b8] .xfs_fs_mount+0x18/0x30
[c009ed433b30] [c0300520] .mount_fs+0x70/0x210
[c009ed433bf0] [c0325930] .vfs_kern_mount+0x60/0x1c0
[c009ed433cb0] [c032a458] .do_mount+0x268/0xee0
[c009ed433d90] [c032b4ec] .SyS_mount+0x8c/0x100
[c009ed433e30] [c000b184] system_call+0x38/0xe0
Instruction dump:
7fa3eb78 3880 7fe5fb78 38c01000 4bffa929 6000 3c62ff8b 7fe4fb78
38636ec8 7fc5f378 485edcb9 6000 <0fe0> 7fe3fb78 4bf1fb01 6000
---[ end trace 78f08bafbc2388f3 ]---
kobject_add_internal failed for sda with -EEXIST, don't try to register
things with the same name in the same directory.


-- 
Regard's

Abdul Haleem
IBM Linux Technology Centre





Re: [PATCH 03/18] pstore: Avoid race in module unloading

2017-03-07 Thread Kees Cook
On Tue, Mar 7, 2017 at 8:16 AM, Namhyung Kim  wrote:
> Hi Kees,
>
> On Tue, Mar 7, 2017 at 6:55 AM, Kees Cook  wrote:
>> Technically, it might be possible for struct pstore_info to go out of
>> scope after the module_put(), so report the backend name first.
>
> But in that case, using pstore will crash the kernel anyway, right?
> If so, why pstore doesn't keep a reference until unregister?
> Do I miss something?

I could be wrong with this, since the backend can't call unregister
until register has finished... I'll drop this patch.

-Kees

>
> Thanks,
> Namhyung
>
>
>>
>> Signed-off-by: Kees Cook 
>> ---
>>  fs/pstore/platform.c | 4 ++--
>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/pstore/platform.c b/fs/pstore/platform.c
>> index 074fe85a2078..d69ef8a840b9 100644
>> --- a/fs/pstore/platform.c
>> +++ b/fs/pstore/platform.c
>> @@ -722,10 +722,10 @@ int pstore_register(struct pstore_info *psi)
>>  */
>> backend = psi->name;
>>
>> -   module_put(owner);
>> -
>> pr_info("Registered %s as persistent store backend\n", psi->name);
>>
>> +   module_put(owner);
>> +
>> return 0;
>>  }
>>  EXPORT_SYMBOL_GPL(pstore_register);
>> --
>> 2.7.4
>>



-- 
Kees Cook
Pixel Security


Re: [PATCH v5 01/15] stacktrace/x86: add function for detecting reliable stack traces

2017-03-07 Thread Josh Poimboeuf
On Tue, Mar 07, 2017 at 05:50:55PM +1100, Balbir Singh wrote:
> On Mon, 2017-02-13 at 19:42 -0600, Josh Poimboeuf wrote:
> > For live patching and possibly other use cases, a stack trace is only
> > useful if it can be assured that it's completely reliable.  Add a new
> > save_stack_trace_tsk_reliable() function to achieve that.
> > 
> > Note that if the target task isn't the current task, and the target task
> > is allowed to run, then it could be writing the stack while the unwinder
> > is reading it, resulting in possible corruption.  So the caller of
> > save_stack_trace_tsk_reliable() must ensure that the task is either
> > 'current' or inactive.
> > 
> > save_stack_trace_tsk_reliable() relies on the x86 unwinder's detection
> > of pt_regs on the stack.  If the pt_regs are not user-mode registers
> > from a syscall, then they indicate an in-kernel interrupt or exception
> > (e.g. preemption or a page fault), in which case the stack is considered
> > unreliable due to the nature of frame pointers.
> > 
> > It also relies on the x86 unwinder's detection of other issues, such as:
> > 
> > - corrupted stack data
> > - stack grows the wrong way
> > - stack walk doesn't reach the bottom
> > - user didn't provide a large enough entries array
> > 
> > Such issues are reported by checking unwind_error() and !unwind_done().
> > 
> > Also add CONFIG_HAVE_RELIABLE_STACKTRACE so arch-independent code can
> > determine at build time whether the function is implemented.
> > 
> > Signed-off-by: Josh Poimboeuf 
> > ---
> 
> Could you comment on why we need a reliable trace for live-patching? Are
> we in any way reliant on the stack trace to patch something broken?

I tried to cover this comprehensively in patch 13/15 in
Documentation/livepatch/livepatch.txt.  Does that answer your questions?

-- 
Josh


Re: [PATCH 5/6] perf: probes: move ftrace README parsing logic into trace-event-parse.c

2017-03-07 Thread Masami Hiramatsu
On Tue,  7 Mar 2017 16:17:40 +0530
"Naveen N. Rao"  wrote:

> probe-file.c needs libelf, but scanning ftrace README does not require
> that. As such, move the ftrace README scanning logic out of probe-file.c
> and into trace-event-parse.c.

As far as I can see, there is no reason to push this out from probe-file.c
because anyway this API using code requires libelf. Without this, I can
still build perf with NO_LIBELF=1. So I wouldn't like to pick this.
(I think we can drop this from this series)
Thank you,

> 
> Signed-off-by: Naveen N. Rao 
> ---
>  tools/perf/util/probe-file.c| 87 +++-
>  tools/perf/util/probe-file.h|  2 -
>  tools/perf/util/trace-event-parse.c | 89 
> +
>  tools/perf/util/trace-event.h   |  4 ++
>  4 files changed, 99 insertions(+), 83 deletions(-)
> 
> diff --git a/tools/perf/util/probe-file.c b/tools/perf/util/probe-file.c
> index 1542cd0d6799..ff872fa30cdb 100644
> --- a/tools/perf/util/probe-file.c
> +++ b/tools/perf/util/probe-file.c
> @@ -26,6 +26,7 @@
>  #include 
>  #include "probe-event.h"
>  #include "probe-file.h"
> +#include "trace-event.h"
>  #include "session.h"
>  
>  #define MAX_CMDLEN 256
> @@ -70,33 +71,17 @@ static void print_both_open_warning(int kerr, int uerr)
>   }
>  }
>  
> -int open_trace_file(const char *trace_file, bool readwrite)
> -{
> - char buf[PATH_MAX];
> - int ret;
> -
> - ret = e_snprintf(buf, PATH_MAX, "%s/%s",
> -  tracing_path, trace_file);
> - if (ret >= 0) {
> - pr_debug("Opening %s write=%d\n", buf, readwrite);
> - if (readwrite && !probe_event_dry_run)
> - ret = open(buf, O_RDWR | O_APPEND, 0);
> - else
> - ret = open(buf, O_RDONLY, 0);
> -
> - if (ret < 0)
> - ret = -errno;
> - }
> - return ret;
> -}
> -
>  static int open_kprobe_events(bool readwrite)
>  {
> + if (probe_event_dry_run)
> + readwrite = false;
>   return open_trace_file("kprobe_events", readwrite);
>  }
>  
>  static int open_uprobe_events(bool readwrite)
>  {
> + if (probe_event_dry_run)
> + readwrite = false;
>   return open_trace_file("uprobe_events", readwrite);
>  }
>  
> @@ -877,72 +862,12 @@ int probe_cache__show_all_caches(struct strfilter 
> *filter)
>   return 0;
>  }
>  
> -enum ftrace_readme {
> - FTRACE_README_PROBE_TYPE_X = 0,
> - FTRACE_README_KRETPROBE_OFFSET,
> - FTRACE_README_END,
> -};
> -
> -static struct {
> - const char *pattern;
> - bool avail;
> -} ftrace_readme_table[] = {
> -#define DEFINE_TYPE(idx, pat)\
> - [idx] = {.pattern = pat, .avail = false}
> - DEFINE_TYPE(FTRACE_README_PROBE_TYPE_X, "*type: * x8/16/32/64,*"),
> - DEFINE_TYPE(FTRACE_README_KRETPROBE_OFFSET, "*place (kretprobe): *"),
> -};
> -
> -static bool scan_ftrace_readme(enum ftrace_readme type)
> -{
> - int fd;
> - FILE *fp;
> - char *buf = NULL;
> - size_t len = 0;
> - bool ret = false;
> - static bool scanned = false;
> -
> - if (scanned)
> - goto result;
> -
> - fd = open_trace_file("README", false);
> - if (fd < 0)
> - return ret;
> -
> - fp = fdopen(fd, "r");
> - if (!fp) {
> - close(fd);
> - return ret;
> - }
> -
> - while (getline(&buf, &len, fp) > 0)
> - for (enum ftrace_readme i = 0; i < FTRACE_README_END; i++)
> - if (!ftrace_readme_table[i].avail)
> - ftrace_readme_table[i].avail =
> - strglobmatch(buf, 
> ftrace_readme_table[i].pattern);
> - scanned = true;
> -
> - fclose(fp);
> - free(buf);
> -
> -result:
> - if (type >= FTRACE_README_END)
> - return false;
> -
> - return ftrace_readme_table[type].avail;
> -}
> -
>  bool probe_type_is_available(enum probe_type type)
>  {
>   if (type >= PROBE_TYPE_END)
>   return false;
>   else if (type == PROBE_TYPE_X)
> - return scan_ftrace_readme(FTRACE_README_PROBE_TYPE_X);
> + return probe_type_x_is_supported();
>  
>   return true;
>  }
> -
> -bool kretprobe_offset_is_supported(void)
> -{
> - return scan_ftrace_readme(FTRACE_README_KRETPROBE_OFFSET);
> -}
> diff --git a/tools/perf/util/probe-file.h b/tools/perf/util/probe-file.h
> index dbf95a00864a..eba44c3e9dca 100644
> --- a/tools/perf/util/probe-file.h
> +++ b/tools/perf/util/probe-file.h
> @@ -35,7 +35,6 @@ enum probe_type {
>  
>  /* probe-file.c depends on libelf */
>  #ifdef HAVE_LIBELF_SUPPORT
> -int open_trace_file(const char *trace_file, bool readwrite);
>  int probe_file__open(int flag);
>  int probe_file__open_both(int *kfd, int *ufd, int flag);
>  struct strlist *probe_file__get_namelist(int fd);
> @@ 

Re: [PATCH v5.1 15/15] livepatch: allow removal of a disabled patch

2017-03-07 Thread Miroslav Benes
On Mon, 6 Mar 2017, Josh Poimboeuf wrote:

> 
> Currently we do not allow patch module to unload since there is no
> method to determine if a task is still running in the patched code.
> 
> The consistency model gives us the way because when the unpatching
> finishes we know that all tasks were marked as safe to call an original
> function. Thus every new call to the function calls the original code
> and at the same time no task can be somewhere in the patched code,
> because it had to leave that code to be marked as safe.
> 
> We can safely let the patch module go after that.
> 
> Completion is used for synchronization between module removal and sysfs
> infrastructure in a similar way to commit 942e443127e9 ("module: Fix
> mod->mkobj.kobj potentially freed too early").
> 
> Note that we still do not allow the removal for immediate model, that is
> no consistency model. The module refcount may increase in this case if
> somebody disables and enables the patch several times. This should not
> cause any harm.
> 
> With this change a call to try_module_get() is moved to
> __klp_enable_patch from klp_register_patch to make module reference
> counting symmetric (module_put() is in a patch disable path) and to
> allow to take a new reference to a disabled module when being enabled.
> 
> Finally, we need to be very careful about possible races between
> klp_unregister_patch(), kobject_put() functions and operations
> on the related sysfs files.
> 
> kobject_put(&patch->kobj) must be called without klp_mutex. Otherwise,
> it might be blocked by enabled_store() that needs the mutex as well.
> In addition, enabled_store() must check if the patch was not
> unregistered in the meantime.
> 
> There is no need to do the same for other kobject_put() callsites
> at the moment. Their sysfs operations neither take the lock nor
> they access any data that might be freed in the meantime.
> 
> There was an attempt to use kobjects the right way and prevent these
> races by design. But it made the patch definition more complicated
> and opened another can of worms. See
> https://lkml.kernel.org/r/1464018848-4303-1-git-send-email-pmla...@suse.com
> 
> [Thanks to Petr Mladek for improving the commit message.]
> 
> Signed-off-by: Miroslav Benes 
> Signed-off-by: Josh Poimboeuf 
> Reviewed-by: Petr Mladek 
> ---
> v5.1: improve error handling in enable path -- call module_put() in
>   klp_cancel_transition()

Looks good.

Acked-by: Miroslav Benes 

if it is needed for Josh's code.

Regards,
Miroslav


Re: [PATCH 5/6] perf: probes: move ftrace README parsing logic into trace-event-parse.c

2017-03-07 Thread Naveen N. Rao
On 2017/03/07 03:03PM, Masami Hiramatsu wrote:
> On Tue,  7 Mar 2017 16:17:40 +0530
> "Naveen N. Rao"  wrote:
> 
> > probe-file.c needs libelf, but scanning ftrace README does not require
> > that. As such, move the ftrace README scanning logic out of probe-file.c
> > and into trace-event-parse.c.
> 
> Hmm, it seems probe-file.c doesn't require libelf at all...
> I would like to keep ftrace related things in probe-file.c.

Not sure I understand. probe-file.h explicitly calls out a need for 
libelf due to the probe cache and related routines - commit 
40218daea1db1 ("perf list: Show SDT and pre-cached events").

However, if you prefer to retain the ftrace README scanning here, we can 
drop this patch and I can update patch 6 to check for libelf.

Thanks,
Naveen



Re: [PATCH v5 13/15] livepatch: change to a per-task consistency model

2017-03-07 Thread Miroslav Benes
On Mon, 13 Feb 2017, Josh Poimboeuf wrote:

> Change livepatch to use a basic per-task consistency model.  This is the
> foundation which will eventually enable us to patch those ~10% of
> security patches which change function or data semantics.  This is the
> biggest remaining piece needed to make livepatch more generally useful.
> 
> This code stems from the design proposal made by Vojtech [1] in November
> 2014.  It's a hybrid of kGraft and kpatch: it uses kGraft's per-task
> consistency and syscall barrier switching combined with kpatch's stack
> trace switching.  There are also a number of fallback options which make
> it quite flexible.
> 
> Patches are applied on a per-task basis, when the task is deemed safe to
> switch over.  When a patch is enabled, livepatch enters into a
> transition state where tasks are converging to the patched state.
> Usually this transition state can complete in a few seconds.  The same
> sequence occurs when a patch is disabled, except the tasks converge from
> the patched state to the unpatched state.
> 
> An interrupt handler inherits the patched state of the task it
> interrupts.  The same is true for forked tasks: the child inherits the
> patched state of the parent.
> 
> Livepatch uses several complementary approaches to determine when it's
> safe to patch tasks:
> 
> 1. The first and most effective approach is stack checking of sleeping
>tasks.  If no affected functions are on the stack of a given task,
>the task is patched.  In most cases this will patch most or all of
>the tasks on the first try.  Otherwise it'll keep trying
>periodically.  This option is only available if the architecture has
>reliable stacks (HAVE_RELIABLE_STACKTRACE).
> 
> 2. The second approach, if needed, is kernel exit switching.  A
>task is switched when it returns to user space from a system call, a
>user space IRQ, or a signal.  It's useful in the following cases:
> 
>a) Patching I/O-bound user tasks which are sleeping on an affected
>   function.  In this case you have to send SIGSTOP and SIGCONT to
>   force it to exit the kernel and be patched.
>b) Patching CPU-bound user tasks.  If the task is highly CPU-bound
>   then it will get patched the next time it gets interrupted by an
>   IRQ.
>c) In the future it could be useful for applying patches for
>   architectures which don't yet have HAVE_RELIABLE_STACKTRACE.  In
>   this case you would have to signal most of the tasks on the
>   system.  However this isn't supported yet because there's
>   currently no way to patch kthreads without
>   HAVE_RELIABLE_STACKTRACE.
> 
> 3. For idle "swapper" tasks, since they don't ever exit the kernel, they
>instead have a klp_update_patch_state() call in the idle loop which
>allows them to be patched before the CPU enters the idle state.
> 
>(Note there's not yet such an approach for kthreads.)
> 
> All the above approaches may be skipped by setting the 'immediate' flag
> in the 'klp_patch' struct, which will disable per-task consistency and
> patch all tasks immediately.  This can be useful if the patch doesn't
> change any function or data semantics.  Note that, even with this flag
> set, it's possible that some tasks may still be running with an old
> version of the function, until that function returns.
> 
> There's also an 'immediate' flag in the 'klp_func' struct which allows
> you to specify that certain functions in the patch can be applied
> without per-task consistency.  This might be useful if you want to patch
> a common function like schedule(), and the function change doesn't need
> consistency but the rest of the patch does.
> 
> For architectures which don't have HAVE_RELIABLE_STACKTRACE, the user
> must set patch->immediate which causes all tasks to be patched
> immediately.  This option should be used with care, only when the patch
> doesn't change any function or data semantics.
> 
> In the future, architectures which don't have HAVE_RELIABLE_STACKTRACE
> may be allowed to use per-task consistency if we can come up with
> another way to patch kthreads.
> 
> The /sys/kernel/livepatch/<patch>/transition file shows whether a patch
> is in transition.  Only a single patch (the topmost patch on the stack)
> can be in transition at a given time.  A patch can remain in transition
> indefinitely, if any of the tasks are stuck in the initial patch state.
> 
> A transition can be reversed and effectively canceled by writing the
> opposite value to the /sys/kernel/livepatch/<patch>/enabled file while
> the transition is in progress.  Then all the tasks will attempt to
> converge back to the original patch state.
> 
> [1] https://lkml.kernel.org/r/20141107140458.ga21...@suse.cz
> 
> Signed-off-by: Josh Poimboeuf 

I looked at the patch again and could not see any problem with it. I 
tested it with a couple of live patches too and it worked as expected. 
Good job.

Acked-by: Miroslav Benes 

Re: [PATCH 5/6] perf: probes: move ftrace README parsing logic into trace-event-parse.c

2017-03-07 Thread Masami Hiramatsu
On Tue,  7 Mar 2017 16:17:40 +0530
"Naveen N. Rao"  wrote:

> probe-file.c needs libelf, but scanning ftrace README does not require
> that. As such, move the ftrace README scanning logic out of probe-file.c
> and into trace-event-parse.c.

Hmm, it seems probe-file.c doesn't require libelf at all...
I would like to keep ftrace related things in probe-file.c.

Thanks,

> 
> Signed-off-by: Naveen N. Rao 
> ---
>  tools/perf/util/probe-file.c| 87 +++-
>  tools/perf/util/probe-file.h|  2 -
>  tools/perf/util/trace-event-parse.c | 89 +
>  tools/perf/util/trace-event.h   |  4 ++
>  4 files changed, 99 insertions(+), 83 deletions(-)
> 
> diff --git a/tools/perf/util/probe-file.c b/tools/perf/util/probe-file.c
> index 1542cd0d6799..ff872fa30cdb 100644
> --- a/tools/perf/util/probe-file.c
> +++ b/tools/perf/util/probe-file.c
> @@ -26,6 +26,7 @@
>  #include 
>  #include "probe-event.h"
>  #include "probe-file.h"
> +#include "trace-event.h"
>  #include "session.h"
>  
>  #define MAX_CMDLEN 256
> @@ -70,33 +71,17 @@ static void print_both_open_warning(int kerr, int uerr)
>   }
>  }
>  
> -int open_trace_file(const char *trace_file, bool readwrite)
> -{
> - char buf[PATH_MAX];
> - int ret;
> -
> - ret = e_snprintf(buf, PATH_MAX, "%s/%s",
> -  tracing_path, trace_file);
> - if (ret >= 0) {
> - pr_debug("Opening %s write=%d\n", buf, readwrite);
> - if (readwrite && !probe_event_dry_run)
> - ret = open(buf, O_RDWR | O_APPEND, 0);
> - else
> - ret = open(buf, O_RDONLY, 0);
> -
> - if (ret < 0)
> - ret = -errno;
> - }
> - return ret;
> -}
> -
>  static int open_kprobe_events(bool readwrite)
>  {
> + if (probe_event_dry_run)
> + readwrite = false;
>   return open_trace_file("kprobe_events", readwrite);
>  }
>  
>  static int open_uprobe_events(bool readwrite)
>  {
> + if (probe_event_dry_run)
> + readwrite = false;
>   return open_trace_file("uprobe_events", readwrite);
>  }
>  
> @@ -877,72 +862,12 @@ int probe_cache__show_all_caches(struct strfilter *filter)
>   return 0;
>  }
>  
> -enum ftrace_readme {
> - FTRACE_README_PROBE_TYPE_X = 0,
> - FTRACE_README_KRETPROBE_OFFSET,
> - FTRACE_README_END,
> -};
> -
> -static struct {
> - const char *pattern;
> - bool avail;
> -} ftrace_readme_table[] = {
> -#define DEFINE_TYPE(idx, pat)\
> - [idx] = {.pattern = pat, .avail = false}
> - DEFINE_TYPE(FTRACE_README_PROBE_TYPE_X, "*type: * x8/16/32/64,*"),
> - DEFINE_TYPE(FTRACE_README_KRETPROBE_OFFSET, "*place (kretprobe): *"),
> -};
> -
> -static bool scan_ftrace_readme(enum ftrace_readme type)
> -{
> - int fd;
> - FILE *fp;
> - char *buf = NULL;
> - size_t len = 0;
> - bool ret = false;
> - static bool scanned = false;
> -
> - if (scanned)
> - goto result;
> -
> - fd = open_trace_file("README", false);
> - if (fd < 0)
> - return ret;
> -
> - fp = fdopen(fd, "r");
> - if (!fp) {
> - close(fd);
> - return ret;
> - }
> -
> - while (getline(, , fp) > 0)
> - for (enum ftrace_readme i = 0; i < FTRACE_README_END; i++)
> - if (!ftrace_readme_table[i].avail)
> - ftrace_readme_table[i].avail =
> - strglobmatch(buf, ftrace_readme_table[i].pattern);
> - scanned = true;
> -
> - fclose(fp);
> - free(buf);
> -
> -result:
> - if (type >= FTRACE_README_END)
> - return false;
> -
> - return ftrace_readme_table[type].avail;
> -}
> -
>  bool probe_type_is_available(enum probe_type type)
>  {
>   if (type >= PROBE_TYPE_END)
>   return false;
>   else if (type == PROBE_TYPE_X)
> - return scan_ftrace_readme(FTRACE_README_PROBE_TYPE_X);
> + return probe_type_x_is_supported();
>  
>   return true;
>  }
> -
> -bool kretprobe_offset_is_supported(void)
> -{
> - return scan_ftrace_readme(FTRACE_README_KRETPROBE_OFFSET);
> -}
> diff --git a/tools/perf/util/probe-file.h b/tools/perf/util/probe-file.h
> index dbf95a00864a..eba44c3e9dca 100644
> --- a/tools/perf/util/probe-file.h
> +++ b/tools/perf/util/probe-file.h
> @@ -35,7 +35,6 @@ enum probe_type {
>  
>  /* probe-file.c depends on libelf */
>  #ifdef HAVE_LIBELF_SUPPORT
> -int open_trace_file(const char *trace_file, bool readwrite);
>  int probe_file__open(int flag);
>  int probe_file__open_both(int *kfd, int *ufd, int flag);
>  struct strlist *probe_file__get_namelist(int fd);
> @@ -65,7 +64,6 @@ struct probe_cache_entry *probe_cache__find_by_name(struct probe_cache *pcache,
>   

Re: [PATCH kernel v6 10/10] KVM: PPC: VFIO: Add in-kernel acceleration for VFIO

2017-03-07 Thread David Gibson
On Tue, Mar 07, 2017 at 10:07:27PM +1100, Alexey Kardashevskiy wrote:
> On 06/03/17 16:04, Alexey Kardashevskiy wrote:
> > On 06/03/17 15:30, David Gibson wrote:
> >> On Fri, Mar 03, 2017 at 06:09:25PM +1100, Alexey Kardashevskiy wrote:
> >>> On 03/03/17 16:59, David Gibson wrote:
>  On Thu, Mar 02, 2017 at 07:56:44PM +1100, Alexey Kardashevskiy wrote:
> > This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
> > and H_STUFF_TCE requests targeted at an IOMMU TCE table used for VFIO
> > without passing them to user space which saves time on switching
> > to user space and back.
> >
> > This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
> > KVM tries to handle a TCE request in real mode; if that fails,
> > it passes the request to virtual mode to complete the operation.
> > If the virtual mode handler also fails, the request is passed to
> > the user space; this is not expected to happen though.
> >
> > To avoid dealing with page use counters (which is tricky in real mode),
> > this only accelerates SPAPR TCE IOMMU v2 clients which are required
> > to pre-register the userspace memory. The very first TCE request will
> > be handled in the VFIO SPAPR TCE driver anyway as the userspace view
> > of the TCE table (iommu_table::it_userspace) is not allocated till
> > the very first mapping happens and we cannot call vmalloc in real mode.
> >
> > If we fail to update a hardware IOMMU table for an unexpected reason, we just
> > clear it and move on as there is nothing really we can do about it -
> > for example, if we hot plug a VFIO device to a guest, existing TCE tables
> > will be mirrored automatically to the hardware and there is no interface
> > to report to the guest about possible failures.
> >
> > This adds new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to
> > the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd
> > and associates a physical IOMMU table with the SPAPR TCE table (which
> > is a guest view of the hardware IOMMU table). The iommu_table object
> > is cached and referenced so we do not have to look it up in real mode.
> >
> > This does not implement the UNSET counterpart as there is no use for it -
> > once the acceleration is enabled, the existing userspace won't
> > disable it unless a VFIO container is destroyed; this adds necessary
> > cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.
> >
> > As this creates a descriptor per IOMMU table-LIOBN couple (called
> > kvmppc_spapr_tce_iommu_table), it is possible to have several
> > descriptors with the same iommu_table (hardware IOMMU table) attached
> > to the same LIOBN; we do not remove duplicates though as
> > iommu_table_ops::exchange does not just update a TCE entry (which is
> > shared among IOMMU groups) but also invalidates the TCE cache
> > (one per IOMMU group).
> >
> > This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
> > space.
> >
> > This finally makes use of vfio_external_user_iommu_id() which was
> > introduced quite some time ago and was considered for removal.
> >
> > Tests show that this patch increases transmission speed from 220MB/s
> > to 750..1020MB/s on 10Gb network (Chelsio CXGB3 10Gb ethernet card).
> >
> > Signed-off-by: Alexey Kardashevskiy 
> > ---
> > Changes:
> > v6:
> > * changed handling of errors returned by kvmppc_(rm_)tce_iommu_(un)map()
> > * moved kvmppc_gpa_to_ua() to TCE validation
> >
> > v5:
> > * changed error codes in multiple places
> > * added bunch of WARN_ON() in places which should not really happen
> > * adde a check that an iommu table is not attached already to LIOBN
> > * dropped explicit calls to iommu_tce_clear_param_check/
> > iommu_tce_put_param_check as kvmppc_tce_validate/kvmppc_ioba_validate
> > call them anyway (since the previous patch)
> > * if we fail to update a hardware IOMMU table for an unexpected reason,
> > this just clears the entry
> >
> > v4:
> > * added note to the commit log about allowing multiple updates of
> > the same IOMMU table;
> > * instead of checking for if any memory was preregistered, this
> > returns H_TOO_HARD if a specific page was not;
> > * fixed comments from v3 about error handling in many places;
> > * simplified TCE handlers and merged IOMMU parts inline - for example,
> > there used to be kvmppc_h_put_tce_iommu(), now it is merged into
> > kvmppc_h_put_tce(); this allows to check IOBA boundaries against
> > the first attached table only (makes the code simpler);
> >
> > v3:
> > * simplified not to use VFIO group notifiers
> > * reworked cleanup, should be cleaner/simpler now
> >
> > v2:
> > * reworked to use new 

[GIT PULL] Please pull powerpc/linux.git powerpc-4.11-3 tag

2017-03-07 Thread Michael Ellerman
Hi Linus,

Please pull the first set of powerpc fixes for 4.11:

The following changes since commit b286cedd473006b33d5ae076afac509e6b2c3bf4:

  Merge tag 'powerpc-4.11-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux (2017-03-01 10:10:16 -0800)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git tags/powerpc-4.11-3

for you to fetch changes up to a7d2475af7aedcb9b5c6343989a8bfadbf84429b:

  powerpc: Sort the selects under CONFIG_PPC (2017-03-06 23:05:42 +1100)


powerpc fixes for 4.11 #3

Five fairly small fixes for things that went in this cycle.

A fairly large patch to rework the CAS logic on Power9, necessitated by a late
change to the firmware API, and we can't boot without it.

Three fixes going to stable, allowing more instructions to be emulated on LE,
fixing a boot crash on 32-bit Freescale BookE machines, and the OPAL XICS
workaround.

And a patch from me to sort the selects under CONFIG_PPC. Annoying churn, but
worth it in the long run, and best for it to go in now to avoid conflicts.

Thanks to:
  Alexey Kardashevskiy, Anton Blanchard, Balbir Singh, Gautham R. Shenoy,
  Laurentiu Tudor, Nicholas Piggin, Paul Mackerras, Ravi Bangoria, Sachin Sant,
  Shile Zhang, Suraj Jitindar Singh.


Alexey Kardashevskiy (1):
  powerpc/powernv: Fix opal tracepoints with JUMP_LABEL=n

Anton Blanchard (1):
  powerpc/64: Avoid panic during boot due to divide by zero in init_cache_info()

Balbir Singh (1):
  powerpc/xics: Work around limitations of OPAL XICS priority handling

Gautham R. Shenoy (1):
  powerpc/powernv: Fix bug due to labeling ambiguity in power_enter_stop

Laurentiu Tudor (1):
  powerpc/booke: Fix boot crash due to null hugepd

Michael Ellerman (2):
  powerpc/64: Fix L1D cache shape vector reporting L1I values
  powerpc: Sort the selects under CONFIG_PPC

Nicholas Piggin (1):
  powerpc: Fix compiling a BE kernel with a powerpc64le toolchain

Paul Mackerras (1):
  powerpc/64: Invalidate process table caching after setting process table

Ravi Bangoria (2):
  powerpc: Emulation support for load/store instructions on LE
  powerpc: emulate_step() tests for load/store instructions

Sachin Sant (1):
  selftest/powerpc: Fix false failures for skipped tests

Shile Zhang (1):
  powerpc/64: Fix checksum folding in csum_add()

Suraj Jitindar Singh (2):
  powerpc: Parse the command line before calling CAS
  powerpc: Update to new option-vector-5 format for CAS

 arch/powerpc/Kconfig   | 138 
 arch/powerpc/Makefile  |  11 +-
 arch/powerpc/include/asm/checksum.h|   2 +-
 arch/powerpc/include/asm/cpuidle.h |   4 +-
 arch/powerpc/include/asm/elf.h |   4 +-
 arch/powerpc/include/asm/nohash/pgtable.h  |   2 +-
 arch/powerpc/include/asm/ppc-opcode.h  |   7 +
 arch/powerpc/include/asm/prom.h|  18 +-
 arch/powerpc/kernel/idle_book3s.S  |  10 +-
 arch/powerpc/kernel/prom_init.c| 120 ++-
 arch/powerpc/kernel/setup_64.c |   5 +-
 arch/powerpc/lib/Makefile  |   1 +
 arch/powerpc/lib/sstep.c   |  20 --
 arch/powerpc/lib/test_emulate_step.c   | 434 +
 arch/powerpc/mm/init_64.c  |  36 +-
 arch/powerpc/mm/pgtable-radix.c|   4 +
 arch/powerpc/platforms/powernv/opal-wrappers.S |   4 +-
 arch/powerpc/sysdev/xics/icp-opal.c|  10 +
 arch/powerpc/sysdev/xics/xics-common.c |  17 +-
 tools/testing/selftests/powerpc/harness.c  |   6 +-
 20 files changed, 729 insertions(+), 124 deletions(-)
 create mode 100644 arch/powerpc/lib/test_emulate_step.c




Re: open list?

2017-03-07 Thread Christophe LEROY



On 07/03/2017 at 11:02, Tobin C. Harding wrote:

scripts/get_maintainers.pl says this is an open list;

linuxppc-dev@lists.ozlabs.org (open list:LINUX FOR POWERPC (32-BIT AND
64-BIT))

Patches I've sent with this list cc'd have not been getting through. I
resent one to check if it was a user error at my end.

This email will obviously serve as another test.

Is there something I am doing wrong?


I got both the one you sent yesterday and today's resend.

Both can be seen at 
https://patchwork.ozlabs.org/project/linuxppc-dev/list/?submitter=68963


Christophe



thanks,
Tobin.




[PATCH][Linux] powerpc/64s: cpufeatures: add initial implementation for cpufeatures

2017-03-07 Thread Nicholas Piggin
The /cpus/features dt binding describes architected CPU features along
with some compatibility, privilege, and enablement properties that allow
flexibility with discovering and enabling capabilities.

Presence of this feature implies a base level of functionality, then
additional feature nodes advertise the presence of new features.

A given feature and its setup procedure is defined once and used by all
CPUs which are compatible by that feature. Features that follow a
supported "prescription" can be enabled by a hypervisor or OS that
does not understand them natively.

Using this, we can do CPU setup without matching CPU tables on PVR, and
mambo can boot in POWER8 or POWER9 mode. Modulo allowances to support
MCE and PMU.

I'm looking at cpu features and hfscr/fscr/lpcr/msr etc bits before and
after the patch to make sure we're doing things properly. There are still
a few small differences:

POWER8
- CPU_FTR_NODSISRALIGN is now clear. DSISR is set on alignment interrupts.
- HFSCR bit 54 and 57 are now clear. This appears to be a mambo issue.

POWER9
- VRMASD is clear from LPCR. This is not supported in ISA3 mode.
- Privileged Doorbell Exit Enable is set in LPCR.

Before asking to merge this patch I'll send patches to existing cputable
code to change those and update this patch if they are rejected, so the
differences are squashed before merging.

Thanks,
Nick
---
 .../devicetree/bindings/powerpc/cpufeatures.txt| 263 ++
 arch/powerpc/include/asm/cpu_has_feature.h |   4 +-
 arch/powerpc/include/asm/cpufeatures.h |  54 ++
 arch/powerpc/include/asm/cputable.h|   1 +
 arch/powerpc/include/asm/reg.h |   1 +
 arch/powerpc/kernel/Makefile   |   1 +
 arch/powerpc/kernel/cpufeatures.c  | 565 +
 arch/powerpc/kernel/cputable.c |  14 +-
 arch/powerpc/kernel/prom.c | 201 +++-
 arch/powerpc/kernel/setup-common.c |   2 +-
 arch/powerpc/kernel/setup_64.c |  21 +-
 drivers/of/fdt.c   |  31 ++
 include/linux/of_fdt.h |   5 +
 13 files changed, 1144 insertions(+), 19 deletions(-)
 create mode 100644 Documentation/devicetree/bindings/powerpc/cpufeatures.txt
 create mode 100644 arch/powerpc/include/asm/cpufeatures.h
 create mode 100644 arch/powerpc/kernel/cpufeatures.c

diff --git a/Documentation/devicetree/bindings/powerpc/cpufeatures.txt b/Documentation/devicetree/bindings/powerpc/cpufeatures.txt
new file mode 100644
index ..4bb6d92a856e
--- /dev/null
+++ b/Documentation/devicetree/bindings/powerpc/cpufeatures.txt
@@ -0,0 +1,263 @@
+powerpc cpu features binding
+============================
+
+The device tree describes supported CPU features as nodes containing
+compatibility and enablement information as properties.
+
+The binding specifies features common to all CPUs in the system.
+Heterogeneous CPU features are not supported at present (such could be added
+by providing nodes with additional features and linking those to particular
+CPUs).
+
+This binding is intended to provide fine grained control of CPU features at
+all levels of the stack (firmware, hypervisor, OS, userspace), with the
+ability for new CPU features to be used by some components without all
+components being upgraded (e.g., a new floating point instruction could be
+used by userspace math library without upgrading kernel and hypervisor).
+
+The binding is passed to the hypervisor by firmware. The hypervisor must
+remove any features that require hypervisor enablement but that it does not
+enable. It must remove any features that depend on removed features. It may
+pass remaining features usable to the OS and PR to guests, depending on
+configuration policy (not specified here).
+
+The modified binding is passed to the guest by hypervisor, with HV bit
+cleared from the usable-mask and the hv-support and hfscr-bit properties
+removed. The guest must similarly remove features that require OS enablement
+that it does not enable. The OS may pass PR usable features to userspace via
+ELF AUX vectors AT_HWCAP, AT_HWCAP2, AT_HWCAP3, etc., or use some other
+method (outside the scope of this specification).
+
+The binding will specify a "base" level of features that will be present
+when the cpu features binding exists. Additional features will be explicitly
+specified.
+
+/cpus/features node binding
+---------------------------
+
+Node: features
+
+Description: Container of CPU feature nodes.
+
+The node name must be "features" and it must be a child of the node "/cpus".
+
+The node is optional but should be provided by new firmware.
+
+Each child node of cpufeatures represents an architected CPU feature (e.g.,
+a new set of vector instructions) or an important CPU performance
+characteristic (e.g., fast unaligned memory operations). The specification
+of each feature (instructions, 

[PATCH][OPAL] cpufeatures: add base and POWER8, POWER9 /cpus/features dt

2017-03-07 Thread Nicholas Piggin
With this patch and the Linux one, I can boot (in mambo) a POWER8 or
POWER9 without looking up any cpu tables, and mainly looking at PVR
for MCE and PMU.

Machine and ISA specific features that are not abstracted by firmware
and not captured here will have to be handled on a case by case basis,
using PVR if necessary. Major ones that remain are PMU and machine check.

Open question is where and how to develop and document these features?
Not the dt representation created by skiboot, but the exact nature of
each feature. What exact behaviour does a particular feature imply, etc.

Thanks,
Nick
---
 core/device.c  |   7 +
 core/init.c|   3 +
 hdata/cpu-common.c | 602 +
 include/device.h   |   1 +
 4 files changed, 613 insertions(+)

diff --git a/core/device.c b/core/device.c
index 30b31f46..1900ba71 100644
--- a/core/device.c
+++ b/core/device.c
@@ -548,6 +548,13 @@ u32 dt_property_get_cell(const struct dt_property *prop, u32 index)
return fdt32_to_cpu(((const u32 *)prop->prop)[index]);
 }
 
+void dt_property_set_cell(struct dt_property *prop, u32 index, u32 val)
+{
+   assert(prop->len >= (index+1)*sizeof(u32));
+   /* Always aligned, so this works. */
+   ((u32 *)prop->prop)[index] = cpu_to_fdt32(val);
+}
+
 /* First child of this node. */
 struct dt_node *dt_first(const struct dt_node *root)
 {
diff --git a/core/init.c b/core/init.c
index 58f96f47..938920eb 100644
--- a/core/init.c
+++ b/core/init.c
@@ -703,6 +703,8 @@ static void per_thread_sanity_checks(void)
 /* Called from head.S, thus no prototype. */
 void main_cpu_entry(const void *fdt);
 
+extern void mambo_add_cpu_features(struct dt_node *root);
+
 void __noreturn __nomcount main_cpu_entry(const void *fdt)
 {
/*
@@ -774,6 +776,7 @@ void __noreturn __nomcount main_cpu_entry(const void *fdt)
abort();
} else {
dt_expand(fdt);
+   mambo_add_cpu_features(dt_root);
}
 
/* Now that we have a full devicetree, verify that we aren't on fire. */
diff --git a/hdata/cpu-common.c b/hdata/cpu-common.c
index aa2752c1..1da1b1cb 100644
--- a/hdata/cpu-common.c
+++ b/hdata/cpu-common.c
@@ -21,6 +21,599 @@
 
 #include "hdata.h"
 
+/* Table to set up the /cpus/features dt */
+#define USABLE_PR  (1U << 0)
+#define USABLE_OS  (1U << 1)
+#define USABLE_HV  (1U << 2)
+
+#define HV_SUPPORT_NONE0
+#define HV_SUPPORT_CUSTOM  1
+#define HV_SUPPORT_HFSCR   2
+
+#define OS_SUPPORT_NONE0
+#define OS_SUPPORT_CUSTOM  1
+#define OS_SUPPORT_FSCR2
+
+#define CPU_P8 0x1
+#define CPU_P9_DD1 0x2
+#define CPU_P9_DD2 0x4
+
+#define CPU_P9 (CPU_P9_DD1|CPU_P9_DD2)
+#define CPU_ALL(CPU_P8|CPU_P9)
+
+#define ISA_BASE   0
+#define ISA_V3 3000
+
+struct cpu_feature {
+   const char *name;
+   uint32_t cpus_supported;
+   uint32_t isa;
+   uint32_t usable_mask;
+   uint32_t hv_support;
+   uint32_t os_support;
+   uint32_t hfscr_bit_nr;
+   uint32_t fscr_bit_nr;
+   uint32_t hwcap_bit_nr;
+   const char *dependencies_names; /* space-delimited names */
+};
+
+/*
+ * The base (or NULL) cpu feature set is the CPU features available
+ * when no child nodes of the /cpus/features node exist. The base feature
+ * set is POWER8 (ISA v2.07), less features that are listed explicitly.
+ *
+ * There will be a /cpus/features/isa property that specifies the currently
+ * active ISA level. Those architected features without explicit nodes
+ * will match the current ISA level. A greater ISA level will imply some
+ * features are phased out.
+ */
+static const struct cpu_feature cpu_features_table[] = {
+   { "big-endian",
+   CPU_ALL,
+   ISA_BASE, USABLE_HV|USABLE_OS|USABLE_PR,
+   HV_SUPPORT_CUSTOM, OS_SUPPORT_CUSTOM,
+   -1, -1, -1,
+   NULL, },
+
+   { "little-endian",
+   CPU_ALL,
+   ISA_BASE, USABLE_HV|USABLE_OS|USABLE_PR,
+   HV_SUPPORT_CUSTOM, OS_SUPPORT_CUSTOM,
+   -1, -1, -1,
+   NULL, },
+
+   /* MSR_HV mode */
+   { "hypervisor",
+   CPU_ALL,
+   ISA_BASE, USABLE_HV,
+   HV_SUPPORT_CUSTOM, OS_SUPPORT_NONE,
+   -1, -1, -1,
+   NULL, },
+
+   { "smt",
+   CPU_ALL,
+   ISA_BASE, USABLE_HV|USABLE_OS|USABLE_PR,
+   HV_SUPPORT_CUSTOM, OS_SUPPORT_CUSTOM,
+   -1, -1, 17,
+   NULL, },
+
+   /* PPR */
+   { "program-priority-register",
+   CPU_ALL,
+   ISA_BASE, USABLE_HV|USABLE_OS|USABLE_PR,
+   HV_SUPPORT_NONE, OS_SUPPORT_NONE,
+   -1, -1, -1,
+   NULL, },
+
+   { "strong-access-ordering",
+   CPU_ALL & ~CPU_P9_DD1,
+   ISA_BASE, USABLE_HV|USABLE_OS|USABLE_PR,
+   HV_SUPPORT_CUSTOM, OS_SUPPORT_CUSTOM,
+   -1, -1, -1,
+   NULL, },
+
+   { 

[PATCH 0/2 v2] cpufeatures compatibility for OPAL and Linux

2017-03-07 Thread Nicholas Piggin
Hi,

This is a bit more complete implementation that also works on
POWER9. Any comments would be welcome.

Thanks,
Nick


Re: [PATCH kernel v6 10/10] KVM: PPC: VFIO: Add in-kernel acceleration for VFIO

2017-03-07 Thread Alexey Kardashevskiy
On 06/03/17 16:04, Alexey Kardashevskiy wrote:
> On 06/03/17 15:30, David Gibson wrote:
>> On Fri, Mar 03, 2017 at 06:09:25PM +1100, Alexey Kardashevskiy wrote:
>>> On 03/03/17 16:59, David Gibson wrote:
 On Thu, Mar 02, 2017 at 07:56:44PM +1100, Alexey Kardashevskiy wrote:
> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
> and H_STUFF_TCE requests targeted at an IOMMU TCE table used for VFIO
> without passing them to user space which saves time on switching
> to user space and back.
>
> This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
> KVM tries to handle a TCE request in real mode; if that fails,
> it passes the request to virtual mode to complete the operation.
> If the virtual mode handler also fails, the request is passed to
> the user space; this is not expected to happen though.
>
> To avoid dealing with page use counters (which is tricky in real mode),
> this only accelerates SPAPR TCE IOMMU v2 clients which are required
> to pre-register the userspace memory. The very first TCE request will
> be handled in the VFIO SPAPR TCE driver anyway as the userspace view
> of the TCE table (iommu_table::it_userspace) is not allocated till
> the very first mapping happens and we cannot call vmalloc in real mode.
>
> If we fail to update a hardware IOMMU table for an unexpected reason, we just
> clear it and move on as there is nothing really we can do about it -
> for example, if we hot plug a VFIO device to a guest, existing TCE tables
> will be mirrored automatically to the hardware and there is no interface
> to report to the guest about possible failures.
>
> This adds new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to
> the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd
> and associates a physical IOMMU table with the SPAPR TCE table (which
> is a guest view of the hardware IOMMU table). The iommu_table object
> is cached and referenced so we do not have to look it up in real mode.
>
> This does not implement the UNSET counterpart as there is no use for it -
> once the acceleration is enabled, the existing userspace won't
> disable it unless a VFIO container is destroyed; this adds necessary
> cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.
>
> As this creates a descriptor per IOMMU table-LIOBN couple (called
> kvmppc_spapr_tce_iommu_table), it is possible to have several
> descriptors with the same iommu_table (hardware IOMMU table) attached
> to the same LIOBN; we do not remove duplicates though as
> iommu_table_ops::exchange does not just update a TCE entry (which is
> shared among IOMMU groups) but also invalidates the TCE cache
> (one per IOMMU group).
>
> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
> space.
>
> This finally makes use of vfio_external_user_iommu_id() which was
> introduced quite some time ago and was considered for removal.
>
> Tests show that this patch increases transmission speed from 220MB/s
> to 750..1020MB/s on 10Gb network (Chelsio CXGB3 10Gb ethernet card).
>
> Signed-off-by: Alexey Kardashevskiy 
> ---
> Changes:
> v6:
> * changed handling of errors returned by kvmppc_(rm_)tce_iommu_(un)map()
> * moved kvmppc_gpa_to_ua() to TCE validation
>
> v5:
> * changed error codes in multiple places
> * added bunch of WARN_ON() in places which should not really happen
> * added a check that an iommu table is not attached already to LIOBN
> * dropped explicit calls to iommu_tce_clear_param_check/
> iommu_tce_put_param_check as kvmppc_tce_validate/kvmppc_ioba_validate
> call them anyway (since the previous patch)
> * if we fail to update a hardware IOMMU table for an unexpected reason,
> this just clears the entry
>
> v4:
> * added note to the commit log about allowing multiple updates of
> the same IOMMU table;
> * instead of checking for if any memory was preregistered, this
> returns H_TOO_HARD if a specific page was not;
> * fixed comments from v3 about error handling in many places;
> * simplified TCE handlers and merged IOMMU parts inline - for example,
> there used to be kvmppc_h_put_tce_iommu(), now it is merged into
> kvmppc_h_put_tce(); this allows checking IOBA boundaries against
> the first attached table only (makes the code simpler);
>
> v3:
> * simplified not to use VFIO group notifiers
> * reworked cleanup, should be cleaner/simpler now
>
> v2:
> * reworked to use new VFIO notifiers
> * now same iommu_table may appear in the list several times, to be fixed later
> ---
>  Documentation/virtual/kvm/devices/vfio.txt |  22 +-
>  arch/powerpc/include/asm/kvm_host.h|   8 +
>  arch/powerpc/include/asm/kvm_ppc.h  

[RESEND PATCH 1/6] trace/kprobes: fix check for kretprobe offset within function entry

2017-03-07 Thread Naveen N. Rao
perf specifies an offset from _text, and since this offset is fed
directly into the arch-specific helper, the kprobes tracer rejects
installation of kretprobes through perf. Fix this by looking up the
actual offset from a function for the specified sym+offset.

Refactor and reuse existing routines to limit code duplication -- we
repurpose kprobe_addr() for determining final kprobe address and we
split out the function entry offset determination into a separate
generic helper.

Before patch:

  naveen@ubuntu:~/linux/tools/perf$ sudo ./perf probe -v do_open%return
  probe-definition(0): do_open%return
  symbol:do_open file:(null) line:0 offset:0 return:1 lazy:(null)
  0 arguments
  Looking at the vmlinux_path (8 entries long)
  Using /boot/vmlinux for symbols
  Open Debuginfo file: /boot/vmlinux
  Try to find probe point from debuginfo.
  Matched function: do_open [2d0c7ff]
  Probe point found: do_open+0
  Matched function: do_open [35d76dc]
  found inline addr: 0xc04ba9c4
  Failed to find "do_open%return",
   because do_open is an inlined function and has no return point.
  An error occurred in debuginfo analysis (-22).
  Trying to use symbols.
  Opening /sys/kernel/debug/tracing//README write=0
  Opening /sys/kernel/debug/tracing//kprobe_events write=1
  Writing event: r:probe/do_open _text+4469776
  Failed to write event: Invalid argument
Error: Failed to add events. Reason: Invalid argument (Code: -22)
  naveen@ubuntu:~/linux/tools/perf$ dmesg | tail
  
  [   33.568656] Given offset is not valid for return probe.

After patch:

  naveen@ubuntu:~/linux/tools/perf$ sudo ./perf probe -v do_open%return
  probe-definition(0): do_open%return
  symbol:do_open file:(null) line:0 offset:0 return:1 lazy:(null)
  0 arguments
  Looking at the vmlinux_path (8 entries long)
  Using /boot/vmlinux for symbols
  Open Debuginfo file: /boot/vmlinux
  Try to find probe point from debuginfo.
  Matched function: do_open [2d0c7d6]
  Probe point found: do_open+0
  Matched function: do_open [35d76b3]
  found inline addr: 0xc04ba9e4
  Failed to find "do_open%return",
   because do_open is an inlined function and has no return point.
  An error occurred in debuginfo analysis (-22).
  Trying to use symbols.
  Opening /sys/kernel/debug/tracing//README write=0
  Opening /sys/kernel/debug/tracing//kprobe_events write=1
  Writing event: r:probe/do_open _text+4469808
  Writing event: r:probe/do_open_1 _text+4956344
  Added new events:
probe:do_open(on do_open%return)
probe:do_open_1  (on do_open%return)

  You can now use it in all perf tools, such as:

  perf record -e probe:do_open_1 -aR sleep 1

  naveen@ubuntu:~/linux/tools/perf$ sudo cat /sys/kernel/debug/kprobes/list
  c0041370  k  kretprobe_trampoline+0x0[OPTIMIZED]
  c04ba0b8  r  do_open+0x8[DISABLED]
  c0443430  r  do_open+0x0[DISABLED]

Acked-by: Masami Hiramatsu 
Signed-off-by: Naveen N. Rao 
---
 include/linux/kprobes.h |  1 +
 kernel/kprobes.c| 40 ++--
 kernel/trace/trace_kprobe.c |  2 +-
 3 files changed, 28 insertions(+), 15 deletions(-)

diff --git a/include/linux/kprobes.h b/include/linux/kprobes.h
index 177bdf6c6aeb..47e4da5b4fa2 100644
--- a/include/linux/kprobes.h
+++ b/include/linux/kprobes.h
@@ -268,6 +268,7 @@ extern void show_registers(struct pt_regs *regs);
 extern void kprobes_inc_nmissed_count(struct kprobe *p);
 extern bool arch_within_kprobe_blacklist(unsigned long addr);
 extern bool arch_function_offset_within_entry(unsigned long offset);
+extern bool function_offset_within_entry(kprobe_opcode_t *addr, const char *sym, unsigned long offset);
 
 extern bool within_kprobe_blacklist(unsigned long addr);
 
diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index 448759d4a263..32e6ac5131ed 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -1391,21 +1391,19 @@ bool within_kprobe_blacklist(unsigned long addr)
  * This returns encoded errors if it fails to look up symbol or invalid
  * combination of parameters.
  */
-static kprobe_opcode_t *kprobe_addr(struct kprobe *p)
+static kprobe_opcode_t *_kprobe_addr(kprobe_opcode_t *addr,
+   const char *symbol_name, unsigned int offset)
 {
-   kprobe_opcode_t *addr = p->addr;
-
-   if ((p->symbol_name && p->addr) ||
-   (!p->symbol_name && !p->addr))
+   if ((symbol_name && addr) || (!symbol_name && !addr))
goto invalid;
 
-   if (p->symbol_name) {
-   kprobe_lookup_name(p->symbol_name, addr);
+   if (symbol_name) {
+   kprobe_lookup_name(symbol_name, addr);
if (!addr)
return ERR_PTR(-ENOENT);
}
 
-   addr = (kprobe_opcode_t *)(((char *)addr) + p->offset);
+   addr = (kprobe_opcode_t *)(((char *)addr) + offset);
if (addr)
return addr;
 
@@ -1413,6 +1411,11 @@ static 

[RESEND PATCH 3/6] perf: probe: factor out the ftrace README scanning

2017-03-07 Thread Naveen N. Rao
Simplify and separate out the ftrace README scanning logic into a
separate helper. This is used subsequently to scan for all patterns of
interest and to cache the result.

Since we are only interested in the availability of probe argument type x,
we will only scan for that.

Acked-by: Masami Hiramatsu 
Signed-off-by: Naveen N. Rao 
---
 tools/perf/util/probe-file.c | 70 +++-
 1 file changed, 37 insertions(+), 33 deletions(-)

diff --git a/tools/perf/util/probe-file.c b/tools/perf/util/probe-file.c
index 1a62daceb028..8a219cd831b7 100644
--- a/tools/perf/util/probe-file.c
+++ b/tools/perf/util/probe-file.c
@@ -877,35 +877,31 @@ int probe_cache__show_all_caches(struct strfilter *filter)
return 0;
 }
 
+enum ftrace_readme {
+   FTRACE_README_PROBE_TYPE_X = 0,
+   FTRACE_README_END,
+};
+
 static struct {
const char *pattern;
-   boolavail;
-   boolchecked;
-} probe_type_table[] = {
-#define DEFINE_TYPE(idx, pat, def_avail)   \
-   [idx] = {.pattern = pat, .avail = (def_avail)}
-   DEFINE_TYPE(PROBE_TYPE_U, "* u8/16/32/64,*", true),
-   DEFINE_TYPE(PROBE_TYPE_S, "* s8/16/32/64,*", true),
-   DEFINE_TYPE(PROBE_TYPE_X, "* x8/16/32/64,*", false),
-   DEFINE_TYPE(PROBE_TYPE_STRING, "* string,*", true),
-   DEFINE_TYPE(PROBE_TYPE_BITFIELD,
-   "* b<bit-width>@<bit-offset>/<container-size>", true),
+   bool avail;
+} ftrace_readme_table[] = {
+#define DEFINE_TYPE(idx, pat)  \
+   [idx] = {.pattern = pat, .avail = false}
+   DEFINE_TYPE(FTRACE_README_PROBE_TYPE_X, "*type: * x8/16/32/64,*"),
 };
 
-bool probe_type_is_available(enum probe_type type)
+static bool scan_ftrace_readme(enum ftrace_readme type)
 {
+   int fd;
FILE *fp;
char *buf = NULL;
size_t len = 0;
-   bool target_line = false;
-   bool ret = probe_type_table[type].avail;
-   int fd;
+   bool ret = false;
+   static bool scanned = false;
 
-   if (type >= PROBE_TYPE_END)
-   return false;
-   /* We don't have to check the type which supported by default */
-   if (ret || probe_type_table[type].checked)
-   return ret;
+   if (scanned)
+   goto result;
 
fd = open_trace_file("README", false);
if (fd < 0)
@@ -917,21 +913,29 @@ bool probe_type_is_available(enum probe_type type)
return ret;
}
 
-   while (getline(&buf, &len, fp) > 0 && !ret) {
-   if (!target_line) {
-   target_line = !!strstr(buf, " type: ");
-   if (!target_line)
-   continue;
-   } else if (strstr(buf, "\t  ") != buf)
-   break;
-   ret = strglobmatch(buf, probe_type_table[type].pattern);
-   }
-   /* Cache the result */
-   probe_type_table[type].checked = true;
-   probe_type_table[type].avail = ret;
+   while (getline(&buf, &len, fp) > 0)
+   for (enum ftrace_readme i = 0; i < FTRACE_README_END; i++)
+   if (!ftrace_readme_table[i].avail)
+   ftrace_readme_table[i].avail =
+   strglobmatch(buf, ftrace_readme_table[i].pattern);
+   scanned = true;
 
fclose(fp);
free(buf);
 
-   return ret;
+result:
+   if (type >= FTRACE_README_END)
+   return false;
+
+   return ftrace_readme_table[type].avail;
+}
+
+bool probe_type_is_available(enum probe_type type)
+{
+   if (type >= PROBE_TYPE_END)
+   return false;
+   else if (type == PROBE_TYPE_X)
+   return scan_ftrace_readme(FTRACE_README_PROBE_TYPE_X);
+
+   return true;
 }
-- 
2.11.1



[RESEND PATCH 2/6] powerpc: kretprobes: override default function entry offset

2017-03-07 Thread Naveen N. Rao
With ABIv2, we offset 8 bytes into a function to get at the local entry
point.

Acked-by: Ananth N Mavinakayanahalli 
Acked-by: Michael Ellerman 
Signed-off-by: Naveen N. Rao 
---
 arch/powerpc/kernel/kprobes.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c
index fce05a38851c..331751701fed 100644
--- a/arch/powerpc/kernel/kprobes.c
+++ b/arch/powerpc/kernel/kprobes.c
@@ -131,6 +131,15 @@ static void __kprobes set_current_kprobe(struct kprobe *p, struct pt_regs *regs,
kcb->kprobe_saved_msr = regs->msr;
 }
 
+bool arch_function_offset_within_entry(unsigned long offset)
+{
+#ifdef PPC64_ELF_ABI_v2
+   return offset <= 8;
+#else
+   return !offset;
+#endif
+}
+
 void __kprobes arch_prepare_kretprobe(struct kretprobe_instance *ri,
  struct pt_regs *regs)
 {
-- 
2.11.1



[PATCH 5/6] perf: probes: move ftrace README parsing logic into trace-event-parse.c

2017-03-07 Thread Naveen N. Rao
probe-file.c needs libelf, but scanning ftrace README does not require
that. As such, move the ftrace README scanning logic out of probe-file.c
and into trace-event-parse.c.

Signed-off-by: Naveen N. Rao 
---
 tools/perf/util/probe-file.c| 87 +++-
 tools/perf/util/probe-file.h|  2 -
 tools/perf/util/trace-event-parse.c | 89 +
 tools/perf/util/trace-event.h   |  4 ++
 4 files changed, 99 insertions(+), 83 deletions(-)

diff --git a/tools/perf/util/probe-file.c b/tools/perf/util/probe-file.c
index 1542cd0d6799..ff872fa30cdb 100644
--- a/tools/perf/util/probe-file.c
+++ b/tools/perf/util/probe-file.c
@@ -26,6 +26,7 @@
 #include 
 #include "probe-event.h"
 #include "probe-file.h"
+#include "trace-event.h"
 #include "session.h"
 
 #define MAX_CMDLEN 256
@@ -70,33 +71,17 @@ static void print_both_open_warning(int kerr, int uerr)
}
 }
 
-int open_trace_file(const char *trace_file, bool readwrite)
-{
-   char buf[PATH_MAX];
-   int ret;
-
-   ret = e_snprintf(buf, PATH_MAX, "%s/%s",
-tracing_path, trace_file);
-   if (ret >= 0) {
-   pr_debug("Opening %s write=%d\n", buf, readwrite);
-   if (readwrite && !probe_event_dry_run)
-   ret = open(buf, O_RDWR | O_APPEND, 0);
-   else
-   ret = open(buf, O_RDONLY, 0);
-
-   if (ret < 0)
-   ret = -errno;
-   }
-   return ret;
-}
-
 static int open_kprobe_events(bool readwrite)
 {
+   if (probe_event_dry_run)
+   readwrite = false;
return open_trace_file("kprobe_events", readwrite);
 }
 
 static int open_uprobe_events(bool readwrite)
 {
+   if (probe_event_dry_run)
+   readwrite = false;
return open_trace_file("uprobe_events", readwrite);
 }
 
@@ -877,72 +862,12 @@ int probe_cache__show_all_caches(struct strfilter *filter)
return 0;
 }
 
-enum ftrace_readme {
-   FTRACE_README_PROBE_TYPE_X = 0,
-   FTRACE_README_KRETPROBE_OFFSET,
-   FTRACE_README_END,
-};
-
-static struct {
-   const char *pattern;
-   bool avail;
-} ftrace_readme_table[] = {
-#define DEFINE_TYPE(idx, pat)  \
-   [idx] = {.pattern = pat, .avail = false}
-   DEFINE_TYPE(FTRACE_README_PROBE_TYPE_X, "*type: * x8/16/32/64,*"),
-   DEFINE_TYPE(FTRACE_README_KRETPROBE_OFFSET, "*place (kretprobe): *"),
-};
-
-static bool scan_ftrace_readme(enum ftrace_readme type)
-{
-   int fd;
-   FILE *fp;
-   char *buf = NULL;
-   size_t len = 0;
-   bool ret = false;
-   static bool scanned = false;
-
-   if (scanned)
-   goto result;
-
-   fd = open_trace_file("README", false);
-   if (fd < 0)
-   return ret;
-
-   fp = fdopen(fd, "r");
-   if (!fp) {
-   close(fd);
-   return ret;
-   }
-
-   while (getline(&buf, &len, fp) > 0)
-   for (enum ftrace_readme i = 0; i < FTRACE_README_END; i++)
-   if (!ftrace_readme_table[i].avail)
-   ftrace_readme_table[i].avail =
-   strglobmatch(buf, ftrace_readme_table[i].pattern);
-   scanned = true;
-
-   fclose(fp);
-   free(buf);
-
-result:
-   if (type >= FTRACE_README_END)
-   return false;
-
-   return ftrace_readme_table[type].avail;
-}
-
 bool probe_type_is_available(enum probe_type type)
 {
if (type >= PROBE_TYPE_END)
return false;
else if (type == PROBE_TYPE_X)
-   return scan_ftrace_readme(FTRACE_README_PROBE_TYPE_X);
+   return probe_type_x_is_supported();
 
return true;
 }
-
-bool kretprobe_offset_is_supported(void)
-{
-   return scan_ftrace_readme(FTRACE_README_KRETPROBE_OFFSET);
-}
diff --git a/tools/perf/util/probe-file.h b/tools/perf/util/probe-file.h
index dbf95a00864a..eba44c3e9dca 100644
--- a/tools/perf/util/probe-file.h
+++ b/tools/perf/util/probe-file.h
@@ -35,7 +35,6 @@ enum probe_type {
 
 /* probe-file.c depends on libelf */
 #ifdef HAVE_LIBELF_SUPPORT
-int open_trace_file(const char *trace_file, bool readwrite);
 int probe_file__open(int flag);
 int probe_file__open_both(int *kfd, int *ufd, int flag);
 struct strlist *probe_file__get_namelist(int fd);
@@ -65,7 +64,6 @@ struct probe_cache_entry *probe_cache__find_by_name(struct probe_cache *pcache,
const char *group, const char *event);
 int probe_cache__show_all_caches(struct strfilter *filter);
 bool probe_type_is_available(enum probe_type type);
-bool kretprobe_offset_is_supported(void);
 #else  /* ! HAVE_LIBELF_SUPPORT */
static inline struct probe_cache *probe_cache__new(const char *tgt __maybe_unused)
 {
diff --git a/tools/perf/util/trace-event-parse.c b/tools/perf/util/trace-event-parse.c
index 

[RESEND PATCH 6/6] perf: powerpc: choose local entry point with kretprobes

2017-03-07 Thread Naveen N. Rao
perf now uses an offset from _text/_stext for kretprobes if the kernel
supports it, rather than the actual function name. As such, let's choose
the LEP for powerpc ABIv2 so as to ensure the probe gets hit. Do it only
if the kernel supports specifying offsets with kretprobes.

Acked-by: Masami Hiramatsu 
Signed-off-by: Naveen N. Rao 
---
 tools/perf/arch/powerpc/util/sym-handling.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/tools/perf/arch/powerpc/util/sym-handling.c b/tools/perf/arch/powerpc/util/sym-handling.c
index 1030a6e504bb..e93b3db25012 100644
--- a/tools/perf/arch/powerpc/util/sym-handling.c
+++ b/tools/perf/arch/powerpc/util/sym-handling.c
@@ -10,6 +10,7 @@
 #include "symbol.h"
 #include "map.h"
 #include "probe-event.h"
+#include "trace-event.h"
 
 #ifdef HAVE_LIBELF_SUPPORT
 bool elf__needs_adjust_symbols(GElf_Ehdr ehdr)
@@ -79,11 +80,12 @@ void arch__fix_tev_from_maps(struct perf_probe_event *pev,
 * However, if the user specifies an offset, we fall back to using the
 * GEP since all userspace applications (objdump/readelf) show function
 * disassembly with offsets from the GEP.
-*
-* In addition, we shouldn't specify an offset for kretprobes.
 */
-   if (pev->point.offset || (!pev->uprobes && pev->point.retprobe) ||
-   !map || !sym)
+   if (pev->point.offset || !map || !sym)
+   return;
+
+   /* For kretprobes, add an offset only if the kernel supports it */
+   if (!pev->uprobes && pev->point.retprobe && !kretprobe_offset_is_supported())
return;
 
lep_offset = PPC64_LOCAL_ENTRY_OFFSET(sym->arch_sym);
-- 
2.11.1



[RESEND PATCH 4/6] perf: kretprobes: offset from reloc_sym if kernel supports it

2017-03-07 Thread Naveen N. Rao
We indicate support for accepting sym+offset with kretprobes through a
line in ftrace README. Parse the same to identify support and choose the
appropriate format for kprobe_events.

As an example, without this perf patch, but with the ftrace changes:

  naveen@ubuntu:~/linux/tools/perf$ sudo cat /sys/kernel/debug/tracing/README | grep kretprobe
  place (kretprobe): [<module>:]<symbol>[+<offset>]|<memaddr>
  naveen@ubuntu:~/linux/tools/perf$
  naveen@ubuntu:~/linux/tools/perf$ sudo ./perf probe -v do_open%return
  probe-definition(0): do_open%return
  symbol:do_open file:(null) line:0 offset:0 return:1 lazy:(null)
  0 arguments
  Looking at the vmlinux_path (8 entries long)
  Using /boot/vmlinux for symbols
  Open Debuginfo file: /boot/vmlinux
  Try to find probe point from debuginfo.
  Matched function: do_open [2d0c7d8]
  Probe point found: do_open+0
  Matched function: do_open [35d76b5]
  found inline addr: 0xc04ba984
  Failed to find "do_open%return",
   because do_open is an inlined function and has no return point.
  An error occurred in debuginfo analysis (-22).
  Trying to use symbols.
  Opening /sys/kernel/debug/tracing//kprobe_events write=1
  Writing event: r:probe/do_open do_open+0
  Writing event: r:probe/do_open_1 do_open+0
  Added new events:
probe:do_open(on do_open%return)
probe:do_open_1  (on do_open%return)

  You can now use it in all perf tools, such as:

  perf record -e probe:do_open_1 -aR sleep 1

  naveen@ubuntu:~/linux/tools/perf$ sudo cat /sys/kernel/debug/kprobes/list
  c0041370  k  kretprobe_trampoline+0x0[OPTIMIZED]
  c04433d0  r  do_open+0x0[DISABLED]
  c04433d0  r  do_open+0x0[DISABLED]

And after this patch (and the subsequent powerpc patch):

  naveen@ubuntu:~/linux/tools/perf$ sudo ./perf probe -v do_open%return
  probe-definition(0): do_open%return
  symbol:do_open file:(null) line:0 offset:0 return:1 lazy:(null)
  0 arguments
  Looking at the vmlinux_path (8 entries long)
  Using /boot/vmlinux for symbols
  Open Debuginfo file: /boot/vmlinux
  Try to find probe point from debuginfo.
  Matched function: do_open [2d0c7d8]
  Probe point found: do_open+0
  Matched function: do_open [35d76b5]
  found inline addr: 0xc04ba984
  Failed to find "do_open%return",
   because do_open is an inlined function and has no return point.
  An error occurred in debuginfo analysis (-22).
  Trying to use symbols.
  Opening /sys/kernel/debug/tracing//README write=0
  Opening /sys/kernel/debug/tracing//kprobe_events write=1
  Writing event: r:probe/do_open _text+4469712
  Writing event: r:probe/do_open_1 _text+4956248
  Added new events:
probe:do_open(on do_open%return)
probe:do_open_1  (on do_open%return)

  You can now use it in all perf tools, such as:

  perf record -e probe:do_open_1 -aR sleep 1

  naveen@ubuntu:~/linux/tools/perf$ sudo cat /sys/kernel/debug/kprobes/list
  c0041370  k  kretprobe_trampoline+0x0[OPTIMIZED]
  c04433d0  r  do_open+0x0[DISABLED]
  c04ba058  r  do_open+0x8[DISABLED]

Acked-by: Masami Hiramatsu 
Signed-off-by: Naveen N. Rao 
---
 tools/perf/util/probe-event.c | 12 +---
 tools/perf/util/probe-file.c  |  7 +++
 tools/perf/util/probe-file.h  |  1 +
 3 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/tools/perf/util/probe-event.c b/tools/perf/util/probe-event.c
index 28fb62c32678..c9bdc9ded0c3 100644
--- a/tools/perf/util/probe-event.c
+++ b/tools/perf/util/probe-event.c
@@ -757,7 +757,9 @@ post_process_kernel_probe_trace_events(struct probe_trace_event *tevs,
}
 
for (i = 0; i < ntevs; i++) {
-   if (!tevs[i].point.address || tevs[i].point.retprobe)
+   if (!tevs[i].point.address)
+   continue;
+   if (tevs[i].point.retprobe && !kretprobe_offset_is_supported())
continue;
/* If we found a wrong one, mark it by NULL symbol */
if (kprobe_warn_out_range(tevs[i].point.symbol,
@@ -1528,11 +1530,6 @@ static int parse_perf_probe_point(char *arg, struct perf_probe_event *pev)
return -EINVAL;
}
 
-   if (pp->retprobe && !pp->function) {
-   semantic_error("Return probe requires an entry function.\n");
-   return -EINVAL;
-   }
-
if ((pp->offset || pp->line || pp->lazy_line) && pp->retprobe) {
semantic_error("Offset/Line/Lazy pattern can't be used with "
   "return probe.\n");
@@ -2841,7 +2838,8 @@ static int find_probe_trace_events_from_map(struct perf_probe_event *pev,
}
 
/* Note that the symbols in the kmodule are not relocated */
-   if (!pev->uprobes && !pp->retprobe && !pev->target) {
+   if (!pev->uprobes && !pev->target &&
+   (!pp->retprobe || kretprobe_offset_is_supported())) {

Re: [PATCH v4 2/3] perf: kretprobes: offset from reloc_sym if kernel

2017-03-07 Thread Naveen N. Rao
On 2017/03/06 10:06PM, Masami Hiramatsu wrote:
> On Mon,  6 Mar 2017 23:19:09 +0530
> "Naveen N. Rao"  wrote:
> 
> > Masami,
> > Your patch works, thanks! However, I felt we could refactor and reuse
> > some of the code across kprobes.c for this purpose. Can you please see
> > if the below patch is fine?
> 
> OK, looks good to me:)
> 
> Acked-by: Masami Hiramatsu 

Thanks for the review, Masami!
I ended up adding one more patch to this series (patch 5/6) to move the
ftrace README scanning out of probe-file.c, as it doesn't need libelf.
Patch 6 fails to build without libelf otherwise. Please take a look.

Arnaldo,
I am re-sending the remaining patches in this series which apply on top
of the 4 patches you sent to Ingo, so as to keep this simple. All the
patches have been acked, except the new patch 5/6. Kindly take a look.

Thanks,
Naveen

--
Naveen N. Rao (6):
  trace/kprobes: fix check for kretprobe offset within function entry
  powerpc: kretprobes: override default function entry offset
  perf: probe: factor out the ftrace README scanning
  perf: kretprobes: offset from reloc_sym if kernel supports it
  perf: probes: move ftrace README parsing logic into
trace-event-parse.c
  perf: powerpc: choose local entry point with kretprobes

 arch/powerpc/kernel/kprobes.c   |  9 +++
 include/linux/kprobes.h |  1 +
 kernel/kprobes.c| 40 -
 kernel/trace/trace_kprobe.c |  2 +-
 tools/perf/arch/powerpc/util/sym-handling.c | 10 ++--
 tools/perf/util/probe-event.c   | 12 ++--
 tools/perf/util/probe-file.c| 80 +++---
 tools/perf/util/probe-file.h|  1 -
 tools/perf/util/trace-event-parse.c | 89 +
 tools/perf/util/trace-event.h   |  4 ++
 10 files changed, 149 insertions(+), 99 deletions(-)

-- 
2.11.1



Re: [PATCH] selftests/powerpc: Replace stxvx and lxvx with their equivalent instruction

2017-03-07 Thread Balbir Singh
On 07-Mar-2017 11:43 AM, "Cyril Bur"  wrote:

On POWER8 (ISA 2.07) lxvx and stxvx are defined to be extended mnemonics
of lxvd2x and stxvd2x. For POWER9 (ISA 3.0) the HW architects in their
infinite wisdom made lxvx and stxvx instructions in their own right.

POWER9 aware GCC will use the POWER9 instruction for lxvx and stxvx
causing these selftests to fail on POWER8. Further compounding the
issue, because of the way -mvsx works it will cause the power9
instructions to be used regardless of -mcpu=power8 to GCC or -mpower8 to
AS.

The safest way to address the problem for now is to not use the extended
mnemonic. These tests only perform register comparisons, so the big-endian-only
byte ordering of stxvd2x and lxvd2x does not impact the test.

Signed-off-by: Cyril Bur 
---



Acked-by: Balbir Singh


Re: [PATCH v2 1/6] powerpc/perf: Define big-endian version of perf_mem_data_src

2017-03-07 Thread Peter Zijlstra
On Tue, Mar 07, 2017 at 03:28:17PM +0530, Madhavan Srinivasan wrote:
> 
> 
> On Monday 06 March 2017 04:52 PM, Peter Zijlstra wrote:
> >On Mon, Mar 06, 2017 at 04:13:08PM +0530, Madhavan Srinivasan wrote:
> >>From: Sukadev Bhattiprolu 
> >>
> >>perf_mem_data_src is a union that is initialized via the ->val field
> >>and accessed via the bitmap fields. For this to work on big endian
> >>platforms, we also need a big-endian representation of perf_mem_data_src.
> >Doesn't this break interpreting the data on a different endian machine?
> 
> IIUC, we need this patch precisely so as not to break interpreting the
> data on a different endian machine. The data below was collected from
> power8 LE/BE guests with this patchset applied. Kindly correct me if I
> missed your question here.

So your patch adds compile-time bitfield differences. My worry was that
there was no dynamic conversion routine in the tools (as there is for a
lot of other places).

This yields two questions:

 - are these two static layouts identical? (seeing that you illustrate
   cross-endian things working this seems likely).

 - should you not have fixed this in the tool only? This patch
   effectively breaks ABI on big-endian architectures.


open list?

2017-03-07 Thread Tobin C. Harding
scripts/get_maintainers.pl says this is an open list;

linuxppc-dev@lists.ozlabs.org (open list:LINUX FOR POWERPC (32-BIT AND
64-BIT))

Patches I've sent with this list cc'd have not been getting through. I
resent one to check if it was a user error at my end.

This email will obviously serve as another test.

Is there something I am doing wrong?

thanks,
Tobin.


Re: [PATCH v2 1/6] powerpc/perf: Define big-endian version of perf_mem_data_src

2017-03-07 Thread Madhavan Srinivasan



On Monday 06 March 2017 04:52 PM, Peter Zijlstra wrote:

On Mon, Mar 06, 2017 at 04:13:08PM +0530, Madhavan Srinivasan wrote:

From: Sukadev Bhattiprolu 

perf_mem_data_src is a union that is initialized via the ->val field
and accessed via the bitmap fields. For this to work on big endian
platforms, we also need a big-endian representation of perf_mem_data_src.

Doesn't this break interpreting the data on a different endian machine?


IIUC, we need this patch precisely so as not to break interpreting the
data on a different endian machine. The data below was collected from
power8 LE/BE guests with this patchset applied. Kindly correct me if I
missed your question here.


With this patchset applied, perf.data from a power8 BigEndian guest:
==

$ sudo ./perf record -d -e mem_access ls
  [ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 0.007 MB perf.data (8 samples) ]

$ sudo ./perf report --mem-mode --stdio
  # To display the perf.data header info, please use 
--header/--header-only options.

  #
  #
  # Total Lost Samples: 0
  #
  # Samples: 8  of event 'mem_access'
  # Total weight : 8
  # Sort order   : 
local_weight,mem,sym,dso,symbol_daddr,dso_daddr,snoop,tlb,locked

  #
  # Overhead  Local Weight  Memory access Symbol   
Shared Object Data Symbol Data 
Object Snoop TLB access  Locked
  #      
...   
.. 
..   ..  
..

  #
  25.00%  0 L2 hit[H] 
0xc000c910   [unknown] [H] 0xc00f170e5310 
[unknown]   N/A N/A No
  12.50%  0 L2 hit[k] 
.idle_cpu[kernel.vmlinux]  [k] __per_cpu_offset+0x68 
[kernel.vmlinux].data..read_mostly  N/A N/A No
  12.50%  0 L2 hit[H] 
0xc000ca58   [unknown] [H] 0xc00f170e5200 
[unknown]   N/A N/A No
  12.50%  0 L3 hit[k] 
.copypage_power7 [kernel.vmlinux]  [k] 0xc0002f6fc600 
[kernel.vmlinux].bssN/A N/A No
  12.50%  0 L3 hit[k] 
.copypage_power7 [kernel.vmlinux]  [k] 0xc0003f8b1980 
[kernel.vmlinux].bssN/A N/A No
  12.50%  0 Local RAM hit [k] 
._raw_spin_lock_irqsave  [kernel.vmlinux]  [k] 0xc00033b5bdf4 
[kernel.vmlinux].bssMiss N/A No
  12.50%  0 Remote Cache (1 hop) hit  [k] 
.perf_iterate_ctx[kernel.vmlinux]  [k] 0xc0e88648 
[kernel.vmlinux]HitM N/A No



perf report from power8 LittleEndian guest (with this patch applied to 
perf tool):

==

$ ./perf report --mem-mode --stdio -i perf.data.p8be.withpatch
  No kallsyms or vmlinux with build-id 
ca8a1a9d4b62b2a67ee01050afb1dfa03565a655 was found
  /boot/vmlinux with build id ca8a1a9d4b62b2a67ee01050afb1dfa03565a655 
not found, continuing without symbols
  No kallsyms or vmlinux with build-id 
ca8a1a9d4b62b2a67ee01050afb1dfa03565a655 was found
  /boot/vmlinux with build id ca8a1a9d4b62b2a67ee01050afb1dfa03565a655 
not found, continuing without symbols
  # To display the perf.data header info, please use 
--header/--header-only options.

  #
  #
  # Total Lost Samples: 0
  #
  # Samples: 8  of event 'mem_access'
  # Total weight : 8
  # Sort order   : 
local_weight,mem,sym,dso,symbol_daddr,dso_daddr,snoop,tlb,locked

  #
  # Overhead  Local Weight  Memory access Symbol  
Shared Object Data Symbol Data Object   Snoop TLB 
access  Locked
  #      
..    .. 
    ..  ..

  #
  25.00%  0 L2 hit[H] 
0xc000c910  [unknown] [H] 0xc00f170e5310 
[unknown] N/A   N/A No
  12.50%  0 L2 hit[k] 
0xc00f4d0c  [kernel.vmlinux]  [k] 0xc0f2dac8 
[kernel.vmlinux]  N/A   N/A No
  12.50%  0 L2 hit[H] 
0xc000ca58  [unknown] [H] 0xc00f170e5200 
[unknown] N/A   N/A No
  12.50%  0 L3 hit[k] 
0xc006b560  [kernel.vmlinux]  [k] 

[RESEND PATCH] powerpc/pseries: move struct hcall_stats to c file

2017-03-07 Thread Tobin C. Harding
struct hcall_stats is only used in hvCall_inst.c.

Move struct hcall_stats to hvCall_inst.c

Resolves: #54
Signed-off-by: Tobin C. Harding 
---

Is this correct, adding 'Resolves: #XX' when fixing
github.com/linuxppc/linux issues?

 arch/powerpc/include/asm/hvcall.h| 10 --
 arch/powerpc/platforms/pseries/hvCall_inst.c | 10 ++
 2 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/include/asm/hvcall.h b/arch/powerpc/include/asm/hvcall.h
index 77ff1ba..74599bd 100644
--- a/arch/powerpc/include/asm/hvcall.h
+++ b/arch/powerpc/include/asm/hvcall.h
@@ -364,16 +364,6 @@ long plpar_hcall_raw(unsigned long opcode, unsigned long *retbuf, ...);
 long plpar_hcall9(unsigned long opcode, unsigned long *retbuf, ...);
 long plpar_hcall9_raw(unsigned long opcode, unsigned long *retbuf, ...);
 
-/* For hcall instrumentation.  One structure per-hcall, per-CPU */
-struct hcall_stats {
-   unsigned long   num_calls;  /* number of calls (on this CPU) */
-   unsigned long   tb_total;   /* total wall time (mftb) of calls. */
-   unsigned long   purr_total; /* total cpu time (PURR) of calls. */
-   unsigned long   tb_start;
-   unsigned long   purr_start;
-};
-#define HCALL_STAT_ARRAY_SIZE  ((MAX_HCALL_OPCODE >> 2) + 1)
-
 struct hvcall_mpp_data {
unsigned long entitled_mem;
unsigned long mapped_mem;
diff --git a/arch/powerpc/platforms/pseries/hvCall_inst.c b/arch/powerpc/platforms/pseries/hvCall_inst.c
index f02ec3a..892db4f 100644
--- a/arch/powerpc/platforms/pseries/hvCall_inst.c
+++ b/arch/powerpc/platforms/pseries/hvCall_inst.c
@@ -29,6 +29,16 @@
 #include 
 #include 
 
+/* For hcall instrumentation.  One structure per-hcall, per-CPU */
+struct hcall_stats {
+   unsigned long   num_calls;  /* number of calls (on this CPU) */
+   unsigned long   tb_total;   /* total wall time (mftb) of calls. */
+   unsigned long   purr_total; /* total cpu time (PURR) of calls. */
+   unsigned long   tb_start;
+   unsigned long   purr_start;
+};
+#define HCALL_STAT_ARRAY_SIZE  ((MAX_HCALL_OPCODE >> 2) + 1)
+
 DEFINE_PER_CPU(struct hcall_stats[HCALL_STAT_ARRAY_SIZE], hcall_stats);
 
 /*
-- 
2.7.4



Re: [PATCH v5 00/15] livepatch: hybrid consistency model

2017-03-07 Thread Ingo Molnar

* Josh Poimboeuf  wrote:

>  arch/Kconfig |   6 +
>  arch/powerpc/include/asm/thread_info.h   |   4 +-
>  arch/powerpc/kernel/signal.c |   4 +
>  arch/s390/include/asm/thread_info.h  |  24 +-
>  arch/s390/kernel/entry.S |  31 +-
>  arch/x86/Kconfig |   1 +
>  arch/x86/entry/common.c  |   9 +-
>  arch/x86/include/asm/thread_info.h   |  13 +-
>  arch/x86/include/asm/unwind.h|   6 +
>  arch/x86/kernel/stacktrace.c |  96 +++-
>  arch/x86/kernel/unwind_frame.c   |   2 +

for the x86 and scheduler changes:

Acked-by: Ingo Molnar 

Thanks,

Ingo