date:20160803

This time with a PGP signature

-- 
Cheers,
Stephen Rothwell


pgpxkiMyeBowX.pgp
Description: OpenPGP digital signature

[PATCH] powerpc/eeh: Fix slot locations on NPU and legacy platforms

2016-08-03 Thread Russell Currey

The slot location code as part of EEH has never functioned perfectly on
every powerpc system.  The device node properties "ibm,slot-loc",
"ibm,slot-location-code" and "ibm,io-base-loc-code" have all been
presented in different cases, and in some situations, there are legacy
platforms not conforming to the conventions of populating root buses with
"ibm,io-base-loc-code" and child nodes with "ibm,slot-location-code".

Specifically, some legacy platforms use "ibm,loc-code" instead, which
stopped working with 7e56f627768.  In addition, EEH PEs for NPU devices
have slot locations specified on the devices instead of buses due to their
architecture, and these were not printed.  This has been fixed by looking
at the top device of a PE for a slot location before checking its bus.

Fixes: 7e56f627768 "powerpc/eeh: Fix PE location code"
Cc:  #4.4+
Signed-off-by: Russell Currey 
---
 arch/powerpc/kernel/eeh_pe.c | 31 ++-
 1 file changed, 26 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/kernel/eeh_pe.c b/arch/powerpc/kernel/eeh_pe.c
index f0520da..034538c 100644
--- a/arch/powerpc/kernel/eeh_pe.c
+++ b/arch/powerpc/kernel/eeh_pe.c
@@ -881,17 +881,34 @@ void eeh_pe_restore_bars(struct eeh_pe *pe)
  * eeh_pe_loc_get - Retrieve location code binding to the given PE
  * @pe: EEH PE
  *
- * Retrieve the location code of the given PE. If the primary PE bus
- * is root bus, we will grab location code from PHB device tree node
- * or root port. Otherwise, the upstream bridge's device tree node
- * of the primary PE bus will be checked for the location code.
+ * Retrieve the location code of the given PE. The first device associated
+ * with the PE is checked for a slot location.  If missing, the bus of the
+ * device is checked instead.  If this is a root bus, the location code is
+ * taken from the PHB device tree node or root port.  If not, the upstream
+ * bridge's device tree node of the primary PE bus will be checked instead.
+ * If a slot location isn't found on the bus, walk through parent buses
+ * until a location is found.
  */
 const char *eeh_pe_loc_get(struct eeh_pe *pe)
 {
-   struct pci_bus *bus = eeh_pe_bus_get(pe);
+   struct pci_bus *bus;
struct device_node *dn;
const char *loc = NULL;
 
+   /* Check the slot location of the first (top) PCI device */
+   struct eeh_dev *edev =
+   list_first_entry_or_null(&pe->edevs, struct eeh_dev, list);
+
+   if (edev) {
+   loc = of_get_property(edev->pdn->node,
+ "ibm,slot-location-code", NULL);
+   if (loc)
+   return loc;
+   }
+
+   /* If there's nothing on the device, look at the bus */
+   bus = eeh_pe_bus_get(pe);
+
while (bus) {
dn = pci_bus_to_OF_node(bus);
if (!dn) {
@@ -905,6 +922,10 @@ const char *eeh_pe_loc_get(struct eeh_pe *pe)
loc = of_get_property(dn, "ibm,slot-location-code",
  NULL);
 
+   /* Fall back to ibm,loc-code if nothing else is found */
+   if (!loc)
+   loc = of_get_property(dn, "ibm,loc-code", NULL);
+
if (loc)
return loc;
 
-- 
2.9.2

Re: [PATCH] crypto: powerpc - CRYPT_CRC32C_VPMSUM should depend on ALTIVEC

2016-08-03 Thread Anton Blanchard

Hi Michael,

> The optimised crc32c implementation depends on VMX (aka. Altivec)
> instructions, so the kernel must be built with Altivec support in
> order for the crc32c code to build.

Thanks for that, looks good.

Acked-by: Anton Blanchard 

> Fixes: 6dd7a82cc54e ("crypto: powerpc - Add POWER8 optimised crc32c")
> Signed-off-by: Michael Ellerman 
> ---
>  crypto/Kconfig | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/crypto/Kconfig b/crypto/Kconfig
> index a9377bef25e3..84d71482bf08 100644
> --- a/crypto/Kconfig
> +++ b/crypto/Kconfig
> @@ -439,7 +439,7 @@ config CRYPTO_CRC32C_INTEL
>  
>  config CRYPT_CRC32C_VPMSUM
>   tristate "CRC32c CRC algorithm (powerpc64)"
> - depends on PPC64
> + depends on PPC64 && ALTIVEC
>   select CRYPTO_HASH
>   select CRC32
>   help

[PATCH] powerpc/book3s: Fix MCE console messages for unrecoverable MCE.

2016-08-03 Thread Mahesh J Salgaonkar

From: Mahesh Salgaonkar 

When machine check occurs with MSR(RI=0), it means MC interrupt is
unrecoverable and kernel goes down to panic path. But the console
message still shows it as recovered. This patch fixes the MCE console
messages.

Signed-off-by: Mahesh Salgaonkar 
---
 arch/powerpc/kernel/mce.c |3 ++-
 arch/powerpc/platforms/powernv/opal.c |2 ++
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
index ef267fd..5e7ece0 100644
--- a/arch/powerpc/kernel/mce.c
+++ b/arch/powerpc/kernel/mce.c
@@ -92,7 +92,8 @@ void save_mce_event(struct pt_regs *regs, long handled,
mce->in_use = 1;
 
mce->initiator = MCE_INITIATOR_CPU;
-   if (handled)
+   /* Mark it recovered if we have handled it and MSR(RI=1). */
+   if (handled && (regs->msr & MSR_RI))
mce->disposition = MCE_DISPOSITION_RECOVERED;
else
mce->disposition = MCE_DISPOSITION_NOT_RECOVERED;
diff --git a/arch/powerpc/platforms/powernv/opal.c 
b/arch/powerpc/platforms/powernv/opal.c
index 5385434..8154171 100644
--- a/arch/powerpc/platforms/powernv/opal.c
+++ b/arch/powerpc/platforms/powernv/opal.c
@@ -401,6 +401,8 @@ static int opal_recover_mce(struct pt_regs *regs,
 
if (!(regs->msr & MSR_RI)) {
/* If MSR_RI isn't set, we cannot recover */
+   printk(KERN_ERR "Machine check interrupt unrecoverable:"
+   " MSR(RI=0)\n");
recovered = 0;
} else if (evt->disposition == MCE_DISPOSITION_RECOVERED) {
/* Platform corrected itself */

Re: [PATCH] powerpc/xics: Properly set Edge/Level type and enable resend

Benjamin Herrenschmidt  writes:

> This sets the type of the interrupt appropriately. We set it as follow:
>
>  - If not mapped from the device-tree, we use edge. This is the case
> of the virtual interrupts and PCI MSIs for example.
>
>  - If mapped from the device-tree and #interrupt-cells is 2 (PAPR
> compliant), we use the second cell to set the appropriate type
>
>  - If mapped from the device-tree and #interrupt-cells is 1 (current
> OPAL on P8 does that), we assume level sensitive since those are
> typically going to be the PSI LSIs which are level sensitive.
>
> Additionally, we mark the interrupts requested via the opal_interrupts
> property all level. This is a bit fishy but the best we can do until we
> fix OPAL to properly expose them with a complete descriptor. It is also
> correct for the current HW anyway as OPAL interrupts are currently PCI
> error and PSI interrupts which are level.
>
> Finally now that edge interrupts are properly identified, we can enable
> CONFIG_HARDIRQS_SW_RESEND which will make the core re-send them if
> they occur while masked, which some drivers rely upon.
>
> This fixes issues with lost interrupts on some Mellanox adapters.
>
> Signed-off-by: Benjamin Herrenschmidt 

Broken since forever?

Cc stable?

cheers

Re: [PATCH v2 3/3] powernv: Fix MCE handler to avoid trashing CR0/CR1 registers.

2016-08-03 Thread Stewart Smith

Mahesh J Salgaonkar  writes:
> From: Mahesh Salgaonkar 
>
> The current implementation of MCE early handling modifies CR0/1 registers
> without saving its old values. Fix this by moving early check for
> powersaving mode to machine_check_handle_early().

>From (internal bug report) it seems as though in a test where one
injects continuous SLB Multi Hit errors, this bug could lead to rebooting
"due to to Platform error" rather than continuing to recover
successfully. It might be a good idea to mention that in commit message
here.

Also, should this go to stable?

-- 
Stewart Smith
OPAL Architect, IBM.

Re: [PATCH v2] powerpc/32: fix csum_partial_copy_generic()

2016-08-03 Thread Scott Wood

On Tue, 2016-08-02 at 10:07 +0200, Christophe Leroy wrote:
> commit 7aef4136566b0 ("powerpc32: rewrite csum_partial_copy_generic()
> based on copy_tofrom_user()") introduced a bug when destination
> address is odd and initial csum is not null
> 
> In that (rare) case the initial csum value has to be rotated one byte
> as well as the resulting value is
> 
> This patch also fixes related comments
> 
> Fixes: 7aef4136566b0 ("powerpc32: rewrite csum_partial_copy_generic()
> based on copy_tofrom_user()")
> Cc: sta...@vger.kernel.org
> 
> Signed-off-by: Christophe Leroy 
> ---
>  v2: updated comments as suggested by Segher
> 
>  arch/powerpc/lib/checksum_32.S | 7 ---
>  1 file changed, 4 insertions(+), 3 deletions(-)

Alessio, can you confirm whether this fixes the problem you reported?

-Scott

> 
> diff --git a/arch/powerpc/lib/checksum_32.S b/arch/powerpc/lib/checksum_32.S
> index d90870a..0a57fe6 100644
> --- a/arch/powerpc/lib/checksum_32.S
> +++ b/arch/powerpc/lib/checksum_32.S
> @@ -127,8 +127,9 @@ _GLOBAL(csum_partial_copy_generic)
>   stw r7,12(r1)
>   stw r8,8(r1)
>  
> - andi.   r0,r4,1 /* is destination
> address even ? */
> - cmplwi  cr7,r0,0
> + rlwinm  r0,r4,3,0x8
> + rlwnm   r6,r6,r0,0,31   /* odd destination address:
> rotate one byte */
> + cmplwi  cr7,r0,0/* is destination address even ? */
>   addic   r12,r6,0
>   addir6,r4,-4
>   neg r0,r4
> @@ -237,7 +238,7 @@ _GLOBAL(csum_partial_copy_generic)
>  66:  addze   r3,r12
>   addir1,r1,16
>   beqlr+  cr7
> - rlwinm  r3,r3,8,0,31/* swap bytes for odd destination
> */
> + rlwinm  r3,r3,8,0,31/* odd destination address:
> rotate one byte */
>   blr
>  
>  /* read fault */

DMARC (and DKIM) problems

Hi all,

For some time we have been coping with DMARC by rewriting the sender address
for any email sent from a site with a restrictive DMARC policy.  This was
because the DKIM verification would fail for such an email once it had been
processed by the mailing list software and so sites (like Yahoo) who
implemented DMARC would bounce such emails.

It turns out that by just not adding the footer to each email, we no
longer break the DKIM signatures.  So, I have turned off the footer and
will leave it that way unless someone objects.

This means that I have also turned off sender address rewriting.

-- 
Cheers,
Stephen Rothwell

test, please ignore again

Just like last time.

-- 
Cheers,
Stephen Rothwell

test, please ignore

I am just testing the interaction of the mailing list with DKIM after removing 
the footer.

-- 
Cheers,
Stephen Rothwell

Re: [PATCH] powernv: Search for new flash DT node location

Jack Miller  writes:

> On Wed, Aug 03, 2016 at 05:16:34PM +1000, Michael Ellerman wrote:
>> We could instead just search for all nodes that are compatible with
>> "ibm,opal-flash". We do that for i2c, see opal_i2c_create_devs().
>> 
>> Is there a particular reason not to do that?
>
> I'm actually surprised that this is preferred. Jeremy mentioned something
> similar, but I guess I just don't like the idea of finding devices in weird
> places in the tree.

But where is "weird". Arguably "/opal/flash" is weird. What does it
mean? There's a bus called "opal" and a device on it called "flash"? No.

Point being the structure is fairly arbitrary, or at least debatable, so
tying the code 100% to the structure is inflexible. As we have discovered.

Our other option is to tell skiboot to get stuffed, and leave the flash
node where it was on P8.

> Then again, if we can't trust the DT we're in bigger
> trouble than erroneous flash nodes =).

Quite :)

> If we really just want to find compatible nodes anywhere, let's simplify i2c
> and pdev_init into one function and make that behavior consistent with this
> new patch.

That seems OK to me.

We should get an ack from Stewart though for the other node types.

cheers
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH] powerpc: move hmi.c to arch/powerpc/kvm/

2016-08-03 Thread Daniel Axtens

Paolo Bonzini  writes:

> hmi.c functions are unused unless sibling_subcore_state is nonzero, and
> that in turn happens only if KVM is in use.  So move the code to
> arch/powerpc/kvm/, putting it under CONFIG_KVM_BOOK3S_64_HANDLER
> rather than CONFIG_PPC_BOOK3S_64.  The sibling_subcore_state is also
> included in struct paca_struct only if KVM is supported by the kernel.

Ok. Initially I was concerned because there are a bunch of non-KVM
related HMI causes (e.g. the CAPP will raise an HMI if it loses the link
to the CAPI card.)
https://github.com/open-power/skiboot/blob/master/core/hmi.c lists lots
of HMIs created by hardware events.

Having said that, you're right that this particular file is KVM
specific.

Reviewed-by: Daniel Axtens 

Mahesh: is there a way to cause the TB to desynchronise and then test if
this resynchronisation works?

Regards,
Daniel

>
> Cc: Paul Mackerras 
> Cc: Michael Ellerman 
> Cc: Mahesh Salgaonkar 
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: kvm-...@vger.kernel.org
> Cc: k...@vger.kernel.org
> Signed-off-by: Paolo Bonzini 
> ---
>   It would be nice to have this in 4.8, to minimize any 4.9 conflicts.
>   Build-tested only, with and without KVM enabled.
>
>  arch/powerpc/include/asm/hmi.h |  2 +-
>  arch/powerpc/include/asm/paca.h| 10 +-
>  arch/powerpc/kernel/Makefile   |  2 +-
>  arch/powerpc/kvm/Makefile  |  1 +
>  arch/powerpc/{kernel/hmi.c => kvm/book3s_hv_hmi.c} |  0
>  5 files changed, 8 insertions(+), 7 deletions(-)
>  rename arch/powerpc/{kernel/hmi.c => kvm/book3s_hv_hmi.c} (100%)
>
> diff --git a/arch/powerpc/include/asm/hmi.h b/arch/powerpc/include/asm/hmi.h
> index 88b4901ac4ee..d3b6ad6e137c 100644
> --- a/arch/powerpc/include/asm/hmi.h
> +++ b/arch/powerpc/include/asm/hmi.h
> @@ -21,7 +21,7 @@
>  #ifndef __ASM_PPC64_HMI_H__
>  #define __ASM_PPC64_HMI_H__
>  
> -#ifdef CONFIG_PPC_BOOK3S_64
> +#ifdef CONFIG_KVM_BOOK3S_64_HANDLER
>  
>  #define  CORE_TB_RESYNC_REQ_BIT  63
>  #define MAX_SUBCORE_PER_CORE 4
> diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
> index 148303e7771f..625321e7e581 100644
> --- a/arch/powerpc/include/asm/paca.h
> +++ b/arch/powerpc/include/asm/paca.h
> @@ -183,11 +183,6 @@ struct paca_struct {
>*/
>   u16 in_mce;
>   u8 hmi_event_available;  /* HMI event is available */
> - /*
> -  * Bitmap for sibling subcore status. See kvm/book3s_hv_ras.c for
> -  * more details
> -  */
> - struct sibling_subcore_state *sibling_subcore_state;
>  #endif
>  
>   /* Stuff for accurate time accounting */
> @@ -202,6 +197,11 @@ struct paca_struct {
>   struct kvmppc_book3s_shadow_vcpu shadow_vcpu;
>  #endif
>   struct kvmppc_host_state kvm_hstate;
> + /*
> +  * Bitmap for sibling subcore status. See kvm/book3s_hv_ras.c for
> +  * more details
> +  */
> + struct sibling_subcore_state *sibling_subcore_state;
>  #endif
>  };
>  
> diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
> index b2027a5cf508..fe4c075bcf50 100644
> --- a/arch/powerpc/kernel/Makefile
> +++ b/arch/powerpc/kernel/Makefile
> @@ -41,7 +41,7 @@ obj-$(CONFIG_VDSO32)+= vdso32/
>  obj-$(CONFIG_HAVE_HW_BREAKPOINT) += hw_breakpoint.o
>  obj-$(CONFIG_PPC_BOOK3S_64)  += cpu_setup_ppc970.o cpu_setup_pa6t.o
>  obj-$(CONFIG_PPC_BOOK3S_64)  += cpu_setup_power.o
> -obj-$(CONFIG_PPC_BOOK3S_64)  += mce.o mce_power.o hmi.o
> +obj-$(CONFIG_PPC_BOOK3S_64)  += mce.o mce_power.o
>  obj-$(CONFIG_PPC_BOOK3E_64)  += exceptions-64e.o idle_book3e.o
>  obj-$(CONFIG_PPC64)  += vdso64/
>  obj-$(CONFIG_ALTIVEC)+= vecemu.o
> diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile
> index 1f9e5529e692..855d4b95d752 100644
> --- a/arch/powerpc/kvm/Makefile
> +++ b/arch/powerpc/kvm/Makefile
> @@ -78,6 +78,7 @@ kvm-book3s_64-builtin-xics-objs-$(CONFIG_KVM_XICS) := \
>  
>  ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
>  kvm-book3s_64-builtin-objs-$(CONFIG_KVM_BOOK3S_64_HANDLER) += \
> + book3s_hv_hmi.o \
>   book3s_hv_rmhandlers.o \
>   book3s_hv_rm_mmu.o \
>   book3s_hv_ras.o \
> diff --git a/arch/powerpc/kernel/hmi.c b/arch/powerpc/kvm/book3s_hv_hmi.c
> similarity index 100%
> rename from arch/powerpc/kernel/hmi.c
> rename to arch/powerpc/kvm/book3s_hv_hmi.c
> -- 
> 1.8.3.1
>
> ___
> Linuxppc-dev mailing list
> Linuxppc-dev@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/linuxppc-dev


signature.asc
Description: PGP signature
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: powerpc allyesconfig / allmodconfig linux-next next-20160729 - next-20160729 build failures

Hi Arnd,

On Wed, 03 Aug 2016 20:52:48 +0200 Arnd Bergmann  wrote:
>
> Most of the difference appears to be in branch trampolines (634 added,
> 559 removed, 14837 unchanged) as you suspect, but I also see a couple
> of symbols show up in vmlinux that were not there before:
> 
> -A __crc_dma_noop_ops
> -D dma_noop_ops
> -R __clz_tab
> -r fdt_errtable
> -r __kcrctab_dma_noop_ops
> -r __kstrtab_dma_noop_ops
> -R __ksymtab_dma_noop_ops
> -t dma_noop_alloc
> -t dma_noop_free
> -t dma_noop_map_page
> -t dma_noop_mapping_error
> -t dma_noop_map_sg
> -t dma_noop_supported
> -T fdt_add_reservemap_entry
> -T fdt_begin_node
> -T fdt_create
> -T fdt_create_empty_tree
> -T fdt_end_node
> -T fdt_finish
> -T fdt_finish_reservemap
> -T fdt_property
> -T fdt_resize
> -T fdt_strerror
> -T find_cpio_data
> 
> From my first look, it seems that all of lib/*.o is now getting linked
> into vmlinux, while we traditionally leave out everything from lib/
> that is not referenced.

You could try removing the --{,no-}whole-archive arguments to ld in
scripts/link-vmlinux.sh b/scripts/link-vmlinux.sh.  Last time I did
that, though, a whole lot of stuff failed to be linked in. (Especially
stuff only referenced by EXPORT_SYMBOL()s, bu that may have been fixed).

> I also see a noticeable overhead in link time, the numbers are for
> a cache-hot rebuild after a successful allyesconfig build, using a
> 24-way Opteron@2.5Ghz, just relinking vmlinux:

I was afraid of that, but it is offset by the time saved by not doing
the "ld -r"s along the way?  It may also be that (for powerpc anyway)
the linker is doing a better job.

-- 
Cheers,
Stephen Rothwell
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH 00/14] Present useful limits to user (v2)

2016-08-03 Thread Topi Miettinen

Hello,

I'm trying the systemtap approach and it looks promising. The script is
annotating strace-like output with capability, device access and RLIMIT
information. In the end there's a summary. Here's sample output from
wpa_supplicant run:

mprotect(0x7efebf14, 16384, PROT_READ) = 0 [DATA 548864 -> 573440]
[AS 44986368 -> 45002752]
brk(0x55d9611f8000) = 94392125718528 missing
[Capabilities=CAP_SYS_ADMIN] [AS 45002752 -> 45010944]
open(0x55d960716462, O_RDWR) = 3 [DeviceAllow=/dev/char/1:3 rw ]
open("/dev/random", O_RDONLY|O_NONBLOCK) = 3 [DeviceAllow=/dev/char/1:8 r ]
socket(PF_LOCAL, SOCK_STREAM|SOCK_CLOEXEC, 0) = 4
[RestrictAddressFamilies=AF_UNIX] [NOFILE 3 -> 4]
open("/etc/wpa_supplicant.conf", O_RDONLY) = 5 [NOFILE 4 -> 5]
socket(PF_NETLINK, SOCK_RAW, 0) = 5 [RestrictAddressFamilies=AF_NETLINK]
socket(PF_NETLINK, SOCK_RAW|SOCK_CLOEXEC, 16) = 6
[RestrictAddressFamilies=AF_NETLINK] [NOFILE 5 -> 6]
socket(PF_NETLINK, SOCK_RAW|SOCK_CLOEXEC, 16) = 7
[RestrictAddressFamilies=AF_NETLINK] [NOFILE 6 -> 7]
socket(PF_INET, SOCK_DGRAM, IPPROTO_IP) = 8
[RestrictAddressFamilies=AF_INET] [NOFILE 7 -> 8]
open("/dev/rfkill", O_RDONLY) = 9 [DeviceAllow=/dev/char/10:58 r ]
[NOFILE 8 -> 9]
socket(PF_LOCAL, SOCK_DGRAM|SOCK_CLOEXEC, 0) = 10
[RestrictAddressFamilies=AF_UNIX] [NOFILE 9 -> 10]
sendmsg(6, 0x7ffc778f35b0, 0x0) = 36 [Capabilities=CAP_NET_ADMIN]

Summary:
CapabilityBoundingSet=CAP_NET_ADMIN CAP_NET_RAW
Consider also missing CapabilityBoundingSet=CAP_SYS_ADMIN
DeviceAllow=/dev/char/1:3 rw
DeviceAllow=/dev/char/1:8 r
DeviceAllow=/dev/char/10:58 r
DeviceAllow=/dev/char/1:9 r
LimitFSIZE=0
LimitDATA=577536
LimitSTACK=139264
LimitCORE=0
LimitNOFILE=15
LimitAS=45146112
LimitNPROC=171
LimitMEMLOCK=0
LimitSIGPENDING=0
LimitMSGQUEUE=0
LimitNICE=0
LimitRTPRIO=0
RestrictAddressFamilies=AF_UNIX AF_INET AF_NETLINK AF_PACKET
MemoryDenyWriteExecute=true

Some values are not correct. NPROC is wrong because staprun needs to be
run as root instead of the separate privileged user for wpa_supplicant
and that messes user process count. DATA/AS/STACK seems to be a bit off.
I can easily use this as systemd service configuration drop-in otherwise.

Now, the relevant part for the kernel is that I'd like to analyze error
paths better, so the system calls would be also annotated when there's a
failure when a RLIMIT is too tight. It would be easier to insert probes
if there was only one path for RLIMIT checks. Would it be OK to make the
function task_rlimit() a full check against the limit and also make it a
non-inlined function, just for improved probing purposes?

There's already error analysis for the capabilities, but there are some
false positive hits (like brk() complaining about missing CAP_SYS_ADMIN
above).

-Topi

#! /bin/sh

# suppress some run-time errors here for cleaner output
//bin/true && exec stap --suppress-handler-errors --skip-badvars $0 ${1+"$@"}

/*
 * Compile:
 * stap -p4 -DSTP_NO_OVERLOAD -m strace
 * Run:
 * /usr/bin/staprun -R -c "/sbin/wpa_supplicant -u -O /run/wpa_supplicant -c 
/etc/wpa_supplicant.conf -i wlan0" -w /root/strace.ko only_capability_use=1 
timestamp=0
 */

/* configuration options; set these with stap -G */
global follow_fork = 0   /* -Gfollow_fork=1 means trace descendant processes 
too */
global timestamp = 1 /* -Gtimestamp=0 means don't print a syscall timestamp 
*/
global elapsed_time = 0  /* -Gelapsed_time=1 means print a syscall duration too 
*/
global only_capability_use = 0 /* -Gonly_capability_use=1 means print only when 
capabilities are used */
global thread_argstr%
global thread_time%

global syscalls_nonreturn[2]
global capnames[64]
global used_caps
global missing_caps
global all_used_caps
global all_missing_caps
global accessed_devices[1000]
global all_accessed_devices[1000]
global highwatermark_fsize
global highwatermark_data
global highwatermark_stack
global highwatermark_core
global highwatermark_nproc
global highwatermark_nofile
global highwatermark_memlock
global highwatermark_as
global highwatermark_sigpending
global highwatermark_msgqueue
global highwatermark_nice
global highwatermark_rtprio
global old_highwatermark_fsize
global old_highwatermark_data
global old_highwatermark_stack
global old_highwatermark_core
global old_highwatermark_nproc
global old_highwatermark_nofile
global old_highwatermark_memlock
global old_highwatermark_as
global old_highwatermark_sigpending
global old_highwatermark_msgqueue
global old_highwatermark_nice
global old_highwatermark_rtprio
global afnames[64]
global used_afs
global missing_afs
global all_used_afs
global all_missing_afs
global no_memory_deny_write_execute
global all_memory_deny_write_execute = "true"
global print_syscall


probe begin 
  {
/* list those syscalls that never .return */
syscalls_nonreturn["exit"]=1
syscalls_nonreturn["exit_group"]=1

// grep '#define CAP_.*[0-9]+$' 
/usr/src/linux-headers*/include/uapi/linux/capability.h | awk '{ print 
"capnames[" $3 "] = \"" $2 "\";" }'
capnames[0] = "CAP_CHOWN";

Re: [PATCH] powerpc: convert 'iommu_alloc failed' messages to dynamic debug

2016-08-03 Thread Mauricio Faria de Oliveira


On 08/03/2016 06:34 PM, Benjamin Herrenschmidt wrote:

I think this is best done by the relevant community maintainer,
I just threw an idea but I'm not that familiar with the details:-)


Ok, sure; got it.


Did you send them to the lkml list ?


Yup, plus a few others lists from get_maintainer.pl iirc.

Mailing list archive links:
- linux-kernel: http://marc.info/?l=linux-kernel&m=146798084822100&w=2
- linux-doc: http://marc.info/?l=linux-doc&m=146798085522104&w=2
- linux-nvme: 
http://lists.infradead.org/pipermail/linux-nvme/2016-July/005349.html
- linuxppc-dev: 
https://lists.ozlabs.org/pipermail/linuxppc-dev/2016-July/145624.html


Thanks,


--
Mauricio Faria de Oliveira
IBM Linux Technology Center

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH] ibmvfc: Set READ FCP_XFER_READY DISABLED bit in PRLI

2016-08-03 Thread Tyrel Datwyler

The READ FCP_XFER_READY DISABLED bit is required to always be set to
one since FCP-3. Set it in the service parameter page frame during
process login.

Signed-off-by: Tyrel Datwyler 
---
 drivers/scsi/ibmvscsi/ibmvfc.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/scsi/ibmvscsi/ibmvfc.c b/drivers/scsi/ibmvscsi/ibmvfc.c
index ab67ec4..4a680ce 100644
--- a/drivers/scsi/ibmvscsi/ibmvfc.c
+++ b/drivers/scsi/ibmvscsi/ibmvfc.c
@@ -3381,6 +3381,7 @@ static void ibmvfc_tgt_send_prli(struct ibmvfc_target 
*tgt)
prli->parms.type = IBMVFC_SCSI_FCP_TYPE;
prli->parms.flags = cpu_to_be16(IBMVFC_PRLI_EST_IMG_PAIR);
prli->parms.service_parms = cpu_to_be32(IBMVFC_PRLI_INITIATOR_FUNC);
+   prli->parms.service_parms |= 
cpu_to_be32(IBMVFC_PRLI_READ_FCP_XFER_RDY_DISABLED);
 
ibmvfc_set_tgt_action(tgt, IBMVFC_TGT_ACTION_INIT_WAIT);
if (ibmvfc_send_event(evt, vhost, default_timeout)) {
-- 
2.7.4

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH] ibmvfc: add FC Class 3 Error Recovery support

2016-08-03 Thread Tyrel Datwyler

The ibmvfc driver currently doesn't support FC Class 3 Error Recovery.
However, it is simply a matter of informing the VIOS that the payload
expects to use sequence level error recovery via a bit flag in the
ibmvfc_cmd structure.

This patch adds a module parameter to enable error recovery support
at boot time. When enabled the RETRY service parameter bit is set
during PRLI, and ibmvfc_cmd->flags includes the IBMVFC_CLASS_3_ERR
bit.

Signed-off-by: Tyrel Datwyler 
---
 drivers/scsi/ibmvscsi/ibmvfc.c | 10 ++
 drivers/scsi/ibmvscsi/ibmvfc.h |  1 +
 2 files changed, 11 insertions(+)

diff --git a/drivers/scsi/ibmvscsi/ibmvfc.c b/drivers/scsi/ibmvscsi/ibmvfc.c
index 4a680ce..6b92169 100644
--- a/drivers/scsi/ibmvscsi/ibmvfc.c
+++ b/drivers/scsi/ibmvscsi/ibmvfc.c
@@ -52,6 +52,7 @@ static unsigned int max_requests = 
IBMVFC_MAX_REQUESTS_DEFAULT;
 static unsigned int disc_threads = IBMVFC_MAX_DISC_THREADS;
 static unsigned int ibmvfc_debug = IBMVFC_DEBUG;
 static unsigned int log_level = IBMVFC_DEFAULT_LOG_LEVEL;
+static unsigned int cls3_error = IBMVFC_CLS3_ERROR;
 static LIST_HEAD(ibmvfc_head);
 static DEFINE_SPINLOCK(ibmvfc_driver_lock);
 static struct scsi_transport_template *ibmvfc_transport_template;
@@ -86,6 +87,9 @@ MODULE_PARM_DESC(debug, "Enable driver debug information. "
 module_param_named(log_level, log_level, uint, 0);
 MODULE_PARM_DESC(log_level, "Set to 0 - 4 for increasing verbosity of device 
driver. "
 "[Default=" __stringify(IBMVFC_DEFAULT_LOG_LEVEL) "]");
+module_param_named(cls3_error, cls3_error, uint, 0);
+MODULE_PARM_DESC(log_level, "Enable FC Class 3 Error Recovery. "
+"[Default=" __stringify(IBMVFC_CLS3_ERROR) "]");
 
 static const struct {
u16 status;
@@ -1335,6 +1339,9 @@ static int ibmvfc_map_sg_data(struct scsi_cmnd *scmd,
struct srp_direct_buf *data = &vfc_cmd->ioba;
struct ibmvfc_host *vhost = dev_get_drvdata(dev);
 
+   if (cls3_error)
+   vfc_cmd->flags |= cpu_to_be16(IBMVFC_CLASS_3_ERR);
+
sg_mapped = scsi_dma_map(scmd);
if (!sg_mapped) {
vfc_cmd->flags |= cpu_to_be16(IBMVFC_NO_MEM_DESC);
@@ -3383,6 +3390,9 @@ static void ibmvfc_tgt_send_prli(struct ibmvfc_target 
*tgt)
prli->parms.service_parms = cpu_to_be32(IBMVFC_PRLI_INITIATOR_FUNC);
prli->parms.service_parms |= 
cpu_to_be32(IBMVFC_PRLI_READ_FCP_XFER_RDY_DISABLED);
 
+   if (cls3_error)
+   prli->parms.service_parms |= cpu_to_be32(IBMVFC_PRLI_RETRY);
+
ibmvfc_set_tgt_action(tgt, IBMVFC_TGT_ACTION_INIT_WAIT);
if (ibmvfc_send_event(evt, vhost, default_timeout)) {
vhost->discovery_threads--;
diff --git a/drivers/scsi/ibmvscsi/ibmvfc.h b/drivers/scsi/ibmvscsi/ibmvfc.h
index 8fae032..7f9bb07 100644
--- a/drivers/scsi/ibmvscsi/ibmvfc.h
+++ b/drivers/scsi/ibmvscsi/ibmvfc.h
@@ -54,6 +54,7 @@
 #define IBMVFC_DEV_LOSS_TMO(5 * 60)
 #define IBMVFC_DEFAULT_LOG_LEVEL   2
 #define IBMVFC_MAX_CDB_LEN 16
+#define IBMVFC_CLS3_ERROR  0
 
 /*
  * Ensure we have resources for ERP and initialization:
-- 
2.7.4

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH 0/2] ibmvfc: FC-TAPE Support

2016-08-03 Thread Tyrel Datwyler

This patchset introduces optional FC-TAPE/FC Class 3 Error Recovery to the
ibmvfc client driver.

Tyrel Datwyler (2):
  ibmvfc: Set READ FCP_XFER_READY DISABLED bit in PRLI
  ibmvfc: add FC Class 3 Error Recovery support

 drivers/scsi/ibmvscsi/ibmvfc.c | 11 +++
 drivers/scsi/ibmvscsi/ibmvfc.h |  1 +
 2 files changed, 12 insertions(+)

-- 
2.7.4

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH] powerpc: convert 'iommu_alloc failed' messages to dynamic debug

2016-08-03 Thread Benjamin Herrenschmidt

On Wed, 2016-08-03 at 16:39 -0300, Mauricio Faria de Oliveira wrote:
> Hi Ben,
> 
> On 06/13/2016 06:26 PM, Benjamin Herrenschmidt wrote:
> > 
> > Another option would be to use a dma_attr for silencing mapping
> > errors
> > which NVME could use provided it does handle them gracefully ...
> 
> I recently submitted patches that implement your suggestion [1].
> May you please review/comment if they're OK with you?

I think this is best done by the relevant community maintainer,
I just threw an idea but I'm not that familiar with the details :-)

Did you send them to the lkml list ?

> Thanks!
> 
> [1] https://lists.ozlabs.org/pipermail/linuxppc-dev/2016-August/14685
> 0.html
> 
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: powerpc allyesconfig / allmodconfig linux-next next-20160729 - next-20160729 build failures

On Wednesday, August 3, 2016 2:44:29 PM CEST Segher Boessenkool wrote:
> Hi Arnd,
> 
> On Wed, Aug 03, 2016 at 08:52:48PM +0200, Arnd Bergmann wrote:
> > From my first look, it seems that all of lib/*.o is now getting linked
> > into vmlinux, while we traditionally leave out everything from lib/
> > that is not referenced.
> > 
> > I also see a noticeable overhead in link time, the numbers are for
> > a cache-hot rebuild after a successful allyesconfig build, using a
> > 24-way Opteron@2.5Ghz, just relinking vmlinux:
> > 
> > $ time make skj30 vmlinux # before
> > real2m8.092s
> > user3m41.008s
> > sys 0m48.172s
> > 
> > $ time make skj30 vmlinux # after
> > real4m10.189s
> > user5m43.804s
> > sys 0m52.988s
> 
> Is it better when using rcT instead of rcsT?

It seems to be noticeably better for the clean rebuild case, though
not as good as the original:

real3m34.015s
user5m7.104s
sys 0m49.172s

I've also tried now with my own patch applied as well (linking
each drivers/*/built-in.o into vmlinux rather than having them
linked into drivers/built-in.o first), but that makes no
difference.

Arnd
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: powerpc allyesconfig / allmodconfig linux-next next-20160729 - next-20160729 build failures

2016-08-03 Thread Segher Boessenkool

Hi Arnd,

On Wed, Aug 03, 2016 at 08:52:48PM +0200, Arnd Bergmann wrote:
> From my first look, it seems that all of lib/*.o is now getting linked
> into vmlinux, while we traditionally leave out everything from lib/
> that is not referenced.
> 
> I also see a noticeable overhead in link time, the numbers are for
> a cache-hot rebuild after a successful allyesconfig build, using a
> 24-way Opteron@2.5Ghz, just relinking vmlinux:
> 
> $ time make skj30 vmlinux # before
> real  2m8.092s
> user  3m41.008s
> sys   0m48.172s
> 
> $ time make skj30 vmlinux # after
> real  4m10.189s
> user  5m43.804s
> sys   0m52.988s

Is it better when using rcT instead of rcsT?


Segher
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH] powerpc: convert 'iommu_alloc failed' messages to dynamic debug

2016-08-03 Thread Mauricio Faria de Oliveira


Hi Ben,

On 06/13/2016 06:26 PM, Benjamin Herrenschmidt wrote:

Another option would be to use a dma_attr for silencing mapping errors
which NVME could use provided it does handle them gracefully ...


I recently submitted patches that implement your suggestion [1].
May you please review/comment if they're OK with you?

Thanks!

[1] https://lists.ozlabs.org/pipermail/linuxppc-dev/2016-August/146850.html

--
Mauricio Faria de Oliveira
IBM Linux Technology Center

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[RESEND][PATCH v2 2/2] powerpc/fadump: parse fadump reserve memory size based on memory range

2016-08-03 Thread Hari Bathini

Currently, memory for fadump can be specified with fadump_reserve_mem=size,
where only a fixed size can be specified. Add the below syntax as well, to
support conditional reservation based on system memory size:

fadump_reserve_mem=:[,:,...]

This syntax helps using the same commandline parameter for different system
memory sizes.

Signed-off-by: Hari Bathini 
Reviewed-by: Mahesh J Salgaonkar 
---
 arch/powerpc/kernel/fadump.c |   64 --
 1 file changed, 55 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index b3a6633..4661ae6 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -193,6 +193,56 @@ static unsigned long init_fadump_mem_struct(struct 
fadump_mem_struct *fdm,
return addr;
 }
 
+/*
+ * This function parses command line for fadump_reserve_mem=
+ *
+ * Supports the below two syntaxes:
+ *1. fadump_reserve_mem=size
+ *2. fadump_reserve_mem=ramsize-range:size[,...]
+ *
+ * Sets fw_dump.reserve_bootvar with the memory size
+ * provided, 0 otherwise
+ *
+ * The function returns -EINVAL on failure, 0 otherwise.
+ */
+static int __init parse_fadump_reserve_mem(void)
+{
+   char *name = "fadump_reserve_mem=";
+   char *fadump_cmdline = NULL, *cur;
+
+   fw_dump.reserve_bootvar = 0;
+
+   /* find fadump_reserve_mem and use the last one if there are many */
+   cur = strstr(boot_command_line, name);
+   while (cur) {
+   fadump_cmdline = cur;
+   cur = strstr(cur+1, name);
+   }
+
+   /* when no fadump_reserve_mem= cmdline option is provided */
+   if (!fadump_cmdline)
+   return 0;
+
+   fadump_cmdline += strlen(name);
+
+   /* for fadump_reserve_mem=size cmdline syntax */
+   if (!is_param_range_based(fadump_cmdline)) {
+   fw_dump.reserve_bootvar = memparse(fadump_cmdline, NULL);
+   return 0;
+   }
+
+   /* for fadump_reserve_mem=ramsize-range:size[,...] cmdline syntax */
+   cur = fadump_cmdline;
+   fw_dump.reserve_bootvar = parse_mem_range_size("fadump_reserve_mem",
+   &cur, memblock_phys_mem_size());
+   if (cur == fadump_cmdline) {
+   printk(KERN_INFO "fadump_reserve_mem: Invaild syntax!\n");
+   return -EINVAL;
+   }
+
+   return 0;
+}
+
 /**
  * fadump_calculate_reserve_size(): reserve variable boot area 5% of System RAM
  *
@@ -212,12 +262,17 @@ static inline unsigned long 
fadump_calculate_reserve_size(void)
 {
unsigned long size;
 
+   /* sets fw_dump.reserve_bootvar */
+   parse_fadump_reserve_mem();
+
/*
 * Check if the size is specified through fadump_reserve_mem= cmdline
 * option. If yes, then use that.
 */
if (fw_dump.reserve_bootvar)
return fw_dump.reserve_bootvar;
+   else
+   printk(KERN_INFO "fadump: calculating default boot size\n");
 
/* divide by 20 to get 5% of value */
size = memblock_end_of_DRAM() / 20;
@@ -348,15 +403,6 @@ static int __init early_fadump_param(char *p)
 }
 early_param("fadump", early_fadump_param);
 
-/* Look for fadump_reserve_mem= cmdline option */
-static int __init early_fadump_reserve_mem(char *p)
-{
-   if (p)
-   fw_dump.reserve_bootvar = memparse(p, &p);
-   return 0;
-}
-early_param("fadump_reserve_mem", early_fadump_reserve_mem);
-
 static void register_fw_dump(struct fadump_mem_struct *fdm)
 {
int rc;

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[RESEND][PATCH v2 1/2] kexec: refactor code parsing size based on memory range

2016-08-03 Thread Hari Bathini

crashkernel parameter supports different syntaxes to specify the amount
of memory to be reserved for kdump kernel. Below is one of the supported
syntaxes that needs parsing to find the memory size to reserve, based on
memory range:

crashkernel=:[,:,...]

While such parsing is implemented for crashkernel parameter, it applies
to other parameters, like fadump_reserve_mem=, which could use similar
syntax. This patch moves crashkernel's parsing code for above syntax to
to kernel/params.c file for reuse. Two functions is_param_range_based()
and parse_mem_range_size() are added to kernel/params.c file for this
purpose.

Any parameter that uses the above syntax can use is_param_range_based()
function to validate the syntax and parse_mem_range_size() function to
get the parsed memory size. While some code is moved to kernel/params.c
file, there is no change functionality wise in parsing the crashkernel
parameter.

Signed-off-by: Hari Bathini 
---

Changes from v1:
1. Updated changelog

 include/linux/kernel.h |5 +++
 kernel/kexec_core.c|   63 +++-
 kernel/params.c|   96 
 3 files changed, 106 insertions(+), 58 deletions(-)

diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index d96a611..2df7ba2 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -435,6 +435,11 @@ extern char *get_options(const char *str, int nints, int 
*ints);
 extern unsigned long long memparse(const char *ptr, char **retptr);
 extern bool parse_option_str(const char *str, const char *option);
 
+extern bool __init is_param_range_based(const char *cmdline);
+extern unsigned long long __init parse_mem_range_size(const char *param,
+ char **str,
+ unsigned long long 
system_ram);
+
 extern int core_kernel_text(unsigned long addr);
 extern int core_kernel_data(unsigned long addr);
 extern int __kernel_text_address(unsigned long addr);
diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
index 5616755..3a74024 100644
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -1104,59 +1104,9 @@ static int __init parse_crashkernel_mem(char *cmdline,
char *cur = cmdline, *tmp;
 
/* for each entry of the comma-separated list */
-   do {
-   unsigned long long start, end = ULLONG_MAX, size;
-
-   /* get the start of the range */
-   start = memparse(cur, &tmp);
-   if (cur == tmp) {
-   pr_warn("crashkernel: Memory value expected\n");
-   return -EINVAL;
-   }
-   cur = tmp;
-   if (*cur != '-') {
-   pr_warn("crashkernel: '-' expected\n");
-   return -EINVAL;
-   }
-   cur++;
-
-   /* if no ':' is here, than we read the end */
-   if (*cur != ':') {
-   end = memparse(cur, &tmp);
-   if (cur == tmp) {
-   pr_warn("crashkernel: Memory value expected\n");
-   return -EINVAL;
-   }
-   cur = tmp;
-   if (end <= start) {
-   pr_warn("crashkernel: end <= start\n");
-   return -EINVAL;
-   }
-   }
-
-   if (*cur != ':') {
-   pr_warn("crashkernel: ':' expected\n");
-   return -EINVAL;
-   }
-   cur++;
-
-   size = memparse(cur, &tmp);
-   if (cur == tmp) {
-   pr_warn("Memory value expected\n");
-   return -EINVAL;
-   }
-   cur = tmp;
-   if (size >= system_ram) {
-   pr_warn("crashkernel: invalid size\n");
-   return -EINVAL;
-   }
-
-   /* match ? */
-   if (system_ram >= start && system_ram < end) {
-   *crash_size = size;
-   break;
-   }
-   } while (*cur++ == ',');
+   *crash_size = parse_mem_range_size("crashkernel", &cur, system_ram);
+   if (cur == cmdline)
+   return -EINVAL;
 
if (*crash_size > 0) {
while (*cur && *cur != ' ' && *cur != '@')
@@ -1293,7 +1243,6 @@ static int __init __parse_crashkernel(char *cmdline,
 const char *name,
 const char *suffix)
 {
-   char*first_colon, *first_space;
char*ck_cmdline;
 
BUG_ON(!crash_size || !crash_base);
@@ -1311,12 +1260,10 @@ static int __init __parse_crashkernel(char *cmdline,
return parse_crashkernel_suffix(ck_cmdline, crash_size,
s

[RESEND][PATCH v2 0/2] powerpc/fadump: support memory range syntax for fadump memory reservation

2016-08-03 Thread Hari Bathini

This patchset adds support to input system memory range based memory size
for fadump reservation. The crashkernel parameter already supports such
syntax. The first patch refactors the parsing code of crashkernel parameter
for reuse. The second patch uses the newly refactored parsing code to reserve
memory for fadump based on system memory size.

---

Hari Bathini (2):
  kexec: refactor code parsing size based on memory range
  powerpc/fadump: parse fadump reserve memory size based on memory range


 arch/powerpc/kernel/fadump.c |   64 
 include/linux/kernel.h   |5 ++
 kernel/kexec_core.c  |   63 ++--
 kernel/params.c  |   96 ++
 4 files changed, 161 insertions(+), 67 deletions(-)

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH] lkdtm: Mark lkdtm_rodata_do_nothing() notrace

2016-08-03 Thread Kees Cook

On Tue, Aug 2, 2016 at 9:59 PM, Michael Ellerman  wrote:
> lkdtm_rodata_do_nothing() is an empty function which is generated in
> order to test the non-executability of rodata.
>
> Currently if function tracing is enabled then an mcount callsite will be
> generated for lkdtm_rodata_do_nothing(), and it will appear in the list
> of available functions for function tracing (available_filter_functions).
>
> Given it's purpose purely as a test function, it seems preferable for
> lkdtm_rodata_do_nothing() to be marked notrace, so it doesn't appear as
> traceable.
>
> This also avoids triggering a linker bug on powerpc:
>
>   https://sourceware.org/bugzilla/show_bug.cgi?id=20428
>
> When the linker sees code that needs to generate a call stub, eg. a
> branch to mcount(), it assumes the section is executable and
> dereferences a NULL pointer leading to a linker segfault. Marking
> lkdtm_rodata_do_nothing() notrace avoids triggering the bug because the
> function contains no other function calls.
>
> Signed-off-by: Michael Ellerman 

Awesome! Thanks for tracking this down. I've applied it to my tree, it
should get picked up by Greg on my next pull request.

-Kees

> ---
>  drivers/misc/lkdtm_rodata.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/misc/lkdtm_rodata.c b/drivers/misc/lkdtm_rodata.c
> index 166b1db3969f..3564477b8c2d 100644
> --- a/drivers/misc/lkdtm_rodata.c
> +++ b/drivers/misc/lkdtm_rodata.c
> @@ -4,7 +4,7 @@
>   */
>  #include "lkdtm.h"
>
> -void lkdtm_rodata_do_nothing(void)
> +void notrace lkdtm_rodata_do_nothing(void)
>  {
> /* Does nothing. We just want an architecture agnostic "return". */
>  }
> --
> 2.7.4
>



-- 
Kees Cook
Brillo & Chrome OS Security
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: powerpc allyesconfig / allmodconfig linux-next next-20160729 - next-20160729 build failures

On Thursday, August 4, 2016 1:37:29 AM CEST Nicholas Piggin wrote:
> 
> I've attached what I'm using, which builds and runs for me without
> any work. Your arch obviously has to select the option to use it.
> 
> text  data bss  dec   hex filename
> 11196784  1185024  1923820  14305628  da495c  vmlinuxppc64.before
> 11187536  1181848  1923176  14292560  da1650  vmlinuxppc64.after
> 
> ~9K text saving, ~3K data saving. I assume this comes from fewer
> branch trampolines and toc entries, but haven't verified exactly.

The patch seems to work great, but for me it's getting bigger
(compared to my older patch, mainline allyesconfig doesn't build):

   textdata bss dec hex filename
512998684259955923362148117261575   6fd4507 
vmlinuxarm.before
513025454259501523361884117259444   6fd3cb4 
vmlinuxarm.after

Most of the difference appears to be in branch trampolines (634 added,
559 removed, 14837 unchanged) as you suspect, but I also see a couple
of symbols show up in vmlinux that were not there before:

-A __crc_dma_noop_ops
-D dma_noop_ops
-R __clz_tab
-r fdt_errtable
-r __kcrctab_dma_noop_ops
-r __kstrtab_dma_noop_ops
-R __ksymtab_dma_noop_ops
-t dma_noop_alloc
-t dma_noop_free
-t dma_noop_map_page
-t dma_noop_mapping_error
-t dma_noop_map_sg
-t dma_noop_supported
-T fdt_add_reservemap_entry
-T fdt_begin_node
-T fdt_create
-T fdt_create_empty_tree
-T fdt_end_node
-T fdt_finish
-T fdt_finish_reservemap
-T fdt_property
-T fdt_resize
-T fdt_strerror
-T find_cpio_data

From my first look, it seems that all of lib/*.o is now getting linked
into vmlinux, while we traditionally leave out everything from lib/
that is not referenced.

I also see a noticeable overhead in link time, the numbers are for
a cache-hot rebuild after a successful allyesconfig build, using a
24-way Opteron@2.5Ghz, just relinking vmlinux:

$ time make skj30 vmlinux # before
real2m8.092s
user3m41.008s
sys 0m48.172s

$ time make skj30 vmlinux # after
real4m10.189s
user5m43.804s
sys 0m52.988s

That is clearly a very sharp difference. Fortunately for the defconfig
build, the times are much lower, and I see no real difference other
than the noise between subsequent runs:

$ time make skj30 vmlinux # before
real0m5.415s
user0m19.716s
sys 0m9.356s
$ time make skj30 vmlinux # before
real0m9.536s
user0m21.320s
sys 0m9.224s


$ time make skj30 vmlinux # after
real0m5.539s
user0m20.360s
sys 0m9.224s

$ time make skj30 vmlinux # after
real0m9.138s
user0m21.932s
sys 0m8.988s

$ time make skj30 vmlinux # after
real0m5.659s
user0m20.332s
sys 0m9.620s

Arnd
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH 1/2] mm: Allow disabling deferred struct page initialisation

2016-08-03 Thread Dave Hansen

On 08/02/2016 11:38 PM, Srikar Dronamraju wrote:
> * Dave Hansen  [2016-08-02 11:09:21]:
>> On 08/02/2016 06:19 AM, Srikar Dronamraju wrote:
>>> Kernels compiled with CONFIG_DEFERRED_STRUCT_PAGE_INIT will initialise
>>> only certain size memory per node. The certain size takes into account
>>> the dentry and inode cache sizes. However such a kernel when booting a
>>> secondary kernel will not be able to allocate the required amount of
>>> memory to suffice for the dentry and inode caches. This results in
>>> crashes like the below on large systems such as 32 TB systems.
>>
>> What's a "secondary kernel"?
>>
> I mean the kernel thats booted to collect the crash, On fadump, the
> first kernel acts as the secondary kernel i.e the same kernel is booted
> to collect the crash.

OK, but I'm still not seeing what the problem is.  You've said that it
crashes and that it crashes during inode/dentry cache allocation.

But, *why* does the same kernel image crash in when it is used as a
"secondary kernel"?

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH v2] powernv: Simplify searching for compatible device nodes

2016-08-03 Thread Jack Miller

(rebased on powerpc/next)

This condenses the opal node searching into a single function that finds
all compatible nodes, instead of just searching the ibm,opal children,
for ipmi, flash, and prd similar to how opal-i2c nodes are found.

Signed-off-by: Jack Miller 
---
 arch/powerpc/platforms/powernv/opal.c | 24 +++-
 1 file changed, 7 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/opal.c 
b/arch/powerpc/platforms/powernv/opal.c
index 8b4fc68..9db12ce 100644
--- a/arch/powerpc/platforms/powernv/opal.c
+++ b/arch/powerpc/platforms/powernv/opal.c
@@ -631,21 +631,11 @@ static void __init opal_dump_region_init(void)
"rc = %d\n", rc);
 }
 
-static void opal_pdev_init(struct device_node *opal_node,
-   const char *compatible)
+static void opal_pdev_init(const char *compatible)
 {
struct device_node *np;
 
-   for_each_child_of_node(opal_node, np)
-   if (of_device_is_compatible(np, compatible))
-   of_platform_device_create(np, NULL, NULL);
-}
-
-static void opal_i2c_create_devs(void)
-{
-   struct device_node *np;
-
-   for_each_compatible_node(np, NULL, "ibm,opal-i2c")
+   for_each_compatible_node(np, NULL, compatible)
of_platform_device_create(np, NULL, NULL);
 }
 
@@ -717,7 +707,7 @@ static int __init opal_init(void)
opal_hmi_handler_init();
 
/* Create i2c platform devices */
-   opal_i2c_create_devs();
+   opal_pdev_init("ibm,opal-i2c");
 
/* Setup a heatbeat thread if requested by OPAL */
opal_init_heartbeat();
@@ -752,12 +742,12 @@ static int __init opal_init(void)
}
 
/* Initialize platform devices: IPMI backend, PRD & flash interface */
-   opal_pdev_init(opal_node, "ibm,opal-ipmi");
-   opal_pdev_init(opal_node, "ibm,opal-flash");
-   opal_pdev_init(opal_node, "ibm,opal-prd");
+   opal_pdev_init("ibm,opal-ipmi");
+   opal_pdev_init("ibm,opal-flash");
+   opal_pdev_init("ibm,opal-prd");
 
/* Initialise platform device: oppanel interface */
-   opal_pdev_init(opal_node, "ibm,opal-oppanel");
+   opal_pdev_init("ibm,opal-oppanel");
 
/* Initialise OPAL kmsg dumper for flushing console on panic */
opal_kmsg_init();
-- 
2.9.2

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH] powernv: Simplify searching for compatible device nodes

2016-08-03 Thread Jack Miller

This condenses the opal node searching into a single function that finds
all compatible nodes, instead of just searching the ibm,opal children,
for ipmi, flash, and prd similar to how opal-i2c nodes are found.

Signed-off-by: Jack Miller 
---
 arch/powerpc/platforms/powernv/opal.c | 22 ++
 1 file changed, 6 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/opal.c 
b/arch/powerpc/platforms/powernv/opal.c
index ae29eaf..86b7352 100644
--- a/arch/powerpc/platforms/powernv/opal.c
+++ b/arch/powerpc/platforms/powernv/opal.c
@@ -640,21 +640,11 @@ static void __init opal_dump_region_init(void)
"rc = %d\n", rc);
 }
 
-static void opal_pdev_init(struct device_node *opal_node,
-   const char *compatible)
+static void opal_pdev_init(const char *compatible)
 {
struct device_node *np;
 
-   for_each_child_of_node(opal_node, np)
-   if (of_device_is_compatible(np, compatible))
-   of_platform_device_create(np, NULL, NULL);
-}
-
-static void opal_i2c_create_devs(void)
-{
-   struct device_node *np;
-
-   for_each_compatible_node(np, NULL, "ibm,opal-i2c")
+   for_each_compatible_node(np, NULL, compatible)
of_platform_device_create(np, NULL, NULL);
 }
 
@@ -722,7 +712,7 @@ static int __init opal_init(void)
opal_hmi_handler_init();
 
/* Create i2c platform devices */
-   opal_i2c_create_devs();
+   opal_pdev_init("ibm,opal-i2c");
 
/* Setup a heatbeat thread if requested by OPAL */
opal_init_heartbeat();
@@ -754,9 +744,9 @@ static int __init opal_init(void)
}
 
/* Initialize platform devices: IPMI backend, PRD & flash interface */
-   opal_pdev_init(opal_node, "ibm,opal-ipmi");
-   opal_pdev_init(opal_node, "ibm,opal-flash");
-   opal_pdev_init(opal_node, "ibm,opal-prd");
+   opal_pdev_init("ibm,opal-ipmi");
+   opal_pdev_init("ibm,opal-flash");
+   opal_pdev_init("ibm,opal-prd");
 
/* Initialise OPAL kmsg dumper for flushing console on panic */
opal_kmsg_init();
-- 
2.9.2

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH] powernv: Search for new flash DT node location

2016-08-03 Thread Jack Miller

On Wed, Aug 03, 2016 at 05:16:34PM +1000, Michael Ellerman wrote:
> We could instead just search for all nodes that are compatible with
> "ibm,opal-flash". We do that for i2c, see opal_i2c_create_devs().
> 
> Is there a particular reason not to do that?

I'm actually surprised that this is preferred. Jeremy mentioned something
similar, but I guess I just don't like the idea of finding devices in weird
places in the tree. Then again, if we can't trust the DT we're in bigger
trouble than erroneous flash nodes =).

If we really just want to find compatible nodes anywhere, let's simplify i2c
and pdev_init into one function and make that behavior consistent with this
new patch.

- Jack

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: powerpc allyesconfig / allmodconfig linux-next next-20160729 - next-20160729 build failures

2016-08-03 Thread Nicholas Piggin

On Wed, 03 Aug 2016 14:29:13 +0200
Arnd Bergmann  wrote:

> On Wednesday, August 3, 2016 10:19:11 PM CEST Stephen Rothwell wrote:
> > Hi Arnd,
> > 
> > On Wed, 03 Aug 2016 09:52:23 +0200 Arnd Bergmann  wrote:  
> > >
> > > Using a different way to link the kernel would also help us with
> > > the remaining allyesconfig problem on ARM, as the problem is only in
> > > 'ld -r' not producing trampolines for symbols that later cannot get
> > > them any more. It would probably also help building with ld.gold,
> > > which is currently not working.
> > > 
> > > What is your suggested alternative?  
> > 
> > I have a patch that make the built-in.o files into thin archives (same
> > as archives, but the actual objects are replaced with the name of the
> > original object file).  That way the final link has all the original
> > objects.  I haven't checked to see what the overheads of doing it this
> > way is.
> > 
> > Nick Piggin has just today taken my old patch (it was last rebased to
> > v4.4-rc1) and tried it on a recent kernel and it still seems to mostly
> > work.  It probably needs some tidying up, but you are welcome to test
> > it if you want to.  
> 
> Sure, I'll certainly give it a try on ARM when you send me a copy.

I've attached what I'm using, which builds and runs for me without
any work. Your arch obviously has to select the option to use it.

text  data bss  dec   hex filename
11196784  1185024  1923820  14305628  da495c  vmlinuxppc64.before
11187536  1181848  1923176  14292560  da1650  vmlinuxppc64.after

~9K text saving, ~3K data saving. I assume this comes from fewer
branch trampolines and toc entries, but haven't verified exactly.



commit 8bc3ca4798c215e9a9107b6d44408f0af259f84f
Author: Stephen Rothwell 
Date:   Tue Oct 30 12:14:18 2012 +1100

kbuild: allow architectures to use thin archives instead of ld -r

Alan Modra has been trying to convince the kernel developers that ld -r
is "evil" for many years.  This is an alternative and means that the
linker has much more information available to it when it links the
kernel.

Signed-off-by: Stephen Rothwell 

diff --git a/arch/Kconfig b/arch/Kconfig
index d794384..1330bf4 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -424,6 +424,12 @@ config CC_STACKPROTECTOR_STRONG
 
 endchoice
 
+config THIN_ARCHIVES
+   bool
+   help
+ Select this if the architecture wants to use thin archives
+ instead of ld -r to create the built-in.o files.
+
 config HAVE_CONTEXT_TRACKING
bool
help
diff --git a/scripts/Makefile.build b/scripts/Makefile.build
index 0d1ca5b..bbf60b3 100644
--- a/scripts/Makefile.build
+++ b/scripts/Makefile.build
@@ -358,10 +358,15 @@ $(sort $(subdir-obj-y)): $(subdir-ym) ;
 # Rule to compile a set of .o files into one .o file
 #
 ifdef builtin-target
+ifdef CONFIG_THIN_ARCHIVES
+  cmd_make_builtin = rm -f $@; $(AR) rcsT$(KBUILD_ARFLAGS)
+else
+  cmd_make_builtin = $(LD) $(ld_flags) -r -o
+endif
 quiet_cmd_link_o_target = LD  $@
 # If the list of objects to link is empty, just create an empty built-in.o
 cmd_link_o_target = $(if $(strip $(obj-y)),\
- $(LD) $(ld_flags) -r -o $@ $(filter $(obj-y), $^) \
+ $(cmd_make_builtin) $@ $(filter $(obj-y), $^) \
  $(cmd_secanalysis),\
  rm -f $@; $(AR) rcs$(KBUILD_ARFLAGS) $@)
 
diff --git a/scripts/link-vmlinux.sh b/scripts/link-vmlinux.sh
index f0f6d9d..ef4658f 100755
--- a/scripts/link-vmlinux.sh
+++ b/scripts/link-vmlinux.sh
@@ -41,8 +41,14 @@ info()
 # ${1} output file
 modpost_link()
 {
-   ${LD} ${LDFLAGS} -r -o ${1} ${KBUILD_VMLINUX_INIT}   \
-   --start-group ${KBUILD_VMLINUX_MAIN} --end-group
+   local objects
+
+   if [ -n "${CONFIG_THIN_ARCHIVES}" ]; then
+   objects="--whole-archive ${KBUILD_VMLINUX_INIT} 
${KBUILD_VMLINUX_MAIN} --no-whole-archive"
+   else
+   objects="${KBUILD_VMLINUX_INIT} --start-group 
${KBUILD_VMLINUX_MAIN} --end-group"
+   fi
+   ${LD} ${LDFLAGS} -r -o ${1} ${objects}
 }
 
 # Link of vmlinux
@@ -51,11 +57,16 @@ modpost_link()
 vmlinux_link()
 {
local lds="${objtree}/${KBUILD_LDS}"
+   local objects
 
if [ "${SRCARCH}" != "um" ]; then
+   if [ -n "${CONFIG_THIN_ARCHIVES}" ]; then
+   objects="--whole-archive ${KBUILD_VMLINUX_INIT} 
${KBUILD_VMLINUX_MAIN} --no-whole-archive"
+   else
+   objects="${KBUILD_VMLINUX_INIT} --start-group 
${KBUILD_VMLINUX_MAIN} --end-group"
+   fi
${LD} ${LDFLAGS} ${LDFLAGS_vmlinux} -o ${2}  \
-   -T ${lds} ${KBUILD_VMLINUX_INIT} \
-   --start-group ${KBUILD_VMLINUX_MAIN} --end-group ${1}
+   -T ${lds} ${objects} ${1}
else
${CC} ${CFLAGS_vmlinux} -o

Re: [v4] Fix to avoid IS_ERR_VALUE and IS_ERR abuses on 64bit systems.

2016-08-03 Thread arvind Yadav




On Wednesday 03 August 2016 01:27 AM, Scott Wood wrote:

On 08/02/2016 10:34 AM, arvind Yadav wrote:


On Tuesday 02 August 2016 01:15 PM, Arnd Bergmann wrote:

On Monday, August 1, 2016 4:55:43 PM CEST Scott Wood wrote:

On 08/01/2016 02:02 AM, Arnd Bergmann wrote:

diff --git a/include/linux/err.h b/include/linux/err.h
index 1e35588..c2a2789 100644
--- a/include/linux/err.h
+++ b/include/linux/err.h
@@ -18,7 +18,17 @@
  
  #ifndef __ASSEMBLY__
  
-#define IS_ERR_VALUE(x) unlikely((unsigned long)(void *)(x) >= (unsigned long)-MAX_ERRNO)

+#define IS_ERR_VALUE(x) unlikely(is_error_check(x))
+
+static inline int is_error_check(unsigned long error)

Please leave the existing macro alone. I think you were looking for
something specific to the return code of qe_muram_alloc() function,
so please add a helper in that subsystem if you need it, not in
the generic header files.

qe_muram_alloc (a.k.a. cpm_muram_alloc) returns unsigned long.  The
problem is certain callers that store the return value in a u32.  Why
not just fix those callers to store it in unsigned long (at least until
error checking is done)?


Yes, that would also address another problem with code like

  kfree((void *)ugeth->tx_bd_ring_offset[i]);

which is not 64-bit safe when tx_bd_ring_offset is a 32-bit value
that also holds the return value of qe_muram_alloc.

Well, hopefully it doesn't hold a return of qe_muram_alloc() when it's
being passed to kfree()...

There's also the code that casts kmalloc()'s return to u32, etc.
ucc_geth is not 64-bit clean in general.


Arnd

Yes, we will fix caller. Caller api is not safe on 64bit.

The API is fine (or at least, I haven't seen a valid issue pointed out
yet).  The problem is the ucc_geth driver.


Even qe_muram_addr(a.k.a. cpm_muram_addr )passing value unsigned int,
but it should be unsigned long.

cpm_muram_addr takes unsigned long as a parameter, not that it matters
since you can't pass errors into it and a muram offset should never
exceed 32 bits.

-Scott

Yes, It will work for 32bit machine. But will not safe for 64bit.

Example :
ugeth->tx_bd_ring_offset[j] =
qe_muram_alloc(length  UCC_GETH_TX_BD_RING_ALIGNMENT);
if (!IS_ERR_VALUE(ugeth->tx_bd_ring_offset[j]))
   ugeth->p_tx_bd_ring[j] =
   (u8 __iomem *) qe_muram_addr(ugeth-> tx_bd_ring_offset[j]);

If qe_muram_alloc will return any error,  IS_ERR_VALUE will
always return 0 (IS_ERR_VALUE will always pass for 'unsigned int').
Now qe_muram_addr will return wrong virtual address. Which
can cause an error.

-Arvind

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH] powerpc/eeh: trivial fix to non-conventional PCI address output on EEH log

2016-08-03 Thread Guilherme G. Piccoli

On 07/24/2016 10:46 PM, Gavin Shan wrote:

On Mon, Jul 25, 2016 at 10:47:13AM +1000, Michael Ellerman wrote:

"Guilherme G. Piccoli" writes:

This is a very minor/trivial fix for the output of PCI address on EEH logs.
The PCI address on "OF node" field currently is using ":" as a separator
for the function, but the usual separator is ".". This patch changes the
separator for dot, so the PCI address is printed as usual.

No functional changes were introduced.

What consumes the log? Can it cope with us changing the formatting?

The log is printed by pr_warn() as part of the EEH kernel log. Also,
it's argument passed to RTAS call "ibm,slot-error-detail" and it's
put into the user data section of the RTAS call's output, which is
used by RTAS daemon (rtasd) then. I don't see anyone expects fixed
format for it in the user data section.

The format was ever adjusted in commit 0ed352dddbfc ("powerpc/eeh:
Reduce lines of log dump") on Jul 17 2014. No complains received
against it so far. I guess nobody cares about the format or there
is a alarm isn't raised yet :)

Thanks,
Gavin

Quick follow-up on this: RTAS daemon stores the information captured via
ibm,slot-error-detail in a log file, which can be accessed using the
command "rtas_dump -f /var/log/platform". More information on this can
be found in
https://www.ibm.com/support/knowledgecenter/linuxonibm/liaau/liaau-diagnosing-rtas-events.htm
.

I was able to check this log and the EEH PCI address output was there,
in ascii text format.

Thanks,

Guilherme

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: powerpc allyesconfig / allmodconfig linux-next next-20160729 - next-20160729 build failures

On Wednesday, August 3, 2016 10:19:11 PM CEST Stephen Rothwell wrote:
> Hi Arnd,
> 
> On Wed, 03 Aug 2016 09:52:23 +0200 Arnd Bergmann  wrote:
> >
> > Using a different way to link the kernel would also help us with
> > the remaining allyesconfig problem on ARM, as the problem is only in
> > 'ld -r' not producing trampolines for symbols that later cannot get
> > them any more. It would probably also help building with ld.gold,
> > which is currently not working.
> > 
> > What is your suggested alternative?
> 
> I have a patch that make the built-in.o files into thin archives (same
> as archives, but the actual objects are replaced with the name of the
> original object file).  That way the final link has all the original
> objects.  I haven't checked to see what the overheads of doing it this
> way is.
> 
> Nick Piggin has just today taken my old patch (it was last rebased to
> v4.4-rc1) and tried it on a recent kernel and it still seems to mostly
> work.  It probably needs some tidying up, but you are welcome to test
> it if you want to.

Sure, I'll certainly give it a try on ARM when you send me a copy.

Arnd
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: powerpc allyesconfig / allmodconfig linux-next next-20160729 - next-20160729 build failures

Hi Arnd,

On Wed, 03 Aug 2016 09:52:23 +0200 Arnd Bergmann  wrote:
>
> Using a different way to link the kernel would also help us with
> the remaining allyesconfig problem on ARM, as the problem is only in
> 'ld -r' not producing trampolines for symbols that later cannot get
> them any more. It would probably also help building with ld.gold,
> which is currently not working.
> 
> What is your suggested alternative?

I have a patch that make the built-in.o files into thin archives (same
as archives, but the actual objects are replaced with the name of the
original object file).  That way the final link has all the original
objects.  I haven't checked to see what the overheads of doing it this
way is.

Nick Piggin has just today taken my old patch (it was last rebased to
v4.4-rc1) and tried it on a recent kernel and it still seems to mostly
work.  It probably needs some tidying up, but you are welcome to test
it if you want to.

-- 
Cheers,
Stephen Rothwell
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: linker tables on powerpc - build issues

"Luis R. Rodriguez"  writes:

> I've run into a few compilation issues with linker tables support [0]
> [1] on only a few architectures:
>
> blackfin - compiler issue it seems, I have a work around now in place
> arm  - some alignment issue - still need to iron this out
> powerpc - issue with including  on 
>
> The issue with powerpc can be replicated easily with the patch below,
> and compilation fails even on a 'make defconfig' configuration, the
> issues are recurring include header ordering issues. I've given this
> some tries to fix but am still a bit bewildered how to best do this
> without affecting non-powerpc compilations.  The patch below
> replicates the changes in question, it does not include the linker
> table work at all, it just includes  instead of
>  to reduce and provide an example of the issues
> observed. The list of errors are also pretty endless... so was hoping
> some power folks might be able to take a glance if possible. If you
> have any ideas, please let me know.

What is the end goal?

You want to be able to include asm/sections.h in asm/jump_labels.h? So
that you can get some macros to wrap the pushsection etc, am I right?

The biggest problem I see is dereference_function_descriptor(), which
uses probe_kernel(), which pulls in uaccess.h.

But it doesn't really make sense for dereference_function_descriptor()
to be in sections.h AFAICS.

I'll see if I can unstitch it tomorrow.

cheers
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH] powerpc: move hmi.c to arch/powerpc/kvm/

2016-08-03 Thread Paolo Bonzini

hmi.c functions are unused unless sibling_subcore_state is nonzero, and
that in turn happens only if KVM is in use.  So move the code to
arch/powerpc/kvm/, putting it under CONFIG_KVM_BOOK3S_64_HANDLER
rather than CONFIG_PPC_BOOK3S_64.  The sibling_subcore_state is also
included in struct paca_struct only if KVM is supported by the kernel.

Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: Mahesh Salgaonkar 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: kvm-...@vger.kernel.org
Cc: k...@vger.kernel.org
Signed-off-by: Paolo Bonzini 
---
It would be nice to have this in 4.8, to minimize any 4.9 conflicts.
Build-tested only, with and without KVM enabled.

 arch/powerpc/include/asm/hmi.h |  2 +-
 arch/powerpc/include/asm/paca.h| 10 +-
 arch/powerpc/kernel/Makefile   |  2 +-
 arch/powerpc/kvm/Makefile  |  1 +
 arch/powerpc/{kernel/hmi.c => kvm/book3s_hv_hmi.c} |  0
 5 files changed, 8 insertions(+), 7 deletions(-)
 rename arch/powerpc/{kernel/hmi.c => kvm/book3s_hv_hmi.c} (100%)

diff --git a/arch/powerpc/include/asm/hmi.h b/arch/powerpc/include/asm/hmi.h
index 88b4901ac4ee..d3b6ad6e137c 100644
--- a/arch/powerpc/include/asm/hmi.h
+++ b/arch/powerpc/include/asm/hmi.h
@@ -21,7 +21,7 @@
 #ifndef __ASM_PPC64_HMI_H__
 #define __ASM_PPC64_HMI_H__
 
-#ifdef CONFIG_PPC_BOOK3S_64
+#ifdef CONFIG_KVM_BOOK3S_64_HANDLER
 
 #defineCORE_TB_RESYNC_REQ_BIT  63
 #define MAX_SUBCORE_PER_CORE   4
diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
index 148303e7771f..625321e7e581 100644
--- a/arch/powerpc/include/asm/paca.h
+++ b/arch/powerpc/include/asm/paca.h
@@ -183,11 +183,6 @@ struct paca_struct {
 */
u16 in_mce;
u8 hmi_event_available;  /* HMI event is available */
-   /*
-* Bitmap for sibling subcore status. See kvm/book3s_hv_ras.c for
-* more details
-*/
-   struct sibling_subcore_state *sibling_subcore_state;
 #endif
 
/* Stuff for accurate time accounting */
@@ -202,6 +197,11 @@ struct paca_struct {
struct kvmppc_book3s_shadow_vcpu shadow_vcpu;
 #endif
struct kvmppc_host_state kvm_hstate;
+   /*
+* Bitmap for sibling subcore status. See kvm/book3s_hv_ras.c for
+* more details
+*/
+   struct sibling_subcore_state *sibling_subcore_state;
 #endif
 };
 
diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
index b2027a5cf508..fe4c075bcf50 100644
--- a/arch/powerpc/kernel/Makefile
+++ b/arch/powerpc/kernel/Makefile
@@ -41,7 +41,7 @@ obj-$(CONFIG_VDSO32)  += vdso32/
 obj-$(CONFIG_HAVE_HW_BREAKPOINT)   += hw_breakpoint.o
 obj-$(CONFIG_PPC_BOOK3S_64)+= cpu_setup_ppc970.o cpu_setup_pa6t.o
 obj-$(CONFIG_PPC_BOOK3S_64)+= cpu_setup_power.o
-obj-$(CONFIG_PPC_BOOK3S_64)+= mce.o mce_power.o hmi.o
+obj-$(CONFIG_PPC_BOOK3S_64)+= mce.o mce_power.o
 obj-$(CONFIG_PPC_BOOK3E_64)+= exceptions-64e.o idle_book3e.o
 obj-$(CONFIG_PPC64)+= vdso64/
 obj-$(CONFIG_ALTIVEC)  += vecemu.o
diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile
index 1f9e5529e692..855d4b95d752 100644
--- a/arch/powerpc/kvm/Makefile
+++ b/arch/powerpc/kvm/Makefile
@@ -78,6 +78,7 @@ kvm-book3s_64-builtin-xics-objs-$(CONFIG_KVM_XICS) := \
 
 ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
 kvm-book3s_64-builtin-objs-$(CONFIG_KVM_BOOK3S_64_HANDLER) += \
+   book3s_hv_hmi.o \
book3s_hv_rmhandlers.o \
book3s_hv_rm_mmu.o \
book3s_hv_ras.o \
diff --git a/arch/powerpc/kernel/hmi.c b/arch/powerpc/kvm/book3s_hv_hmi.c
similarity index 100%
rename from arch/powerpc/kernel/hmi.c
rename to arch/powerpc/kvm/book3s_hv_hmi.c
-- 
1.8.3.1

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH 2/2] fadump: Disable deferred page struct initialisation

Vlastimil Babka  writes:

> On 08/03/2016 07:20 AM, Balbir Singh wrote:
>> On Tue, 2016-08-02 at 18:49 +0530, Srikar Dronamraju wrote:
>>> Fadump kernel reserves significant number of memory blocks. On a multi-node
>>> machine, with CONFIG_DEFFERRED_STRUCT_PAGE support, fadump kernel fails to
>>> boot. Fix this by disabling deferred page struct initialisation.
>>>
>>
>> How much memory does a fadump kernel need? Can we bump up the limits 
>> depending
>> on the config. I presume when you say fadump kernel you mean kernel with
>> FADUMP in the config?
>>
>> BTW, I would much rather prefer a config based solution that does not select
>> DEFERRED_INIT if FADUMP is enabled.
>
> IIRC the kdump/fadump kernel is typically the same vmlinux as the main 
> kernel, just with special initrd and boot params. So if you want 
> deferred init for the main kernel, this would be impractical.

Yes. Distros won't build a separate kernel, so it has to work at runtime.

cheers
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [pasemi] Radeon HD graphics card not recognised after the powerpc-4.8-1 commit

2016-08-03 Thread Benjamin Herrenschmidt

On Wed, 2016-08-03 at 11:03 +0200, Christian Zigotzky wrote:
> I reverted the commit "powerpc-4.8-1" and Xorg works. The commit 
> "powerpc-4.8-1" is the problem.
> 
> Link: 
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=bad60e6f259a01cf9f29a1ef8d435ab6c60b2de9
> 
> Which source code modification in the commit "powerpc-4.8-1" could be 
> the problem?

This is a merge, not a commit. Can you bisect down that branch ? Also
include the kernel dmesg log.

Cheers,
Ben.

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH kernel 06/15] powerpc/mm/iommu: Put pages on process exit

2016-08-03 Thread Nicholas Piggin

On Wed,  3 Aug 2016 18:40:47 +1000
Alexey Kardashevskiy  wrote:

> At the moment VFIO IOMMU SPAPR v2 driver pins all guest RAM pages when
> the userspace starts using VFIO. When the userspace process finishes,
> all the pinned pages need to be put; this is done as a part of
> the userspace memory context (MM) destruction which happens on
> the very last mmdrop().
> 
> This approach has a problem that a MM of the userspace process
> may live longer than the userspace process itself as kernel threads
> use userspace process MMs which was runnning on a CPU where
> the kernel thread was scheduled to. If this happened, the MM remains
> referenced until this exact kernel thread wakes up again
> and releases the very last reference to the MM, on an idle system this
> can take even hours.
> 
> This references and caches MM once per container and adds tracking
> how many times each preregistered area was registered in
> a specific container. This way we do not depend on @current pointing to
> a valid task descriptor.
> 
> This changes the userspace interface to return EBUSY if memory is
> already registered (mm_iommu_get() used to increment the counter);
> however it should not have any practical effect as the only
> userspace tool available now does register memory area once per
> container anyway.
> 
> As tce_iommu_register_pages/tce_iommu_unregister_pages are called
> under container->lock, this does not need additional locking.
> 
> Signed-off-by: Alexey Kardashevskiy 

Reviewed-by: Nicholas Piggin 

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH kernel 05/15] powerpc/iommu: Stop using @current in mm_iommu_xxx

2016-08-03 Thread Nicholas Piggin

On Wed,  3 Aug 2016 18:40:46 +1000
Alexey Kardashevskiy  wrote:

> In some situations the userspace memory context may live longer than
> the userspace process itself so if we need to do proper memory context
> cleanup, we better cache @mm and use it later when the process is gone
> (@current or @current->mm are NULL).
> 
> This changes mm_iommu_xxx API to receive mm_struct instead of using one
> from @current.
> 
> This is needed by the following patch to do proper cleanup in time.
> This depends on "powerpc/powernv/ioda: Fix endianness when reading TCEs"
> to do proper cleanup via tce_iommu_clear() patch.
> 
> To keep API consistent, this replaces mm_context_t with mm_struct;
> we stick to mm_struct as mm_iommu_adjust_locked_vm() helper needs
> access to &mm->mmap_sem.
> 
> This should cause no behavioral change.
> 
> Signed-off-by: Alexey Kardashevskiy 

Reviewed-by: Nicholas Piggin 

I still have some questions about the use of mm in the driver, but
those aren't issues introduced by this patch, so as it is I think
the bug fix of this and the next patch is good.

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[pasemi] Radeon HD graphics card not recognised after the powerpc-4.8-1 commit

2016-08-03 Thread Christian Zigotzky


Hello,

I tried to compile the latest Git kernel today. It boots but Xorg 
doesn't work anymore.


[41.210] (++) using VT number 7

[41.341] (II) [KMS] Kernel modesetting enabled.
[41.341] (EE) No devices detected.
[41.341] (EE)
Fatal server error:
[41.341] (EE) no screens found(EE)
[41.341] (EE)

I reverted the commit "powerpc-4.8-1" and Xorg works. The commit 
"powerpc-4.8-1" is the problem.


Link: 
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=bad60e6f259a01cf9f29a1ef8d435ab6c60b2de9


Which source code modification in the commit "powerpc-4.8-1" could be 
the problem?


Cheers,

Christian
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH kernel 09/15] powerpc/mmu: Add real mode support for IOMMU preregistered memory

This makes mm_iommu_lookup() able to work in realmode by replacing
list_for_each_entry_rcu() (which can do debug stuff which can fail in
real mode) with list_for_each_entry_lockless().

This adds realmode version of mm_iommu_ua_to_hpa() which adds
explicit vmalloc'd-to-linear address conversion.
Unlike mm_iommu_ua_to_hpa(), mm_iommu_ua_to_hpa_rm() can fail.

This changes mm_iommu_preregistered() to receive @mm as in real mode
@current does not always have a correct pointer.

This adds realmode version of mm_iommu_lookup() which receives @mm
(for the same reason as for mm_iommu_preregistered()) and uses
lockless version of list_for_each_entry_rcu().

Signed-off-by: Alexey Kardashevskiy 
---
 arch/powerpc/include/asm/mmu_context.h |  4 
 arch/powerpc/mm/mmu_context_iommu.c| 39 ++
 2 files changed, 43 insertions(+)

diff --git a/arch/powerpc/include/asm/mmu_context.h 
b/arch/powerpc/include/asm/mmu_context.h
index a4c4ed5..939030c 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -27,10 +27,14 @@ extern long mm_iommu_put(struct mm_struct *mm,
 extern void mm_iommu_init(struct mm_struct *mm);
 extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup(struct mm_struct *mm,
unsigned long ua, unsigned long size);
+extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup_rm(
+   struct mm_struct *mm, unsigned long ua, unsigned long size);
 extern struct mm_iommu_table_group_mem_t *mm_iommu_find(struct mm_struct *mm,
unsigned long ua, unsigned long entries);
 extern long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
unsigned long ua, unsigned long *hpa);
+extern long mm_iommu_ua_to_hpa_rm(struct mm_iommu_table_group_mem_t *mem,
+   unsigned long ua, unsigned long *hpa);
 extern long mm_iommu_mapped_inc(struct mm_iommu_table_group_mem_t *mem);
 extern void mm_iommu_mapped_dec(struct mm_iommu_table_group_mem_t *mem);
 #endif
diff --git a/arch/powerpc/mm/mmu_context_iommu.c 
b/arch/powerpc/mm/mmu_context_iommu.c
index 10f01fe..36a906c 100644
--- a/arch/powerpc/mm/mmu_context_iommu.c
+++ b/arch/powerpc/mm/mmu_context_iommu.c
@@ -242,6 +242,25 @@ struct mm_iommu_table_group_mem_t *mm_iommu_lookup(struct 
mm_struct *mm,
 }
 EXPORT_SYMBOL_GPL(mm_iommu_lookup);
 
+struct mm_iommu_table_group_mem_t *mm_iommu_lookup_rm(struct mm_struct *mm,
+   unsigned long ua, unsigned long size)
+{
+   struct mm_iommu_table_group_mem_t *mem, *ret = NULL;
+
+   list_for_each_entry_lockless(mem, &mm->context.iommu_group_mem_list,
+   next) {
+   if ((mem->ua <= ua) &&
+   (ua + size <= mem->ua +
+(mem->entries << PAGE_SHIFT))) {
+   ret = mem;
+   break;
+   }
+   }
+
+   return ret;
+}
+EXPORT_SYMBOL_GPL(mm_iommu_lookup_rm);
+
 struct mm_iommu_table_group_mem_t *mm_iommu_find(struct mm_struct *mm,
unsigned long ua, unsigned long entries)
 {
@@ -273,6 +292,26 @@ long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t 
*mem,
 }
 EXPORT_SYMBOL_GPL(mm_iommu_ua_to_hpa);
 
+long mm_iommu_ua_to_hpa_rm(struct mm_iommu_table_group_mem_t *mem,
+   unsigned long ua, unsigned long *hpa)
+{
+   const long entry = (ua - mem->ua) >> PAGE_SHIFT;
+   void *va = &mem->hpas[entry];
+   unsigned long *ra;
+
+   if (entry >= mem->entries)
+   return -EFAULT;
+
+   ra = (void *) vmalloc_to_phys(va);
+   if (!ra)
+   return -EFAULT;
+
+   *hpa = *ra | (ua & ~PAGE_MASK);
+
+   return 0;
+}
+EXPORT_SYMBOL_GPL(mm_iommu_ua_to_hpa_rm);
+
 long mm_iommu_mapped_inc(struct mm_iommu_table_group_mem_t *mem)
 {
if (atomic64_inc_not_zero(&mem->mapped))
-- 
2.5.0.rc3

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH kernel 08/15] powerpc/vfio_spapr_tce: Add reference counting to iommu_table

So far iommu_table obejcts were only used in virtual mode and had
a single owner. We are going to change by implementing in-kernel
acceleration of DMA mapping requests, including real mode.

This adds a kref to iommu_table and defines new helpers to update it.
This replaces iommu_free_table() with iommu_table_put() and makes
iommu_free_table() static. iommu_table_get() is not used in this patch
but will be in the following one.

While we are here, this removes @node_name parameter as it has never been
really useful on powernv and carrying it for the pseries platform code to
iommu_free_table() seems to be quite useless too.

This should cause no behavioral change.

Signed-off-by: Alexey Kardashevskiy 
---
 arch/powerpc/include/asm/iommu.h  |  5 +++--
 arch/powerpc/kernel/iommu.c   | 24 +++-
 arch/powerpc/kernel/vio.c |  2 +-
 arch/powerpc/platforms/powernv/pci-ioda.c | 14 +++---
 arch/powerpc/platforms/powernv/pci.c  |  1 +
 arch/powerpc/platforms/pseries/iommu.c|  3 ++-
 drivers/vfio/vfio_iommu_spapr_tce.c   |  2 +-
 7 files changed, 34 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index f49a72a..cd4df44 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -114,6 +114,7 @@ struct iommu_table {
struct list_head it_group_list;/* List of iommu_table_group_link */
unsigned long *it_userspace; /* userspace view of the table */
struct iommu_table_ops *it_ops;
+   struct krefit_kref;
 };
 
 #define IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry) \
@@ -146,8 +147,8 @@ static inline void *get_iommu_table_base(struct device *dev)
 
 extern int dma_iommu_dma_supported(struct device *dev, u64 mask);
 
-/* Frees table for an individual device node */
-extern void iommu_free_table(struct iommu_table *tbl, const char *node_name);
+extern void iommu_table_get(struct iommu_table *tbl);
+extern void iommu_table_put(struct iommu_table *tbl);
 
 /* Initializes an iommu_table based in values set in the passed-in
  * structure
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 13263b0..a8f017a 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -710,13 +710,13 @@ struct iommu_table *iommu_init_table(struct iommu_table 
*tbl, int nid)
return tbl;
 }
 
-void iommu_free_table(struct iommu_table *tbl, const char *node_name)
+static void iommu_table_free(struct kref *kref)
 {
unsigned long bitmap_sz;
unsigned int order;
+   struct iommu_table *tbl;
 
-   if (!tbl)
-   return;
+   tbl = container_of(kref, struct iommu_table, it_kref);
 
if (tbl->it_ops->free)
tbl->it_ops->free(tbl);
@@ -735,7 +735,7 @@ void iommu_free_table(struct iommu_table *tbl, const char 
*node_name)
 
/* verify that table contains no entries */
if (!bitmap_empty(tbl->it_map, tbl->it_size))
-   pr_warn("%s: Unexpected TCEs for %s\n", __func__, node_name);
+   pr_warn("%s: Unexpected TCEs\n", __func__);
 
/* calculate bitmap size in bytes */
bitmap_sz = BITS_TO_LONGS(tbl->it_size) * sizeof(unsigned long);
@@ -747,7 +747,21 @@ void iommu_free_table(struct iommu_table *tbl, const char 
*node_name)
/* free table */
kfree(tbl);
 }
-EXPORT_SYMBOL_GPL(iommu_free_table);
+
+void iommu_table_get(struct iommu_table *tbl)
+{
+   kref_get(&tbl->it_kref);
+}
+EXPORT_SYMBOL_GPL(iommu_table_get);
+
+void iommu_table_put(struct iommu_table *tbl)
+{
+   if (!tbl)
+   return;
+
+   kref_put(&tbl->it_kref, iommu_table_free);
+}
+EXPORT_SYMBOL_GPL(iommu_table_put);
 
 /* Creates TCEs for a user provided buffer.  The user buffer must be
  * contiguous real kernel storage (not vmalloc).  The address passed here
diff --git a/arch/powerpc/kernel/vio.c b/arch/powerpc/kernel/vio.c
index 8d7358f..188f452 100644
--- a/arch/powerpc/kernel/vio.c
+++ b/arch/powerpc/kernel/vio.c
@@ -1318,7 +1318,7 @@ static void vio_dev_release(struct device *dev)
struct iommu_table *tbl = get_iommu_table_base(dev);
 
if (tbl)
-   iommu_free_table(tbl, of_node_full_name(dev->of_node));
+   iommu_table_put(tbl);
of_node_put(dev->of_node);
kfree(to_vio_dev(dev));
 }
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index 74ab8382..c04afd2 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1394,7 +1394,7 @@ static void pnv_pci_ioda2_release_dma_pe(struct pci_dev 
*dev, struct pnv_ioda_pe
iommu_group_put(pe->table_group.group);
BUG_ON(pe->table_group.group);
}
-   iommu_free_table(tbl, of_node_full_name(dev->dev.of_node));
+   iommu_table_put(tbl);
 }
 
 static void pnv_ioda_release_vf_PE(struc

[PATCH kernel 07/15] powerpc/iommu: Cleanup iommu_table disposal

At the moment iommu_table could be disposed by either calling
iommu_table_free() directly or it_ops::free() which only implementation
for IODA2 calls iommu_table_free() anyway.

As we are going to have reference counting on tables, we need an unified
way of disposing tables.

This moves it_ops::free() call into iommu_free_table() and makes use
of the latter everywhere. The free() callback now handles only
platform-specific data.

This should cause no behavioral change.

Signed-off-by: Alexey Kardashevskiy 
---
 arch/powerpc/kernel/iommu.c   | 4 
 arch/powerpc/platforms/powernv/pci-ioda.c | 6 ++
 drivers/vfio/vfio_iommu_spapr_tce.c   | 2 +-
 3 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index a8e3490..13263b0 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -718,6 +718,9 @@ void iommu_free_table(struct iommu_table *tbl, const char 
*node_name)
if (!tbl)
return;
 
+   if (tbl->it_ops->free)
+   tbl->it_ops->free(tbl);
+
if (!tbl->it_map) {
kfree(tbl);
return;
@@ -744,6 +747,7 @@ void iommu_free_table(struct iommu_table *tbl, const char 
*node_name)
/* free table */
kfree(tbl);
 }
+EXPORT_SYMBOL_GPL(iommu_free_table);
 
 /* Creates TCEs for a user provided buffer.  The user buffer must be
  * contiguous real kernel storage (not vmalloc).  The address passed here
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index 59c7e7d..74ab8382 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1394,7 +1394,6 @@ static void pnv_pci_ioda2_release_dma_pe(struct pci_dev 
*dev, struct pnv_ioda_pe
iommu_group_put(pe->table_group.group);
BUG_ON(pe->table_group.group);
}
-   pnv_pci_ioda2_table_free_pages(tbl);
iommu_free_table(tbl, of_node_full_name(dev->dev.of_node));
 }
 
@@ -1987,7 +1986,6 @@ static void pnv_ioda2_tce_free(struct iommu_table *tbl, 
long index,
 static void pnv_ioda2_table_free(struct iommu_table *tbl)
 {
pnv_pci_ioda2_table_free_pages(tbl);
-   iommu_free_table(tbl, "pnv");
 }
 
 static struct iommu_table_ops pnv_ioda2_iommu_ops = {
@@ -2313,7 +2311,7 @@ static long pnv_pci_ioda2_setup_default_config(struct 
pnv_ioda_pe *pe)
if (rc) {
pe_err(pe, "Failed to configure 32-bit TCE table, err %ld\n",
rc);
-   pnv_ioda2_table_free(tbl);
+   iommu_free_table(tbl, "");
return rc;
}
 
@@ -2399,7 +2397,7 @@ static void pnv_ioda2_take_ownership(struct 
iommu_table_group *table_group)
 
pnv_pci_ioda2_set_bypass(pe, false);
pnv_pci_ioda2_unset_window(&pe->table_group, 0);
-   pnv_ioda2_table_free(tbl);
+   iommu_free_table(tbl, "pnv");
 }
 
 static void pnv_ioda2_release_ownership(struct iommu_table_group *table_group)
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
b/drivers/vfio/vfio_iommu_spapr_tce.c
index 40e71a0..79f26c7 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -660,7 +660,7 @@ static void tce_iommu_free_table(struct iommu_table *tbl)
unsigned long pages = tbl->it_allocated_size >> PAGE_SHIFT;
 
tce_iommu_userspace_view_free(tbl);
-   tbl->it_ops->free(tbl);
+   iommu_free_table(tbl, "");
decrement_locked_vm(pages);
 }
 
-- 
2.5.0.rc3

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH kernel 15/15] KVM: PPC: Add in-kernel acceleration for VFIO

This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
and H_STUFF_TCE requests targeted an IOMMU TCE table used for VFIO
without passing them to user space which saves time on switching
to user space and back.

Both real and virtual modes are supported. The kernel tries to
handle a TCE request in the real mode, if fails it passes the request
to the virtual mode to complete the operation. If it a virtual mode
handler fails, the request is passed to user space; this is not expected
to happen ever though.

The first user of this is VFIO on POWER. Trampolines to the VFIO external
user API functions are required for this patch.

This adds ioctl() interface to SPAPR TCE fd which already handles
in-kernel acceleration for emulated IO by allocating the guest view of
the TCE table in KVM. New ioctls allows the userspace to attach/detach
VFIO containers to the kernel-allocated TCE table and handle
the hardware TCE table updates in the kernel. The new interface
accepts VFIO container fd and uses exported API to get to the actual
hardware TCE table. Until _unset() ioctl is called, the VFIO container
is referenced to guarantee the TCE table presense in the memory.

This also releases unused containers when new container is registered.
The criteria of "unused" is vfio_container_get_iommu_data_ext()
returning NULL which happens when the container fd is closed.

Note that this interface does not operate with IOMMU groups as
TCE tables are owned by VFIO containers (and even may have no IOMMU groups
attached).

This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
space.

Tests show that this patch increases transmission speed from 220MB/s
to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card).

Signed-off-by: Alexey Kardashevskiy 
---
 arch/powerpc/include/asm/kvm_host.h |   8 +
 arch/powerpc/include/uapi/asm/kvm.h |  12 ++
 arch/powerpc/kvm/book3s_64_vio.c| 403 
 arch/powerpc/kvm/book3s_64_vio_hv.c | 173 
 arch/powerpc/kvm/powerpc.c  |   2 +
 5 files changed, 598 insertions(+)

diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index ec35af3..3e3d65f 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -182,6 +182,13 @@ struct kvmppc_pginfo {
atomic_t refcnt;
 };
 
+struct kvmppc_spapr_tce_container {
+   struct list_head next;
+   struct rcu_head rcu;
+   struct vfio_container *vfiocontainer;
+   struct iommu_table *tbl;
+};
+
 struct kvmppc_spapr_tce_table {
struct list_head list;
struct kvm *kvm;
@@ -190,6 +197,7 @@ struct kvmppc_spapr_tce_table {
u32 page_shift;
u64 offset; /* in pages */
u64 size;   /* window size in pages */
+   struct list_head containers;
struct page *pages[0];
 };
 
diff --git a/arch/powerpc/include/uapi/asm/kvm.h 
b/arch/powerpc/include/uapi/asm/kvm.h
index c93cf35..cbeb7bb 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -342,6 +342,18 @@ struct kvm_create_spapr_tce_64 {
__u64 size; /* in pages */
 };
 
+#define KVM_SPAPR_TCE  (':')
+#define KVM_SPAPR_TCE_VFIO_SET _IOW(KVM_SPAPR_TCE,  0x00, \
+struct kvm_spapr_tce_vfio)
+#define KVM_SPAPR_TCE_VFIO_UNSET   _IOW(KVM_SPAPR_TCE,  0x01, \
+struct kvm_spapr_tce_vfio)
+
+struct kvm_spapr_tce_vfio {
+   __u32 argsz;
+   __u32 flags;
+   __u32 container_fd;
+};
+
 /* for KVM_ALLOCATE_RMA */
 struct kvm_allocate_rma {
__u64 rma_size;
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index 15df8ae..d420ee0 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -27,6 +27,10 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
+#include 
 
 #include 
 #include 
@@ -39,6 +43,70 @@
 #include 
 #include 
 #include 
+#include 
+
+static struct iommu_table *kvm_vfio_container_spapr_tce_table_get_ext(
+   void *iommu_data, u64 offset)
+{
+   struct iommu_table *tbl;
+   struct iommu_table *(*fn)(void *, u64);
+
+   fn = symbol_get(vfio_container_spapr_tce_table_get_ext);
+   if (!fn)
+   return NULL;
+
+   tbl = fn(iommu_data, offset);
+
+   symbol_put(vfio_container_spapr_tce_table_get_ext);
+
+   return tbl;
+}
+
+static struct vfio_container *kvm_vfio_container_get_ext(struct file *filep)
+{
+   struct vfio_container *container;
+   struct vfio_container *(*fn)(struct file *);
+
+   fn = symbol_get(vfio_container_get_ext);
+   if (!fn)
+   return NULL;
+
+   container = fn(filep);
+
+   symbol_put(vfio_container_get_ext);
+
+   return container;
+}
+
+static void kvm_vfio_container_put_ext(struct vfio_container *container)

[PATCH kernel 14/15] vfio/spapr_tce: Export container API for external users

This exports helpers which are needed to keep a VFIO container in
memory while there are external users such as KVM.

Signed-off-by: Alexey Kardashevskiy 
---
 drivers/vfio/vfio.c | 30 ++
 drivers/vfio/vfio_iommu_spapr_tce.c | 16 +++-
 include/linux/vfio.h|  6 ++
 3 files changed, 51 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index d1d70e0..baf6a9c 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -1729,6 +1729,36 @@ long vfio_external_check_extension(struct vfio_group 
*group, unsigned long arg)
 EXPORT_SYMBOL_GPL(vfio_external_check_extension);
 
 /**
+ * External user API for containers, exported by symbols to be linked
+ * dynamically.
+ *
+ */
+struct vfio_container *vfio_container_get_ext(struct file *filep)
+{
+   struct vfio_container *container = filep->private_data;
+
+   if (filep->f_op != &vfio_fops)
+   return ERR_PTR(-EINVAL);
+
+   vfio_container_get(container);
+
+   return container;
+}
+EXPORT_SYMBOL_GPL(vfio_container_get_ext);
+
+void vfio_container_put_ext(struct vfio_container *container)
+{
+   vfio_container_put(container);
+}
+EXPORT_SYMBOL_GPL(vfio_container_put_ext);
+
+void *vfio_container_get_iommu_data_ext(struct vfio_container *container)
+{
+   return container->iommu_data;
+}
+EXPORT_SYMBOL_GPL(vfio_container_get_iommu_data_ext);
+
+/**
  * Sub-module support
  */
 /*
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
b/drivers/vfio/vfio_iommu_spapr_tce.c
index 3594ad3..fceea3d 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -1331,6 +1331,21 @@ const struct vfio_iommu_driver_ops tce_iommu_driver_ops 
= {
.detach_group   = tce_iommu_detach_group,
 };
 
+struct iommu_table *vfio_container_spapr_tce_table_get_ext(void *iommu_data,
+   u64 offset)
+{
+   struct tce_container *container = iommu_data;
+   struct iommu_table *tbl = NULL;
+
+   if (tce_iommu_find_table(container, offset, &tbl) < 0)
+   return NULL;
+
+   iommu_table_get(tbl);
+
+   return tbl;
+}
+EXPORT_SYMBOL_GPL(vfio_container_spapr_tce_table_get_ext);
+
 static int __init tce_iommu_init(void)
 {
return vfio_register_iommu_driver(&tce_iommu_driver_ops);
@@ -1348,4 +1363,3 @@ MODULE_VERSION(DRIVER_VERSION);
 MODULE_LICENSE("GPL v2");
 MODULE_AUTHOR(DRIVER_AUTHOR);
 MODULE_DESCRIPTION(DRIVER_DESC);
-
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 0ecae0b..1c2138a 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -91,6 +91,12 @@ extern void vfio_group_put_external_user(struct vfio_group 
*group);
 extern int vfio_external_user_iommu_id(struct vfio_group *group);
 extern long vfio_external_check_extension(struct vfio_group *group,
  unsigned long arg);
+extern struct vfio_container *vfio_container_get_ext(struct file *filep);
+extern void vfio_container_put_ext(struct vfio_container *container);
+extern void *vfio_container_get_iommu_data_ext(
+   struct vfio_container *container);
+extern struct iommu_table *vfio_container_spapr_tce_table_get_ext(
+   void *iommu_data, u64 offset);
 
 /*
  * Sub-module helpers
-- 
2.5.0.rc3

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH kernel 02/15] KVM: PPC: Finish enabling VFIO KVM device on POWER

178a787502 "vfio: Enable VFIO device for powerpc" made an attempt to
enable VFIO KVM device on POWER.

However as CONFIG_KVM_BOOK3S_64 does not use "common-objs-y",
VFIO KVM device was not enabled for Book3s KVM, this adds VFIO to
the kvm-book3s_64-objs-y list.

While we are here, enforce KVM_VFIO on KVM_BOOK3S as other platforms
already do.

Signed-off-by: Alexey Kardashevskiy 
---
 arch/powerpc/kvm/Kconfig  | 1 +
 arch/powerpc/kvm/Makefile | 3 +++
 2 files changed, 4 insertions(+)

diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
index c2024ac..b7c494b 100644
--- a/arch/powerpc/kvm/Kconfig
+++ b/arch/powerpc/kvm/Kconfig
@@ -64,6 +64,7 @@ config KVM_BOOK3S_64
select KVM_BOOK3S_64_HANDLER
select KVM
select KVM_BOOK3S_PR_POSSIBLE if !KVM_BOOK3S_HV_POSSIBLE
+   select KVM_VFIO if VFIO
---help---
  Support running unmodified book3s_64 and book3s_32 guest kernels
  in virtual machines on book3s_64 host processors.
diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile
index 1f9e552..8907af9 100644
--- a/arch/powerpc/kvm/Makefile
+++ b/arch/powerpc/kvm/Makefile
@@ -88,6 +88,9 @@ endif
 kvm-book3s_64-objs-$(CONFIG_KVM_XICS) += \
book3s_xics.o
 
+kvm-book3s_64-objs-$(CONFIG_KVM_VFIO) += \
+   $(KVM)/vfio.o
+
 kvm-book3s_64-module-objs += \
$(KVM)/kvm_main.o \
$(KVM)/eventfd.o \
-- 
2.5.0.rc3

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH kernel 13/15] KVM: PPC: Pass kvm* to kvmppc_find_table()

The guest view TCE tables are per KVM anyway (not per VCPU) so pass kvm*
there. This will be used in the following patches where we will be
attaching VFIO containers to LIOBNs via ioctl() to KVM (rather than
to VCPU).

Signed-off-by: Alexey Kardashevskiy 
---
 arch/powerpc/include/asm/kvm_ppc.h  |  2 +-
 arch/powerpc/kvm/book3s_64_vio.c|  7 ---
 arch/powerpc/kvm/book3s_64_vio_hv.c | 13 +++--
 3 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
b/arch/powerpc/include/asm/kvm_ppc.h
index 2544eda..7f1abe9 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -167,7 +167,7 @@ extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
 extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
struct kvm_create_spapr_tce_64 *args);
 extern struct kvmppc_spapr_tce_table *kvmppc_find_table(
-   struct kvm_vcpu *vcpu, unsigned long liobn);
+   struct kvm *kvm, unsigned long liobn);
 extern long kvmppc_ioba_validate(struct kvmppc_spapr_tce_table *stt,
unsigned long ioba, unsigned long npages);
 extern long kvmppc_tce_validate(struct kvmppc_spapr_tce_table *tt,
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index c379ff5..15df8ae 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -212,12 +212,13 @@ fail:
 long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
  unsigned long ioba, unsigned long tce)
 {
-   struct kvmppc_spapr_tce_table *stt = kvmppc_find_table(vcpu, liobn);
+   struct kvmppc_spapr_tce_table *stt;
long ret;
 
/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
/*  liobn, ioba, tce); */
 
+   stt = kvmppc_find_table(vcpu->kvm, liobn);
if (!stt)
return H_TOO_HARD;
 
@@ -245,7 +246,7 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
u64 __user *tces;
u64 tce;
 
-   stt = kvmppc_find_table(vcpu, liobn);
+   stt = kvmppc_find_table(vcpu->kvm, liobn);
if (!stt)
return H_TOO_HARD;
 
@@ -299,7 +300,7 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
struct kvmppc_spapr_tce_table *stt;
long i, ret;
 
-   stt = kvmppc_find_table(vcpu, liobn);
+   stt = kvmppc_find_table(vcpu->kvm, liobn);
if (!stt)
return H_TOO_HARD;
 
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c 
b/arch/powerpc/kvm/book3s_64_vio_hv.c
index a3be4bd..8a6834e 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -49,10 +49,9 @@
  * WARNING: This will be called in real or virtual mode on HV KVM and virtual
  *  mode on PR KVM
  */
-struct kvmppc_spapr_tce_table *kvmppc_find_table(struct kvm_vcpu *vcpu,
+struct kvmppc_spapr_tce_table *kvmppc_find_table(struct kvm *kvm,
unsigned long liobn)
 {
-   struct kvm *kvm = vcpu->kvm;
struct kvmppc_spapr_tce_table *stt;
 
list_for_each_entry_lockless(stt, &kvm->arch.spapr_tce_tables, list)
@@ -194,12 +193,13 @@ static struct mm_iommu_table_group_mem_t 
*kvmppc_rm_iommu_lookup(
 long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
unsigned long ioba, unsigned long tce)
 {
-   struct kvmppc_spapr_tce_table *stt = kvmppc_find_table(vcpu, liobn);
+   struct kvmppc_spapr_tce_table *stt;
long ret;
 
/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
/*  liobn, ioba, tce); */
 
+   stt = kvmppc_find_table(vcpu->kvm, liobn);
if (!stt)
return H_TOO_HARD;
 
@@ -252,7 +252,7 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
unsigned long tces, entry, ua = 0;
unsigned long *rmap = NULL;
 
-   stt = kvmppc_find_table(vcpu, liobn);
+   stt = kvmppc_find_table(vcpu->kvm, liobn);
if (!stt)
return H_TOO_HARD;
 
@@ -335,7 +335,7 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
struct kvmppc_spapr_tce_table *stt;
long i, ret;
 
-   stt = kvmppc_find_table(vcpu, liobn);
+   stt = kvmppc_find_table(vcpu->kvm, liobn);
if (!stt)
return H_TOO_HARD;
 
@@ -356,12 +356,13 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
 long kvmppc_h_get_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
  unsigned long ioba)
 {
-   struct kvmppc_spapr_tce_table *stt = kvmppc_find_table(vcpu, liobn);
+   struct kvmppc_spapr_tce_table *stt;
long ret;
unsigned long idx;
struct page *page;
u64 *tbl;
 
+   stt = kvmppc_find_table(vcpu->kvm, liobn);
if (!stt)
return H_TOO_HARD;
 
-- 
2.5.0.rc3

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
ht

[PATCH kernel 01/15] Revert "iommu: Add a function to find an iommu group by id"

This reverts commit aa16bea929ae
("iommu: Add a function to find an iommu group by id")
as the iommu_group_get_by_id() helper has never been used
and it is unlikely it will in foreseeable future. Dead code
is broken code.

Signed-off-by: Alexey Kardashevskiy 
---
 drivers/iommu/iommu.c | 29 -
 include/linux/iommu.h |  1 -
 2 files changed, 30 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index b06d935..d2f5efe 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -217,35 +217,6 @@ struct iommu_group *iommu_group_alloc(void)
 }
 EXPORT_SYMBOL_GPL(iommu_group_alloc);
 
-struct iommu_group *iommu_group_get_by_id(int id)
-{
-   struct kobject *group_kobj;
-   struct iommu_group *group;
-   const char *name;
-
-   if (!iommu_group_kset)
-   return NULL;
-
-   name = kasprintf(GFP_KERNEL, "%d", id);
-   if (!name)
-   return NULL;
-
-   group_kobj = kset_find_obj(iommu_group_kset, name);
-   kfree(name);
-
-   if (!group_kobj)
-   return NULL;
-
-   group = container_of(group_kobj, struct iommu_group, kobj);
-   BUG_ON(group->id != id);
-
-   kobject_get(group->devices_kobj);
-   kobject_put(&group->kobj);
-
-   return group;
-}
-EXPORT_SYMBOL_GPL(iommu_group_get_by_id);
-
 /**
  * iommu_group_get_iommudata - retrieve iommu_data registered for a group
  * @group: the group
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index a35fb8b..93c69fa 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -215,7 +215,6 @@ extern int bus_set_iommu(struct bus_type *bus, const struct 
iommu_ops *ops);
 extern bool iommu_present(struct bus_type *bus);
 extern bool iommu_capable(struct bus_type *bus, enum iommu_cap cap);
 extern struct iommu_domain *iommu_domain_alloc(struct bus_type *bus);
-extern struct iommu_group *iommu_group_get_by_id(int id);
 extern void iommu_domain_free(struct iommu_domain *domain);
 extern int iommu_attach_device(struct iommu_domain *domain,
   struct device *dev);
-- 
2.5.0.rc3

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH kernel 12/15] KVM: PPC: Enable IOMMU_API for KVM_BOOK3S_64 permanently

It does not make much sense to have KVM in book3s-64 and
not to have IOMMU bits for PCI pass through support as it costs little
and allows VFIO to function on book3s KVM.

Having IOMMU_API always enabled makes it unnecessary to have a lot of
"#ifdef IOMMU_API" in arch/powerpc/kvm/book3s_64_vio*. With those
ifdef's we could have only user space emulated devices accelerated
(but not VFIO) which do not seem to be very useful.

Signed-off-by: Alexey Kardashevskiy 
---
 arch/powerpc/kvm/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
index b7c494b..63b60a8 100644
--- a/arch/powerpc/kvm/Kconfig
+++ b/arch/powerpc/kvm/Kconfig
@@ -65,6 +65,7 @@ config KVM_BOOK3S_64
select KVM
select KVM_BOOK3S_PR_POSSIBLE if !KVM_BOOK3S_HV_POSSIBLE
select KVM_VFIO if VFIO
+   select SPAPR_TCE_IOMMU if IOMMU_SUPPORT
---help---
  Support running unmodified book3s_64 and book3s_32 guest kernels
  in virtual machines on book3s_64 host processors.
-- 
2.5.0.rc3

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH kernel 00/15] powerpc/kvm/vfio: Enable in-kernel acceleration


This is my current queue of patches to add acceleration of TCE
updates in KVM. This has a long history and was rewritten pretty
much completely again, this time I am teaching KVM about VFIO
containers. Some patches (such as 01/15) could be posted
separately but I keep all of them here to make review easier
(if the concept turns out be wrong - then I might still want
to have 01/15).

Please comment. Thanks.


Alexey Kardashevskiy (15):
  Revert "iommu: Add a function to find an iommu group by id"
  KVM: PPC: Finish enabling VFIO KVM device on POWER
  KVM: PPC: Reserve KVM_CAP_SPAPR_TCE_VFIO capability number
  powerpc/powernv/ioda: Fix TCE invalidate to work in real mode again
  powerpc/iommu: Stop using @current in mm_iommu_xxx
  powerpc/mm/iommu: Put pages on process exit
  powerpc/iommu: Cleanup iommu_table disposal
  powerpc/vfio_spapr_tce: Add reference counting to iommu_table
  powerpc/mmu: Add real mode support for IOMMU preregistered memory
  KVM: PPC: Use preregistered memory API to access TCE list
  powerpc/powernv/iommu: Add real mode version of
iommu_table_ops::exchange()
  KVM: PPC: Enable IOMMU_API for KVM_BOOK3S_64 permanently
  KVM: PPC: Pass kvm* to kvmppc_find_table()
  vfio/spapr_tce: Export container API for external users
  KVM: PPC: Add in-kernel acceleration for VFIO

 arch/powerpc/include/asm/iommu.h  |  12 +-
 arch/powerpc/include/asm/kvm_host.h   |   8 +
 arch/powerpc/include/asm/kvm_ppc.h|   2 +-
 arch/powerpc/include/asm/mmu_context.h|  23 +-
 arch/powerpc/include/uapi/asm/kvm.h   |  12 +
 arch/powerpc/kernel/iommu.c   |  49 +++-
 arch/powerpc/kernel/setup-common.c|   2 +-
 arch/powerpc/kernel/vio.c |   2 +-
 arch/powerpc/kvm/Kconfig  |   2 +
 arch/powerpc/kvm/Makefile |   3 +
 arch/powerpc/kvm/book3s_64_vio.c  | 410 +-
 arch/powerpc/kvm/book3s_64_vio_hv.c   | 251 --
 arch/powerpc/kvm/powerpc.c|   2 +
 arch/powerpc/mm/mmu_context_book3s64.c|   6 +-
 arch/powerpc/mm/mmu_context_iommu.c   |  96 ---
 arch/powerpc/platforms/powernv/pci-ioda.c |  46 +++-
 arch/powerpc/platforms/powernv/pci.c  |   1 +
 arch/powerpc/platforms/pseries/iommu.c|   3 +-
 drivers/iommu/iommu.c |  29 ---
 drivers/vfio/vfio.c   |  30 +++
 drivers/vfio/vfio_iommu_spapr_tce.c   | 107 ++--
 include/linux/iommu.h |   1 -
 include/linux/vfio.h  |   6 +
 include/uapi/linux/kvm.h  |   1 +
 24 files changed, 959 insertions(+), 145 deletions(-)

-- 
2.5.0.rc3

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH kernel 11/15] powerpc/powernv/iommu: Add real mode version of iommu_table_ops::exchange()

In real mode, TCE tables are invalidated using special
cache-inhibited store instructions which are not available in
virtual mode

This defines and implements exchange_rm() callback. This does not
define set_rm/clear_rm/flush_rm callbacks as there is no user for those -
exchange/exchange_rm are only to be used by KVM for VFIO.

The exchange_rm callback is defined for IODA1/IODA2 powernv platforms.

This replaces list_for_each_entry_rcu with its lockless version as
from now on pnv_pci_ioda2_tce_invalidate() can be called in
the real mode too.

Signed-off-by: Alexey Kardashevskiy 
---
 arch/powerpc/include/asm/iommu.h  |  7 +++
 arch/powerpc/kernel/iommu.c   | 23 +++
 arch/powerpc/platforms/powernv/pci-ioda.c | 26 +-
 3 files changed, 55 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index cd4df44..a13d207 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -64,6 +64,11 @@ struct iommu_table_ops {
long index,
unsigned long *hpa,
enum dma_data_direction *direction);
+   /* Real mode */
+   int (*exchange_rm)(struct iommu_table *tbl,
+   long index,
+   unsigned long *hpa,
+   enum dma_data_direction *direction);
 #endif
void (*clear)(struct iommu_table *tbl,
long index, long npages);
@@ -209,6 +214,8 @@ extern void iommu_del_device(struct device *dev);
 extern int __init tce_iommu_bus_notifier_init(void);
 extern long iommu_tce_xchg(struct iommu_table *tbl, unsigned long entry,
unsigned long *hpa, enum dma_data_direction *direction);
+extern long iommu_tce_xchg_rm(struct iommu_table *tbl, unsigned long entry,
+   unsigned long *hpa, enum dma_data_direction *direction);
 #else
 static inline void iommu_register_group(struct iommu_table_group *table_group,
int pci_domain_number,
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index a8f017a..65b2dac 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1020,6 +1020,29 @@ long iommu_tce_xchg(struct iommu_table *tbl, unsigned 
long entry,
 }
 EXPORT_SYMBOL_GPL(iommu_tce_xchg);
 
+long iommu_tce_xchg_rm(struct iommu_table *tbl, unsigned long entry,
+   unsigned long *hpa, enum dma_data_direction *direction)
+{
+   long ret;
+
+   ret = tbl->it_ops->exchange_rm(tbl, entry, hpa, direction);
+
+   if (!ret && ((*direction == DMA_FROM_DEVICE) ||
+   (*direction == DMA_BIDIRECTIONAL))) {
+   struct page *pg = realmode_pfn_to_page(*hpa >> PAGE_SHIFT);
+
+   if (likely(pg)) {
+   SetPageDirty(pg);
+   } else {
+   tbl->it_ops->exchange_rm(tbl, entry, hpa, direction);
+   ret = -EFAULT;
+   }
+   }
+
+   return ret;
+}
+EXPORT_SYMBOL_GPL(iommu_tce_xchg_rm);
+
 int iommu_take_ownership(struct iommu_table *tbl)
 {
unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index c04afd2..a0b5ea6 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1827,6 +1827,17 @@ static int pnv_ioda1_tce_xchg(struct iommu_table *tbl, 
long index,
 
return ret;
 }
+
+static int pnv_ioda1_tce_xchg_rm(struct iommu_table *tbl, long index,
+   unsigned long *hpa, enum dma_data_direction *direction)
+{
+   long ret = pnv_tce_xchg(tbl, index, hpa, direction);
+
+   if (!ret)
+   pnv_pci_p7ioc_tce_invalidate(tbl, index, 1, true);
+
+   return ret;
+}
 #endif
 
 static void pnv_ioda1_tce_free(struct iommu_table *tbl, long index,
@@ -1841,6 +1852,7 @@ static struct iommu_table_ops pnv_ioda1_iommu_ops = {
.set = pnv_ioda1_tce_build,
 #ifdef CONFIG_IOMMU_API
.exchange = pnv_ioda1_tce_xchg,
+   .exchange_rm = pnv_ioda1_tce_xchg_rm,
 #endif
.clear = pnv_ioda1_tce_free,
.get = pnv_tce_get,
@@ -1915,7 +1927,7 @@ static void pnv_pci_ioda2_tce_invalidate(struct 
iommu_table *tbl,
 {
struct iommu_table_group_link *tgl;
 
-   list_for_each_entry_rcu(tgl, &tbl->it_group_list, next) {
+   list_for_each_entry_lockless(tgl, &tbl->it_group_list, next) {
struct pnv_ioda_pe *pe = container_of(tgl->table_group,
struct pnv_ioda_pe, table_group);
struct pnv_phb *phb = pe->phb;
@@ -1973,6 +1985,17 @@ static int pnv_ioda2_tce_xchg(struct iommu_table *tbl, 
long index,
 
return ret;
 }
+
+static int pnv_ioda2_tce_xchg_rm(struct iommu_table *tbl, long index,
+   unsigned long *hpa, enum d

[PATCH kernel 10/15] KVM: PPC: Use preregistered memory API to access TCE list

VFIO on sPAPR already implements guest memory pre-registration
when the entire guest RAM gets pinned. This can be used to translate
the physical address of a guest page containing the TCE list
from H_PUT_TCE_INDIRECT.

This makes use of the pre-registrered memory API to access TCE list
pages in order to avoid unnecessary locking on the KVM memory
reverse map as we know that all of guest memory is pinned and
we have a flat array mapping GPA to HPA which makes it simpler and
quicker to index into that array (even with looking up the
kernel page tables in vmalloc_to_phys) than it is to find the memslot,
lock the rmap entry, look up the user page tables, and unlock the rmap
entry. Note that the rmap pointer is initialized to NULL where declared
(not in this patch).

Signed-off-by: Alexey Kardashevskiy 
---
Changes:
v2:
* updated the commit log with Paul's comment
---
 arch/powerpc/kvm/book3s_64_vio_hv.c | 65 -
 1 file changed, 49 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c 
b/arch/powerpc/kvm/book3s_64_vio_hv.c
index d461c44..a3be4bd 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -180,6 +180,17 @@ long kvmppc_gpa_to_ua(struct kvm *kvm, unsigned long gpa,
 EXPORT_SYMBOL_GPL(kvmppc_gpa_to_ua);
 
 #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
+static inline bool kvmppc_preregistered(struct kvm_vcpu *vcpu)
+{
+   return mm_iommu_preregistered(vcpu->kvm->mm);
+}
+
+static struct mm_iommu_table_group_mem_t *kvmppc_rm_iommu_lookup(
+   struct kvm_vcpu *vcpu, unsigned long ua, unsigned long size)
+{
+   return mm_iommu_lookup_rm(vcpu->kvm->mm, ua, size);
+}
+
 long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
unsigned long ioba, unsigned long tce)
 {
@@ -260,23 +271,44 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
if (ret != H_SUCCESS)
return ret;
 
-   if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, &rmap))
-   return H_TOO_HARD;
+   if (kvmppc_preregistered(vcpu)) {
+   /*
+* We get here if guest memory was pre-registered which
+* is normally VFIO case and gpa->hpa translation does not
+* depend on hpt.
+*/
+   struct mm_iommu_table_group_mem_t *mem;
 
-   rmap = (void *) vmalloc_to_phys(rmap);
+   if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, NULL))
+   return H_TOO_HARD;
 
-   /*
-* Synchronize with the MMU notifier callbacks in
-* book3s_64_mmu_hv.c (kvm_unmap_hva_hv etc.).
-* While we have the rmap lock, code running on other CPUs
-* cannot finish unmapping the host real page that backs
-* this guest real page, so we are OK to access the host
-* real page.
-*/
-   lock_rmap(rmap);
-   if (kvmppc_rm_ua_to_hpa(vcpu, ua, &tces)) {
-   ret = H_TOO_HARD;
-   goto unlock_exit;
+   mem = kvmppc_rm_iommu_lookup(vcpu, ua, IOMMU_PAGE_SIZE_4K);
+   if (!mem || mm_iommu_ua_to_hpa_rm(mem, ua, &tces))
+   return H_TOO_HARD;
+   } else {
+   /*
+* This is emulated devices case.
+* We do not require memory to be preregistered in this case
+* so lock rmap and do __find_linux_pte_or_hugepte().
+*/
+   if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, &rmap))
+   return H_TOO_HARD;
+
+   rmap = (void *) vmalloc_to_phys(rmap);
+
+   /*
+* Synchronize with the MMU notifier callbacks in
+* book3s_64_mmu_hv.c (kvm_unmap_hva_hv etc.).
+* While we have the rmap lock, code running on other CPUs
+* cannot finish unmapping the host real page that backs
+* this guest real page, so we are OK to access the host
+* real page.
+*/
+   lock_rmap(rmap);
+   if (kvmppc_rm_ua_to_hpa(vcpu, ua, &tces)) {
+   ret = H_TOO_HARD;
+   goto unlock_exit;
+   }
}
 
for (i = 0; i < npages; ++i) {
@@ -290,7 +322,8 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
}
 
 unlock_exit:
-   unlock_rmap(rmap);
+   if (rmap)
+   unlock_rmap(rmap);
 
return ret;
 }
-- 
2.5.0.rc3

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH kernel 06/15] powerpc/mm/iommu: Put pages on process exit

At the moment VFIO IOMMU SPAPR v2 driver pins all guest RAM pages when
the userspace starts using VFIO. When the userspace process finishes,
all the pinned pages need to be put; this is done as a part of
the userspace memory context (MM) destruction which happens on
the very last mmdrop().

This approach has a problem that a MM of the userspace process
may live longer than the userspace process itself as kernel threads
use userspace process MMs which was runnning on a CPU where
the kernel thread was scheduled to. If this happened, the MM remains
referenced until this exact kernel thread wakes up again
and releases the very last reference to the MM, on an idle system this
can take even hours.

This references and caches MM once per container and adds tracking
how many times each preregistered area was registered in
a specific container. This way we do not depend on @current pointing to
a valid task descriptor.

This changes the userspace interface to return EBUSY if memory is
already registered (mm_iommu_get() used to increment the counter);
however it should not have any practical effect as the only
userspace tool available now does register memory area once per
container anyway.

As tce_iommu_register_pages/tce_iommu_unregister_pages are called
under container->lock, this does not need additional locking.

Signed-off-by: Alexey Kardashevskiy 

# Conflicts:
#   arch/powerpc/include/asm/mmu_context.h
#   arch/powerpc/mm/mmu_context_book3s64.c
#   arch/powerpc/mm/mmu_context_iommu.c
---
 arch/powerpc/include/asm/mmu_context.h |  1 -
 arch/powerpc/mm/mmu_context_book3s64.c |  4 ---
 arch/powerpc/mm/mmu_context_iommu.c| 11 ---
 drivers/vfio/vfio_iommu_spapr_tce.c| 52 +-
 4 files changed, 51 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/include/asm/mmu_context.h 
b/arch/powerpc/include/asm/mmu_context.h
index b85cc7b..a4c4ed5 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -25,7 +25,6 @@ extern long mm_iommu_get(struct mm_struct *mm,
 extern long mm_iommu_put(struct mm_struct *mm,
struct mm_iommu_table_group_mem_t *mem);
 extern void mm_iommu_init(struct mm_struct *mm);
-extern void mm_iommu_cleanup(struct mm_struct *mm);
 extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup(struct mm_struct *mm,
unsigned long ua, unsigned long size);
 extern struct mm_iommu_table_group_mem_t *mm_iommu_find(struct mm_struct *mm,
diff --git a/arch/powerpc/mm/mmu_context_book3s64.c 
b/arch/powerpc/mm/mmu_context_book3s64.c
index ad82735..1a07969 100644
--- a/arch/powerpc/mm/mmu_context_book3s64.c
+++ b/arch/powerpc/mm/mmu_context_book3s64.c
@@ -159,10 +159,6 @@ static inline void destroy_pagetable_page(struct mm_struct 
*mm)
 
 void destroy_context(struct mm_struct *mm)
 {
-#ifdef CONFIG_SPAPR_TCE_IOMMU
-   mm_iommu_cleanup(mm);
-#endif
-
 #ifdef CONFIG_PPC_ICSWX
drop_cop(mm->context.acop, mm);
kfree(mm->context.cop_lockp);
diff --git a/arch/powerpc/mm/mmu_context_iommu.c 
b/arch/powerpc/mm/mmu_context_iommu.c
index ee6685b..10f01fe 100644
--- a/arch/powerpc/mm/mmu_context_iommu.c
+++ b/arch/powerpc/mm/mmu_context_iommu.c
@@ -293,14 +293,3 @@ void mm_iommu_init(struct mm_struct *mm)
 {
INIT_LIST_HEAD_RCU(&mm->context.iommu_group_mem_list);
 }
-
-void mm_iommu_cleanup(struct mm_struct *mm)
-{
-   struct mm_iommu_table_group_mem_t *mem, *tmp;
-
-   list_for_each_entry_safe(mem, tmp, &mm->context.iommu_group_mem_list,
-   next) {
-   list_del_rcu(&mem->next);
-   mm_iommu_do_free(mem);
-   }
-}
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
b/drivers/vfio/vfio_iommu_spapr_tce.c
index 9752e77..40e71a0 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -89,6 +89,15 @@ struct tce_iommu_group {
 };
 
 /*
+ * A container needs to remember which preregistered areas and how many times
+ * it has referenced to do proper cleanup at the userspace process exit.
+ */
+struct tce_iommu_prereg {
+   struct list_head next;
+   struct mm_iommu_table_group_mem_t *mem;
+};
+
+/*
  * The container descriptor supports only a single group per container.
  * Required by the API as the container is not supplied with the IOMMU group
  * at the moment of initialization.
@@ -101,12 +110,26 @@ struct tce_container {
struct mm_struct *mm;
struct iommu_table *tables[IOMMU_TABLE_GROUP_MAX_TABLES];
struct list_head group_list;
+   struct list_head prereg_list;
 };
 
+static long tce_iommu_prereg_free(struct tce_container *container,
+   struct tce_iommu_prereg *tcemem)
+{
+   long ret;
+
+   list_del(&tcemem->next);
+   ret = mm_iommu_put(container->mm, tcemem->mem);
+   kfree(tcemem);
+
+   return ret;
+}
+
 static long tce_iommu_unregister_pages(struct tce_container *container,
__u64 vaddr, __u64

[PATCH kernel 05/15] powerpc/iommu: Stop using @current in mm_iommu_xxx

In some situations the userspace memory context may live longer than
the userspace process itself so if we need to do proper memory context
cleanup, we better cache @mm and use it later when the process is gone
(@current or @current->mm are NULL).

This changes mm_iommu_xxx API to receive mm_struct instead of using one
from @current.

This is needed by the following patch to do proper cleanup in time.
This depends on "powerpc/powernv/ioda: Fix endianness when reading TCEs"
to do proper cleanup via tce_iommu_clear() patch.

To keep API consistent, this replaces mm_context_t with mm_struct;
we stick to mm_struct as mm_iommu_adjust_locked_vm() helper needs
access to &mm->mmap_sem.

This should cause no behavioral change.

Signed-off-by: Alexey Kardashevskiy 
---
 arch/powerpc/include/asm/mmu_context.h | 20 +++--
 arch/powerpc/kernel/setup-common.c |  2 +-
 arch/powerpc/mm/mmu_context_book3s64.c |  4 +--
 arch/powerpc/mm/mmu_context_iommu.c| 54 ++
 drivers/vfio/vfio_iommu_spapr_tce.c| 41 --
 5 files changed, 62 insertions(+), 59 deletions(-)

diff --git a/arch/powerpc/include/asm/mmu_context.h 
b/arch/powerpc/include/asm/mmu_context.h
index 9d2cd0c..b85cc7b 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -18,16 +18,18 @@ extern void destroy_context(struct mm_struct *mm);
 #ifdef CONFIG_SPAPR_TCE_IOMMU
 struct mm_iommu_table_group_mem_t;
 
-extern bool mm_iommu_preregistered(void);
-extern long mm_iommu_get(unsigned long ua, unsigned long entries,
+extern bool mm_iommu_preregistered(struct mm_struct *mm);
+extern long mm_iommu_get(struct mm_struct *mm,
+   unsigned long ua, unsigned long entries,
struct mm_iommu_table_group_mem_t **pmem);
-extern long mm_iommu_put(struct mm_iommu_table_group_mem_t *mem);
-extern void mm_iommu_init(mm_context_t *ctx);
-extern void mm_iommu_cleanup(mm_context_t *ctx);
-extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup(unsigned long ua,
-   unsigned long size);
-extern struct mm_iommu_table_group_mem_t *mm_iommu_find(unsigned long ua,
-   unsigned long entries);
+extern long mm_iommu_put(struct mm_struct *mm,
+   struct mm_iommu_table_group_mem_t *mem);
+extern void mm_iommu_init(struct mm_struct *mm);
+extern void mm_iommu_cleanup(struct mm_struct *mm);
+extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup(struct mm_struct *mm,
+   unsigned long ua, unsigned long size);
+extern struct mm_iommu_table_group_mem_t *mm_iommu_find(struct mm_struct *mm,
+   unsigned long ua, unsigned long entries);
 extern long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
unsigned long ua, unsigned long *hpa);
 extern long mm_iommu_mapped_inc(struct mm_iommu_table_group_mem_t *mem);
diff --git a/arch/powerpc/kernel/setup-common.c 
b/arch/powerpc/kernel/setup-common.c
index 714b4ba..e90b68a 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -905,7 +905,7 @@ void __init setup_arch(char **cmdline_p)
init_mm.context.pte_frag = NULL;
 #endif
 #ifdef CONFIG_SPAPR_TCE_IOMMU
-   mm_iommu_init(&init_mm.context);
+   mm_iommu_init(&init_mm);
 #endif
irqstack_early_init();
exc_lvl_early_init();
diff --git a/arch/powerpc/mm/mmu_context_book3s64.c 
b/arch/powerpc/mm/mmu_context_book3s64.c
index b114f8b..ad82735 100644
--- a/arch/powerpc/mm/mmu_context_book3s64.c
+++ b/arch/powerpc/mm/mmu_context_book3s64.c
@@ -115,7 +115,7 @@ int init_new_context(struct task_struct *tsk, struct 
mm_struct *mm)
mm->context.pte_frag = NULL;
 #endif
 #ifdef CONFIG_SPAPR_TCE_IOMMU
-   mm_iommu_init(&mm->context);
+   mm_iommu_init(mm);
 #endif
return 0;
 }
@@ -160,7 +160,7 @@ static inline void destroy_pagetable_page(struct mm_struct 
*mm)
 void destroy_context(struct mm_struct *mm)
 {
 #ifdef CONFIG_SPAPR_TCE_IOMMU
-   mm_iommu_cleanup(&mm->context);
+   mm_iommu_cleanup(mm);
 #endif
 
 #ifdef CONFIG_PPC_ICSWX
diff --git a/arch/powerpc/mm/mmu_context_iommu.c 
b/arch/powerpc/mm/mmu_context_iommu.c
index da6a216..ee6685b 100644
--- a/arch/powerpc/mm/mmu_context_iommu.c
+++ b/arch/powerpc/mm/mmu_context_iommu.c
@@ -53,7 +53,7 @@ static long mm_iommu_adjust_locked_vm(struct mm_struct *mm,
}
 
pr_debug("[%d] RLIMIT_MEMLOCK HASH64 %c%ld %ld/%ld\n",
-   current->pid,
+   current ? current->pid : 0,
incr ? '+' : '-',
npages << PAGE_SHIFT,
mm->locked_vm << PAGE_SHIFT,
@@ -63,28 +63,22 @@ static long mm_iommu_adjust_locked_vm(struct mm_struct *mm,
return ret;
 }
 
-bool mm_iommu_preregistered(void)
+bool mm_iommu_preregistered(struct mm_struct *mm)
 {
-   if (!current || !current->mm)
-   return false;
-
-   return !list_empty(¤t->mm-

[PATCH kernel 04/15] powerpc/powernv/ioda: Fix TCE invalidate to work in real mode again

"powerpc/powernv/pci: Rework accessing the TCE invalidate register"
broke TCE invalidation on IODA2/PHB3 for real mode.

This makes invalidate work again.

Fixes: fd141d1a99a3
Signed-off-by: Alexey Kardashevskiy 
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index 53b56c0..59c7e7d 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1877,7 +1877,7 @@ static void pnv_pci_phb3_tce_invalidate(struct 
pnv_ioda_pe *pe, bool rm,
unsigned shift, unsigned long index,
unsigned long npages)
 {
-   __be64 __iomem *invalidate = pnv_ioda_get_inval_reg(pe->phb, false);
+   __be64 __iomem *invalidate = pnv_ioda_get_inval_reg(pe->phb, rm);
unsigned long start, end, inc;
 
/* We'll invalidate DMA address in PE scope */
@@ -1935,10 +1935,12 @@ static void pnv_pci_ioda2_tce_invalidate(struct 
iommu_table *tbl,
pnv_pci_phb3_tce_invalidate(pe, rm, shift,
index, npages);
else if (rm)
+   {
opal_rm_pci_tce_kill(phb->opal_id,
 OPAL_PCI_TCE_KILL_PAGES,
 pe->pe_number, 1u << shift,
 index << shift, npages);
+   }
else
opal_pci_tce_kill(phb->opal_id,
  OPAL_PCI_TCE_KILL_PAGES,
-- 
2.5.0.rc3

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH kernel 03/15] KVM: PPC: Reserve KVM_CAP_SPAPR_TCE_VFIO capability number

This adds a capability number for in-kernel support for VFIO on
SPAPR platform.

The capability will tell the user space whether in-kernel handlers of
H_PUT_TCE can handle VFIO-targeted requests or not. If not, the user space
must not attempt allocating a TCE table in the host kernel via
the KVM_CREATE_SPAPR_TCE KVM ioctl because in that case TCE requests
will not be passed to the user space which is desired action in
the situation like that.

Signed-off-by: Alexey Kardashevskiy 
Reviewed-by: David Gibson 
---
 include/uapi/linux/kvm.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index e98bb4c..3b4b723 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -870,6 +870,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_S390_USER_INSTR0 130
 #define KVM_CAP_MSI_DEVID 131
 #define KVM_CAP_PPC_HTM 132
+#define KVM_CAP_SPAPR_TCE_VFIO 133
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
-- 
2.5.0.rc3

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[RESENT PATCH v5 1/2] tools/perf: Fix the mask in regs_dump__printf and print_sample_iregs

2016-08-03 Thread Madhavan Srinivasan

When decoding the perf_regs mask in regs_dump__printf(),
we loop through the mask using find_first_bit and find_next_bit functions.
"mask" is of type "u64", but sent as a "unsigned long *" to
lib functions along with sizeof().

While the exisitng code works fine in most of the case,
the logic is broken when using a 32bit perf on a 64bit kernel (Big Endian).
When reading u64 using (u32 *)(&val)[0], perf (lib/find_*_bit()) assumes it gets
lower 32bits of u64 which is wrong. Proposed fix is to swap the words
of the u64 to handle this case. This is _not_ endianess swap.

Suggested-by: Yury Norov 
Reviewed-by: Yury Norov 
Acked-by: Jiri Olsa 
Cc: Yury Norov 
Cc: Peter Zijlstra 
Cc: Ingo Molnar 
Cc: Arnaldo Carvalho de Melo 
Cc: Alexander Shishkin 
Cc: Jiri Olsa 
Cc: Adrian Hunter 
Cc: Kan Liang 
Cc: Wang Nan 
Cc: Michael Ellerman 
Signed-off-by: Madhavan Srinivasan 
---
 tools/include/linux/bitmap.h |  2 ++
 tools/lib/bitmap.c   | 18 ++
 tools/perf/builtin-script.c  |  4 +++-
 tools/perf/util/session.c|  4 +++-
 4 files changed, 26 insertions(+), 2 deletions(-)

diff --git a/tools/include/linux/bitmap.h b/tools/include/linux/bitmap.h
index 28f5493da491..5e98525387dc 100644
--- a/tools/include/linux/bitmap.h
+++ b/tools/include/linux/bitmap.h
@@ -2,6 +2,7 @@
 #define _PERF_BITOPS_H
 
 #include 
+#include 
 #include 
 
 #define DECLARE_BITMAP(name,bits) \
@@ -10,6 +11,7 @@
 int __bitmap_weight(const unsigned long *bitmap, int bits);
 void __bitmap_or(unsigned long *dst, const unsigned long *bitmap1,
 const unsigned long *bitmap2, int bits);
+void bitmap_from_u64(unsigned long *dst, u64 mask);
 
 #define BITMAP_FIRST_WORD_MASK(start) (~0UL << ((start) & (BITS_PER_LONG - 1)))
 
diff --git a/tools/lib/bitmap.c b/tools/lib/bitmap.c
index 0a1adcfd..464a0cc63e6a 100644
--- a/tools/lib/bitmap.c
+++ b/tools/lib/bitmap.c
@@ -29,3 +29,21 @@ void __bitmap_or(unsigned long *dst, const unsigned long 
*bitmap1,
for (k = 0; k < nr; k++)
dst[k] = bitmap1[k] | bitmap2[k];
 }
+
+/*
+ * bitmap_from_u64 - Check and swap words within u64.
+ *  @mask: source bitmap
+ *  @dst:  destination bitmap
+ *
+ * In 32 bit big endian userspace on a 64bit kernel, 'unsigned long' is 32 
bits.
+ * When reading u64 using (u32 *)(&val)[0] and (u32 *)(&val)[1],
+ * we will get wrong value for the mask. That is "(u32 *)(&val)[0]"
+ * gets upper 32 bits of u64, but perf may expect lower 32bits of u64.
+ */
+void bitmap_from_u64(unsigned long *dst, u64 mask)
+{
+   dst[0] = mask & ULONG_MAX;
+
+   if (sizeof(mask) > sizeof(unsigned long))
+   dst[1] = mask >> 32;
+}
diff --git a/tools/perf/builtin-script.c b/tools/perf/builtin-script.c
index 971ff91b16cb..20d7988a1636 100644
--- a/tools/perf/builtin-script.c
+++ b/tools/perf/builtin-script.c
@@ -418,11 +418,13 @@ static void print_sample_iregs(struct perf_sample *sample,
struct regs_dump *regs = &sample->intr_regs;
uint64_t mask = attr->sample_regs_intr;
unsigned i = 0, r;
+   DECLARE_BITMAP(_mask, 64);
 
if (!regs)
return;
 
-   for_each_set_bit(r, (unsigned long *) &mask, sizeof(mask) * 8) {
+   bitmap_from_u64(_mask, mask);
+   for_each_set_bit(r, _mask, sizeof(mask) * 8) {
u64 val = regs->regs[i++];
printf("%5s:0x%"PRIx64" ", perf_reg_name(r), val);
}
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index 5d61242a6e64..440a9fb2a6fb 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -944,8 +944,10 @@ static void branch_stack__printf(struct perf_sample 
*sample)
 static void regs_dump__printf(u64 mask, u64 *regs)
 {
unsigned rid, i = 0;
+   DECLARE_BITMAP(_mask, 64);
 
-   for_each_set_bit(rid, (unsigned long *) &mask, sizeof(mask) * 8) {
+   bitmap_from_u64(_mask, mask);
+   for_each_set_bit(rid, _mask, sizeof(mask) * 8) {
u64 val = regs[i++];
 
printf(" %-5s 0x%" PRIx64 "\n",
-- 
2.7.4

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[RESEND PATCH 2/2] perf/core: Fix the mask in perf_output_sample_regs

2016-08-03 Thread Madhavan Srinivasan

When decoding the perf_regs mask in perf_output_sample_regs(),
we loop through the mask using find_first_bit and find_next_bit functions.
While the exisitng code works fine in most of the case,
the logic is broken for 32bit kernel (Big Endian).
When reading u64 mask using (u32 *)(&val)[0], find_*_bit() assumes it gets
lower 32bits of u64 but instead gets upper 32bits which is wrong.
Proposed fix is to swap the words of the u64 to handle this case.
This is _not_ endianness swap.

Suggested-by: Yury Norov 
Reviewed-by: Yury Norov 
Cc: Yury Norov 
Cc: Peter Zijlstra 
Cc: Ingo Molnar 
Cc: Arnaldo Carvalho de Melo 
Cc: Alexander Shishkin 
Cc: Jiri Olsa 
Cc: Michael Ellerman 
Signed-off-by: Madhavan Srinivasan 
---
 include/linux/bitmap.h |  2 ++
 kernel/events/core.c   |  4 +++-
 lib/bitmap.c   | 19 +++
 3 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/include/linux/bitmap.h b/include/linux/bitmap.h
index 27bfc0b631a9..6f2cc9eb12d9 100644
--- a/include/linux/bitmap.h
+++ b/include/linux/bitmap.h
@@ -188,6 +188,8 @@ extern int bitmap_print_to_pagebuf(bool list, char *buf,
 #define small_const_nbits(nbits) \
(__builtin_constant_p(nbits) && (nbits) <= BITS_PER_LONG)
 
+extern void bitmap_from_u64(unsigned long *dst, u64 mask);
+
 static inline void bitmap_zero(unsigned long *dst, unsigned int nbits)
 {
if (small_const_nbits(nbits))
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 356a6c7cb52a..f5ed20a63a5e 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5269,8 +5269,10 @@ perf_output_sample_regs(struct perf_output_handle 
*handle,
struct pt_regs *regs, u64 mask)
 {
int bit;
+   DECLARE_BITMAP(_mask, 64);
 
-   for_each_set_bit(bit, (const unsigned long *) &mask,
+   bitmap_from_u64(_mask, mask);
+   for_each_set_bit(bit, _mask,
 sizeof(mask) * BITS_PER_BYTE) {
u64 val;
 
diff --git a/lib/bitmap.c b/lib/bitmap.c
index eca88087fa8a..2b9bda507645 100644
--- a/lib/bitmap.c
+++ b/lib/bitmap.c
@@ -1170,3 +1170,22 @@ void bitmap_copy_le(unsigned long *dst, const unsigned 
long *src, unsigned int n
 }
 EXPORT_SYMBOL(bitmap_copy_le);
 #endif
+
+/*
+ * bitmap_from_u64 - Check and swap words within u64.
+ *  @mask: source bitmap
+ *  @dst:  destination bitmap
+ *
+ * In 32bit Big Endian kernel, when using (u32 *)(&val)[*]
+ * to read u64 mask, we will get wrong word.
+ * That is "(u32 *)(&val)[0]" gets upper 32 bits,
+ * but expected could be lower 32bits of u64.
+ */
+void bitmap_from_u64(unsigned long *dst, u64 mask)
+{
+   dst[0] = mask & ULONG_MAX;
+
+   if (sizeof(mask) > sizeof(unsigned long))
+   dst[1] = mask >> 32;
+}
+EXPORT_SYMBOL(bitmap_from_u64);
-- 
2.7.4

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: powerpc allyesconfig / allmodconfig linux-next next-20160729 - next-20160729 build failures

On Wednesday, August 3, 2016 10:23:24 AM CEST Stephen Rothwell wrote:
> Hi Luis,
> 
> On Wed, 3 Aug 2016 00:02:43 +0200 "Luis R. Rodriguez"  
> wrote:
> >
> > Thanks for the confirmation. For how long is it known this is broken?
> > Does anyone care and fix these ? Or is this best effort?
> 
> This has been broken for many years 
> 
> I have a couple of times almost fixed it, but it requires that we
> change from using "ld -r" to build the built-in.o objects and some
> changes to the powerpc head.S code ... I will give it another shot now
> that the merge window is almost over (and linux-next goes into its
> quieter time).

Using a different way to link the kernel would also help us with
the remaining allyesconfig problem on ARM, as the problem is only in
'ld -r' not producing trampolines for symbols that later cannot get
them any more. It would probably also help building with ld.gold,
which is currently not working.

What is your suggested alternative?

Arnd
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH] powernv: Search for new flash DT node location

Quoting Jack Miller (2016-08-02 06:50:35)
> Skiboot will place the flash device tree node at ibm,opal/flash/flash@0
> on P9 and later systems, so Linux needs to search for it there as well
> as ibm,opal/flash@0 for backwards compatibility.
> 
> Signed-off-by: Jack Miller 
> ---
>  arch/powerpc/platforms/powernv/opal.c | 7 ++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/platforms/powernv/opal.c 
> b/arch/powerpc/platforms/powernv/opal.c
> index ae29eaf..2847cb0 100644
> --- a/arch/powerpc/platforms/powernv/opal.c
> +++ b/arch/powerpc/platforms/powernv/opal.c
> @@ -755,9 +755,14 @@ static int __init opal_init(void)
>  
> /* Initialize platform devices: IPMI backend, PRD & flash interface */
> opal_pdev_init(opal_node, "ibm,opal-ipmi");
> -   opal_pdev_init(opal_node, "ibm,opal-flash");
> +   opal_pdev_init(opal_node, "ibm,opal-flash"); // old <= P8 flash 
> location
> opal_pdev_init(opal_node, "ibm,opal-prd");
>  
> +   /* New >= P9 flash location */
> +   np = of_get_child_by_name(opal_node, "flash");
> +   if (np)
> +   opal_pdev_init(np, "ibm,opal-flash");

We could instead just search for all nodes that are compatible with
"ibm,opal-flash". We do that for i2c, see opal_i2c_create_devs().

Is there a particular reason not to do that?

cheers
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH] rtc-opal: Fix handling of firmware error codes, prevent busy loops