Re: [PATCH v2 1/6] powerpc/perf: Define big-endian version of perf_mem_data_src

2017-03-15 Thread Madhavan Srinivasan



On Wednesday 15 March 2017 05:53 PM, Peter Zijlstra wrote:

On Wed, Mar 15, 2017 at 05:20:15PM +1100, Michael Ellerman wrote:


I see no implementation; so why are you poking at it.

Maddy has posted an implementation of the kernel part for powerpc in
patch 2 of this series, but maybe you're not on Cc?

I am not indeed. That and a completely inadequate Changelog have lead to
great confusion.


Yes. my bad. I will send out a v3 today and will CC. Also will add
ellerman's explanation to the commit message.

Sorry for the confusion.

Maddy




Re: [PATCH v2 1/6] powerpc/perf: Define big-endian version of perf_mem_data_src

2017-03-15 Thread Madhavan Srinivasan



On Wednesday 15 March 2017 11:50 AM, Michael Ellerman wrote:

Hi Peter,

Peter Zijlstra  writes:

On Tue, Mar 14, 2017 at 02:31:51PM +0530, Madhavan Srinivasan wrote:


Huh? PPC hasn't yet implemented this? Then why are you fixing it?

yes, PPC hasn't implemented this (until now).

until now where?

On powerpc there is currently no kernel support for filling the data_src
value with anything meaningful.

A user can still request PERF_SAMPLE_DATA_SRC (perf report -d), but they
just get the default value from perf_sample_data_init(), which is
PERF_MEM_NA.

Though even that is currently broken with a big endian perf tool.


And did not understand "Then why are you fixing it?"

I see no implementation; so why are you poking at it.

Maddy has posted an implementation of the kernel part for powerpc in
patch 2 of this series, but maybe you're not on Cc?


Sorry, was out yesterday.

Yes my bad. I CCed lkml and ppcdev and took the emails
from get_maintainer script and added to each file.

I will send out a v3 with peterz and others in all patch.



Regardless of us wanting to do the kernel side on powerpc, the current
API is broken on big endian.

That's because in the kernel the PERF_MEM_NA value is constructed using
shifts:

   /* TLB access */
   #define PERF_MEM_TLB_NA  0x01 /* not available */
   ...
   #define PERF_MEM_TLB_SHIFT   26
   
   #define PERF_MEM_S(a, s) \

(((__u64)PERF_MEM_##a##_##s) << PERF_MEM_##a##_SHIFT)
   
   #define PERF_MEM_NA (PERF_MEM_S(OP, NA)   |\

PERF_MEM_S(LVL, NA)   |\
PERF_MEM_S(SNOOP, NA) |\
PERF_MEM_S(LOCK, NA)  |\
PERF_MEM_S(TLB, NA))

Which works out as:

   ((0x01 << 0) | (0x01 << 5) | (0x01 << 19) | (0x01 << 24) | (0x01 << 26))


Which means the PERF_MEM_NA value comes out of the kernel as 0x5080021
in CPU endian.

But then in the perf tool, the code uses the bitfields to inspect the
value, and currently the bitfields are defined using little endian
ordering.

So eg. in perf_mem__tlb_scnprintf() we see:
   data_src->val = 0x5080021
  op = 0x0
 lvl = 0x0
   snoop = 0x0
lock = 0x0
dtlb = 0x0
rsvd = 0x5080021


So this patch does what I think is the minimal fix, of changing the
definition of the bitfields to match the values that are already
exported by the kernel on big endian. And it makes no change on little
endian.


Thanks for the detailed explanation. I will add this to the patch
commit message in the v3.

Maddy



cheers





Re: [PATCH v3 4/7] macintosh: Only descend into directory when CONFIG_MACINTOSH_DRIVERS is set

2017-03-15 Thread Michael Ellerman
"Andrew F. Davis"  writes:

> When CONFIG_MACINTOSH_DRIVERS is not set make will still descend into the
> macintosh directory but nothing will be built. This produces unneeded
> build artifacts and messages in addition to slowing the build.
> Fix this here.
>
> Signed-off-by: Andrew F. Davis 
> ---
>  drivers/Makefile | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

LGTM.

Acked-by: Michael Ellerman 

cheersj


Re: [PATCH kernel v8 10/10] KVM: PPC: VFIO: Add in-kernel acceleration for VFIO

2017-03-15 Thread David Gibson
On Wed, Mar 15, 2017 at 10:18:18AM -0600, Alex Williamson wrote:
> On Wed, 15 Mar 2017 15:40:14 +1100
> David Gibson  wrote:
> > > > diff --git a/arch/powerpc/kvm/book3s_64_vio.c 
> > > > b/arch/powerpc/kvm/book3s_64_vio.c
> > > > index e96a4590464c..be18cda01e1b 100644
> > > > --- a/arch/powerpc/kvm/book3s_64_vio.c
> > > > +++ b/arch/powerpc/kvm/book3s_64_vio.c
> > > > @@ -28,6 +28,10 @@
> > > >  #include 
> > > >  #include 
> > > >  #include 
> > > > +#include 
> > > > +#include 
> > > > +#include 
> > > > +#include 
> > > >  
> > > >  #include 
> > > >  #include 
> > > > @@ -40,6 +44,36 @@
> > > >  #include 
> > > >  #include 
> > > >  #include 
> > > > +#include 
> > > > +
> > > > +static void kvm_vfio_group_put_external_user(struct vfio_group 
> > > > *vfio_group)
> > > > +{
> > > > +   void (*fn)(struct vfio_group *);
> > > > +
> > > > +   fn = symbol_get(vfio_group_put_external_user);
> > > > +   if (WARN_ON(!fn))
> > > > +   return;
> > > > +
> > > > +   fn(vfio_group);
> > > > +
> > > > +   symbol_put(vfio_group_put_external_user);
> > > > +}
> > > > +
> > > > +static int kvm_vfio_external_user_iommu_id(struct vfio_group 
> > > > *vfio_group)
> > > > +{
> > > > +   int (*fn)(struct vfio_group *);
> > > > +   int ret = -1;
> > > > +
> > > > +   fn = symbol_get(vfio_external_user_iommu_id);
> > > > +   if (!fn)
> > > > +   return ret;
> > > > +
> > > > +   ret = fn(vfio_group);
> > > > +
> > > > +   symbol_put(vfio_external_user_iommu_id);
> > > > +
> > > > +   return ret;
> > > > +}  
> > > 
> > > 
> > > Ugh.  This feels so wrong.  Why can't you have kvm-vfio pass the
> > > iommu_group?  Why do you need to hold this additional vfio_group
> > > reference?  
> > 
> > Keeping the vfio_group reference makes sense to me, since we don't
> > want the vfio context for the group to go away while it's attached to
> > the LIOBN.
> 
> But there's already a reference for that, it's taken by
> KVM_DEV_VFIO_GROUP_ADD and held until KVM_DEV_VFIO_GROUP_DEL.  Both the
> DEL path and the cleanup path call kvm_spapr_tce_release_iommu_group()
> before releasing that reference, so it seems entirely redundant.

Oh, good point.  And we already verify that the group has been ADDed
before setting the LIOBN association.

> > However, going via the iommu_id rather than just having an interface
> > to directly grab the iommu group from the vfio_group seems bizarre to
> > me.  I'm ok with cleaning that up later, however.
> 
> We have kvm_spapr_tce_attach_iommu_group() and
> kvm_spapr_tce_release_iommu_group(), but both take a vfio_group, not an
> iommu_group as a parameter.  I don't particularly have a problem with
> the vfio_group -> iommu ID -> iommu_group, but if we drop the extra
> vfio_group reference and pass the iommu_group itself to these functions
> then we can keep all the symbol reference stuff in the kvm-vfio glue
> layer.  Thanks,

Makes sense.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


signature.asc
Description: PGP signature


[PATCH] powerpc: Teach oops about radix vectors

2017-03-15 Thread Michael Neuling
Currently if we oops caused by an 0x380 or 0x480 exception, we get a
print which assumes SLB problems. With radix, these vectors have
different meanings.

This patch updates the oops message to reflect these different
meanings.

Signed-off-by: Michael Neuling 
---
 arch/powerpc/xmon/xmon.c | 14 --
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
index 16321ad9e7..7781f78964 100644
--- a/arch/powerpc/xmon/xmon.c
+++ b/arch/powerpc/xmon/xmon.c
@@ -1347,9 +1347,19 @@ const char *getvecname(unsigned long vec)
case 0x100: ret = "(System Reset)"; break;
case 0x200: ret = "(Machine Check)"; break;
case 0x300: ret = "(Data Access)"; break;
-   case 0x380: ret = "(Data SLB Access)"; break;
+   case 0x380:
+   if (radix_enabled())
+   ret = "(Data Access Out of Range)";
+   else
+   ret = "(Data SLB Access)";
+   break;
case 0x400: ret = "(Instruction Access)"; break;
-   case 0x480: ret = "(Instruction SLB Access)"; break;
+   case 0x480:
+   if (radix_enabled())
+   ret = "(Instruction Access Out of Range)";
+   else
+   ret = "(Instruction SLB Access)";
+   break;
case 0x500: ret = "(Hardware Interrupt)"; break;
case 0x600: ret = "(Alignment)"; break;
case 0x700: ret = "(Program Check)"; break;
-- 
2.9.3



Re: [PATCH kernel v8 10/10] KVM: PPC: VFIO: Add in-kernel acceleration for VFIO

2017-03-15 Thread Alexey Kardashevskiy
On 16/03/17 03:39, Alex Williamson wrote:
> On Thu, 16 Mar 2017 00:21:07 +1100
> Alexey Kardashevskiy  wrote:
> 
>> On 15/03/17 08:05, Alex Williamson wrote:
>>> On Fri, 10 Mar 2017 14:53:37 +1100
>>> Alexey Kardashevskiy  wrote:
>>>   
 This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
 and H_STUFF_TCE requests targeted an IOMMU TCE table used for VFIO
 without passing them to user space which saves time on switching
 to user space and back.

 This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
 KVM tries to handle a TCE request in the real mode, if failed
 it passes the request to the virtual mode to complete the operation.
 If it a virtual mode handler fails, the request is passed to
 the user space; this is not expected to happen though.

 To avoid dealing with page use counters (which is tricky in real mode),
 this only accelerates SPAPR TCE IOMMU v2 clients which are required
 to pre-register the userspace memory. The very first TCE request will
 be handled in the VFIO SPAPR TCE driver anyway as the userspace view
 of the TCE table (iommu_table::it_userspace) is not allocated till
 the very first mapping happens and we cannot call vmalloc in real mode.

 If we fail to update a hardware IOMMU table unexpected reason, we just
 clear it and move on as there is nothing really we can do about it -
 for example, if we hot plug a VFIO device to a guest, existing TCE tables
 will be mirrored automatically to the hardware and there is no interface
 to report to the guest about possible failures.

 This adds new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to
 the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd
 and associates a physical IOMMU table with the SPAPR TCE table (which
 is a guest view of the hardware IOMMU table). The iommu_table object
 is cached and referenced so we do not have to look up for it in real mode.

 This does not implement the UNSET counterpart as there is no use for it -
 once the acceleration is enabled, the existing userspace won't
 disable it unless a VFIO container is destroyed; this adds necessary
 cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.

 As this creates a descriptor per IOMMU table-LIOBN couple (called
 kvmppc_spapr_tce_iommu_table), it is possible to have several
 descriptors with the same iommu_table (hardware IOMMU table) attached
 to the same LIOBN; we do not remove duplicates though as
 iommu_table_ops::exchange not just update a TCE entry (which is
 shared among IOMMU groups) but also invalidates the TCE cache
 (one per IOMMU group).

 This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
 space.

 This adds real mode version of WARN_ON_ONCE() as the generic version
 causes problems with rcu_sched. Since we testing what vmalloc_to_phys()
 returns in the code, this also adds a check for already existing
 vmalloc_to_phys() call in kvmppc_rm_h_put_tce_indirect().

 This finally makes use of vfio_external_user_iommu_id() which was
 introduced quite some time ago and was considered for removal.

 Tests show that this patch increases transmission speed from 220MB/s
 to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card).

 Signed-off-by: Alexey Kardashevskiy 
 ---
 Changes:
 v8:
 * changed all (!pua) checks to return H_TOO_HARD as ioctl() is supposed
 to handle them
 * changed vmalloc_to_phys() callers to return H_HARDWARE
 * changed real mode iommu_tce_xchg_rm() callers to return H_TOO_HARD
 and added a comment about this in the code
 * changed virtual mode iommu_tce_xchg() callers to return H_HARDWARE
 and do WARN_ON
 * added WARN_ON_ONCE_RM(!rmap) in kvmppc_rm_h_put_tce_indirect() to
 have all vmalloc_to_phys() callsites covered

 v7:
 * added realmode-friendly WARN_ON_ONCE_RM

 v6:
 * changed handling of errors returned by kvmppc_(rm_)tce_iommu_(un)map()
 * moved kvmppc_gpa_to_ua() to TCE validation

 v5:
 * changed error codes in multiple places
 * added bunch of WARN_ON() in places which should not really happen
 * adde a check that an iommu table is not attached already to LIOBN
 * dropped explicit calls to iommu_tce_clear_param_check/
 iommu_tce_put_param_check as kvmppc_tce_validate/kvmppc_ioba_validate
 call them anyway (since the previous patch)
 * if we fail to update a hardware IOMMU table for unexpected reason,
 this just clears the entry

 v4:
 * added note to the commit log about allowing multiple updates of
 the same IOMMU table;
 * instead of checking for if any memory was preregistered, this
 returns H_TOO_HARD if a specific page was not;
 * fixed 

Re: [PATCH v3 0/7] Remove unneeded build directory traversals

2017-03-15 Thread Andrew F. Davis
On 03/15/2017 04:03 PM, Arnd Bergmann wrote:
> On Wed, Mar 15, 2017 at 5:37 PM, Andrew F. Davis  wrote:
>> Hello all,
>>
>> I was building a kernel for x86 and noticed Make still descended into
>> directories like drivers/gpu/drm/hisilicon, this seems kind of odd given
>> nothing will be built here. It looks to be due to some directories being
>> included in obj-y unconditionally instead of only when the relevant
>> CONFIG_ is set.
>>
>> These patches are split by subsystem in-case, for some reason, a file in
>> a directory does need to be built, I believe I have checked for all
>> instances of this, but a quick review from some maintainers would be nice.
> 
> I didn't see anything wrong with the patches, and made sure that there
> are no tristate symbols controlling the subdirectory for anything that
> requires a built-in driver (which would cause a link failure).
> 
> I'm not sure about drivers/lguest, which has some special magic
> in its Makefile, it's possible that this now fails with CONFIG_LGUEST=m.
> 

lguest and mmc are the strange ones, so I put them last in the series in
case they did need to be dropped.

lguest was supposed to have been taken from v1:
https://lkml.org/lkml/2016/6/20/1086
but it looks like it didn't so I re-introduced it for v3.

mmc caught some 0-day build warnings but I never got to the bottom of them.

Anyway, I have no problem with these two being held back until the magic
in their Makefile is sorted out.

Thanks,
Andrew

>   Arnd
> 


RE: [PATCH 1/4] crypto: powerpc - Factor out the core CRC vpmsum algorithm

2017-03-15 Thread Daniel Axtens
Hi David,

> While not part of this change, the unrolled loops look as though
> they just destroy the cpu cache.
> I'd like be convinced that anything does CRC over long enough buffers
> to make it a gain at all.
>
> With modern (not that modern now) superscalar cpus you can often
> get the loop instructions 'for free'.
> Sometimes pipelining the loop is needed to get full throughput.
> Unlike the IP checksum, you don't even have to 'loop carry' the
> cpu carry flag.

Internal testing on a NVMe device with T10DIF enabled on 4k blocks
shows a 20x - 30x improvement. Without these patches, crc_t10dif_generic
uses over 60% of CPU time - with these patches CRC drops to single
digits.

I should probably have lead with that, sorry.

FWIW, the original patch showed a 3.7x gain on btrfs as well -
6dd7a82cc54e ("crypto: powerpc - Add POWER8 optimised crc32c")

When Anton wrote the original code he had access to IBM's internal
tooling for looking at how instructions flow through the various stages
of the CPU, so I trust it's pretty much optimal from that point of view.

Regards,
Daniel


Re: [RFC PATCH 00/13] Introduce first class virtual address spaces

2017-03-15 Thread Till Smejkal
On Wed, 15 Mar 2017, Rich Felker wrote:
> On Wed, Mar 15, 2017 at 12:44:47PM -0700, Till Smejkal wrote:
> > On Wed, 15 Mar 2017, Andy Lutomirski wrote:
> > > > One advantage of VAS segments is that they can be globally queried by 
> > > > user programs
> > > > which means that VAS segments can be shared by applications that not 
> > > > necessarily have
> > > > to be related. If I am not mistaken, MAP_SHARED of pure in memory data 
> > > > will only work
> > > > if the tasks that share the memory region are related (aka. have a 
> > > > common parent that
> > > > initialized the shared mapping). Otherwise, the shared mapping have to 
> > > > be backed by a
> > > > file.
> > > 
> > > What's wrong with memfd_create()?
> > > 
> > > > VAS segments on the other side allow sharing of pure in memory data by
> > > > arbitrary related tasks without the need of a file. This becomes 
> > > > especially
> > > > interesting if one combines VAS segments with non-volatile memory since 
> > > > one can keep
> > > > data structures in the NVM and still be able to share them between 
> > > > multiple tasks.
> > > 
> > > What's wrong with regular mmap?
> > 
> > I never wanted to say that there is something wrong with regular mmap. We 
> > just
> > figured that with VAS segments you could remove the need to mmap your 
> > shared data but
> > instead can keep everything purely in memory.
> > 
> > Unfortunately, I am not at full speed with memfds. Is my understanding 
> > correct that
> > if the last user of such a file descriptor closes it, the corresponding 
> > memory is
> > freed? Accordingly, memfd cannot be used to keep data in memory while no 
> > program is
> > currently using it, can it? To be able to do this you need again some 
> > representation
> 
> I have a name for application-allocated kernel resources that persist
> without a process holding a reference to them or a node in the
> filesystem: a bug. See: sysvipc.

VAS segments are first class citizens of the OS similar to processes. 
Accordingly, I
would not see this behavior as a bug. VAS segments are a kernel handle to
"persistent" memory (in the sense that they are independent of the lifetime of 
the
application that created them). That means the memory that is described by VAS
segments can be reused by other applications even if the VAS segment was not 
used by
any application in between. It is very much like a pure in-memory file. An
application creates a VAS segment, fills it with content and if it does not 
delete it
again, can reuse/open it again later. This also means, that if you know that you
never want to use this memory again you have to remove it explicitly, like you 
have
to remove a file, if you don't want to use it anymore.

I think it really might be better to implement VAS segments (if I should keep 
this
feature at all) with a special purpose filesystem. The way I've designed it 
seams to
be very misleading.

Till


Re: [RFC PATCH 00/13] Introduce first class virtual address spaces

2017-03-15 Thread Andy Lutomirski
On Wed, Mar 15, 2017 at 12:44 PM, Till Smejkal
 wrote:
> On Wed, 15 Mar 2017, Andy Lutomirski wrote:
>> > One advantage of VAS segments is that they can be globally queried by user 
>> > programs
>> > which means that VAS segments can be shared by applications that not 
>> > necessarily have
>> > to be related. If I am not mistaken, MAP_SHARED of pure in memory data 
>> > will only work
>> > if the tasks that share the memory region are related (aka. have a common 
>> > parent that
>> > initialized the shared mapping). Otherwise, the shared mapping have to be 
>> > backed by a
>> > file.
>>
>> What's wrong with memfd_create()?
>>
>> > VAS segments on the other side allow sharing of pure in memory data by
>> > arbitrary related tasks without the need of a file. This becomes especially
>> > interesting if one combines VAS segments with non-volatile memory since 
>> > one can keep
>> > data structures in the NVM and still be able to share them between 
>> > multiple tasks.
>>
>> What's wrong with regular mmap?
>
> I never wanted to say that there is something wrong with regular mmap. We just
> figured that with VAS segments you could remove the need to mmap your shared 
> data but
> instead can keep everything purely in memory.

memfd does that.

>
> Unfortunately, I am not at full speed with memfds. Is my understanding 
> correct that
> if the last user of such a file descriptor closes it, the corresponding 
> memory is
> freed? Accordingly, memfd cannot be used to keep data in memory while no 
> program is
> currently using it, can it?

No, stop right here.  If you want to have a bunch of memory that
outlives the program that allocates it, use a filesystem (tmpfs,
hugetlbfs, ext4, whatever).  Don't create new persistent kernel
things.

> VAS segments on the other side would provide a functionality to
> achieve the same without the need of any mounted filesystem. However, I 
> agree, that
> this is just a small advantage compared to what can already be achieved with 
> the
> existing functionality provided by the Linux kernel.

I see this "small advantage" as "resource leak and security problem".

>> This sounds complicated and fragile.  What happens if a heuristically
>> shared region coincides with a region in the "first class address
>> space" being selected?
>
> If such a conflict happens, the task cannot use the first class address space 
> and the
> corresponding system call will return an error. However, with the current 
> available
> virtual address space size that programs can use, such conflicts are probably 
> rare.

A bug that hits 1% of the time is often worse than one that hits 100%
of the time because debugging it is miserable.

--Andy


Re: [RFC PATCH 00/13] Introduce first class virtual address spaces

2017-03-15 Thread Rich Felker
On Wed, Mar 15, 2017 at 12:44:47PM -0700, Till Smejkal wrote:
> On Wed, 15 Mar 2017, Andy Lutomirski wrote:
> > > One advantage of VAS segments is that they can be globally queried by 
> > > user programs
> > > which means that VAS segments can be shared by applications that not 
> > > necessarily have
> > > to be related. If I am not mistaken, MAP_SHARED of pure in memory data 
> > > will only work
> > > if the tasks that share the memory region are related (aka. have a common 
> > > parent that
> > > initialized the shared mapping). Otherwise, the shared mapping have to be 
> > > backed by a
> > > file.
> > 
> > What's wrong with memfd_create()?
> > 
> > > VAS segments on the other side allow sharing of pure in memory data by
> > > arbitrary related tasks without the need of a file. This becomes 
> > > especially
> > > interesting if one combines VAS segments with non-volatile memory since 
> > > one can keep
> > > data structures in the NVM and still be able to share them between 
> > > multiple tasks.
> > 
> > What's wrong with regular mmap?
> 
> I never wanted to say that there is something wrong with regular mmap. We just
> figured that with VAS segments you could remove the need to mmap your shared 
> data but
> instead can keep everything purely in memory.
> 
> Unfortunately, I am not at full speed with memfds. Is my understanding 
> correct that
> if the last user of such a file descriptor closes it, the corresponding 
> memory is
> freed? Accordingly, memfd cannot be used to keep data in memory while no 
> program is
> currently using it, can it? To be able to do this you need again some 
> representation

I have a name for application-allocated kernel resources that persist
without a process holding a reference to them or a node in the
filesystem: a bug. See: sysvipc.

Rich


Re: [RFC PATCH 00/13] Introduce first class virtual address spaces

2017-03-15 Thread Till Smejkal
On Wed, 15 Mar 2017, Andy Lutomirski wrote:
> > One advantage of VAS segments is that they can be globally queried by user 
> > programs
> > which means that VAS segments can be shared by applications that not 
> > necessarily have
> > to be related. If I am not mistaken, MAP_SHARED of pure in memory data will 
> > only work
> > if the tasks that share the memory region are related (aka. have a common 
> > parent that
> > initialized the shared mapping). Otherwise, the shared mapping have to be 
> > backed by a
> > file.
> 
> What's wrong with memfd_create()?
> 
> > VAS segments on the other side allow sharing of pure in memory data by
> > arbitrary related tasks without the need of a file. This becomes especially
> > interesting if one combines VAS segments with non-volatile memory since one 
> > can keep
> > data structures in the NVM and still be able to share them between multiple 
> > tasks.
> 
> What's wrong with regular mmap?

I never wanted to say that there is something wrong with regular mmap. We just
figured that with VAS segments you could remove the need to mmap your shared 
data but
instead can keep everything purely in memory.

Unfortunately, I am not at full speed with memfds. Is my understanding correct 
that
if the last user of such a file descriptor closes it, the corresponding memory 
is
freed? Accordingly, memfd cannot be used to keep data in memory while no 
program is
currently using it, can it? To be able to do this you need again some 
representation
of the data in a file? Yes, you can use a tmpfs to keep the file content in 
memory as
well, or some DAX filesystem to keep the file content in NVM, but this always
requires that such filesystems are mounted in the system that the application is
currently running on. VAS segments on the other side would provide a 
functionality to
achieve the same without the need of any mounted filesystem. However, I agree, 
that
this is just a small advantage compared to what can already be achieved with the
existing functionality provided by the Linux kernel. I probably need to revisit 
the
whole idea of first class virtual address space segments before continuing with 
this
pacthset. Thank you very much for the great feedback.

> >> >> Ick.  Please don't do this.  Can we please keep an mm as just an mm
> >> >> and not make it look magically different depending on which process
> >> >> maps it?  If you need a trampoline (which you do, of course), just
> >> >> write a trampoline in regular user code and map it manually.
> >> >
> >> > Did I understand you correctly that you are proposing that the switching 
> >> > thread
> >> > should make sure by itself that its code, stack, … memory regions are 
> >> > properly setup
> >> > in the new AS before/after switching into it? I think, this would make 
> >> > using first
> >> > class virtual address spaces much more difficult for user applications 
> >> > to the extend
> >> > that I am not even sure if they can be used at all. At the moment, 
> >> > switching into a
> >> > VAS is a very simple operation for an application because the kernel 
> >> > will just simply
> >> > do the right thing.
> >>
> >> Yes.  I think that having the same mm_struct look different from
> >> different tasks is problematic.  Getting it right in the arch code is
> >> going to be nasty.  The heuristics of what to share are also tough --
> >> why would text + data + stack or whatever you're doing be adequate?
> >> What if you're in a thread?  What if two tasks have their stacks in
> >> the same place?
> >
> > The different ASes that a task now can have when it uses first class 
> > virtual address
> > spaces are not realized in the kernel by using only one mm_struct per task 
> > that just
> > looks differently but by using multiple mm_structs - one for each AS that 
> > the task
> > can execute in. When a task attaches a first class virtual address space to 
> > itself to
> > be able to use another AS, the kernel adds a temporary mm_struct to this 
> > task that
> > contains the mappings of the first class virtual address space and the one 
> > shared
> > with the task's original AS. If a thread now wants to switch into this 
> > attached first
> > class virtual address space the kernel only changes the 'mm' and 
> > 'active_mm' pointers
> > in the task_struct of the thread to the temporary mm_struct and performs the
> > corresponding mm_switch operation. The original mm_struct of the thread 
> > will not be
> > changed.
> >
> > Accordingly, I do not magically make mm_structs look differently depending 
> > on the
> > task that uses it, but create temporary mm_structs that only contain 
> > mappings to the
> > same memory regions.
> 
> This sounds complicated and fragile.  What happens if a heuristically
> shared region coincides with a region in the "first class address
> space" being selected?

If such a conflict happens, the task cannot use the first class address space 
and the
corresponding system call will return an error. 

Re: [RFC PATCH 00/13] Introduce first class virtual address spaces

2017-03-15 Thread Matthew Wilcox
On Wed, Mar 15, 2017 at 09:51:31AM -0700, Andy Lutomirski wrote:
> > VAS segments on the other side allow sharing of pure in memory data by
> > arbitrary related tasks without the need of a file. This becomes especially
> > interesting if one combines VAS segments with non-volatile memory since one 
> > can keep
> > data structures in the NVM and still be able to share them between multiple 
> > tasks.
> 
> What's wrong with regular mmap?

I think it's the usual misunderstandings about how to use mmap.
>From the paper:

   Memory-centric computing demands careful organization of the
   virtual address space, but interfaces such as mmap only give limited
   control. Some systems do not support creation of address regions at
   specific offsets. In Linux, for example, mmap does not safely abort if
   a request is made to open a region of memory over an existing region;
   it simply writes over it.

The correct answer of course, is "Don't specify MAP_FIXED".  Specify the
'hint' address, and if you don't get it, either fix up your data structure
pointers, or just abort and complain noisily.


Re: [RFC PATCH 00/13] Introduce first class virtual address spaces

2017-03-15 Thread Andy Lutomirski
On Tue, Mar 14, 2017 at 9:12 AM, Till Smejkal
 wrote:
> On Mon, 13 Mar 2017, Andy Lutomirski wrote:
>> On Mon, Mar 13, 2017 at 7:07 PM, Till Smejkal
>>  wrote:
>> > On Mon, 13 Mar 2017, Andy Lutomirski wrote:
>> >> This sounds rather complicated.  Getting TLB flushing right seems
>> >> tricky.  Why not just map the same thing into multiple mms?
>> >
>> > This is exactly what happens at the end. The memory region that is 
>> > described by the
>> > VAS segment will be mapped in the ASes that use the segment.
>>
>> So why is this kernel feature better than just doing MAP_SHARED
>> manually in userspace?
>
> One advantage of VAS segments is that they can be globally queried by user 
> programs
> which means that VAS segments can be shared by applications that not 
> necessarily have
> to be related. If I am not mistaken, MAP_SHARED of pure in memory data will 
> only work
> if the tasks that share the memory region are related (aka. have a common 
> parent that
> initialized the shared mapping). Otherwise, the shared mapping have to be 
> backed by a
> file.

What's wrong with memfd_create()?

> VAS segments on the other side allow sharing of pure in memory data by
> arbitrary related tasks without the need of a file. This becomes especially
> interesting if one combines VAS segments with non-volatile memory since one 
> can keep
> data structures in the NVM and still be able to share them between multiple 
> tasks.

What's wrong with regular mmap?

>
>> >> Ick.  Please don't do this.  Can we please keep an mm as just an mm
>> >> and not make it look magically different depending on which process
>> >> maps it?  If you need a trampoline (which you do, of course), just
>> >> write a trampoline in regular user code and map it manually.
>> >
>> > Did I understand you correctly that you are proposing that the switching 
>> > thread
>> > should make sure by itself that its code, stack, … memory regions are 
>> > properly setup
>> > in the new AS before/after switching into it? I think, this would make 
>> > using first
>> > class virtual address spaces much more difficult for user applications to 
>> > the extend
>> > that I am not even sure if they can be used at all. At the moment, 
>> > switching into a
>> > VAS is a very simple operation for an application because the kernel will 
>> > just simply
>> > do the right thing.
>>
>> Yes.  I think that having the same mm_struct look different from
>> different tasks is problematic.  Getting it right in the arch code is
>> going to be nasty.  The heuristics of what to share are also tough --
>> why would text + data + stack or whatever you're doing be adequate?
>> What if you're in a thread?  What if two tasks have their stacks in
>> the same place?
>
> The different ASes that a task now can have when it uses first class virtual 
> address
> spaces are not realized in the kernel by using only one mm_struct per task 
> that just
> looks differently but by using multiple mm_structs - one for each AS that the 
> task
> can execute in. When a task attaches a first class virtual address space to 
> itself to
> be able to use another AS, the kernel adds a temporary mm_struct to this task 
> that
> contains the mappings of the first class virtual address space and the one 
> shared
> with the task's original AS. If a thread now wants to switch into this 
> attached first
> class virtual address space the kernel only changes the 'mm' and 'active_mm' 
> pointers
> in the task_struct of the thread to the temporary mm_struct and performs the
> corresponding mm_switch operation. The original mm_struct of the thread will 
> not be
> changed.
>
> Accordingly, I do not magically make mm_structs look differently depending on 
> the
> task that uses it, but create temporary mm_structs that only contain mappings 
> to the
> same memory regions.

This sounds complicated and fragile.  What happens if a heuristically
shared region coincides with a region in the "first class address
space" being selected?

I think the right solution is "you're a user program playing virtual
address games -- make sure you do it right".

--Andy


Re: [PATCH v3 0/7] Remove unneeded build directory traversals

2017-03-15 Thread Arnd Bergmann
On Wed, Mar 15, 2017 at 10:15 PM, Andrew F. Davis  wrote:
> On 03/15/2017 04:03 PM, Arnd Bergmann wrote:
>> On Wed, Mar 15, 2017 at 5:37 PM, Andrew F. Davis  wrote:
>>> Hello all,
>>>
>>> I was building a kernel for x86 and noticed Make still descended into
>>> directories like drivers/gpu/drm/hisilicon, this seems kind of odd given
>>> nothing will be built here. It looks to be due to some directories being
>>> included in obj-y unconditionally instead of only when the relevant
>>> CONFIG_ is set.
>>>
>>> These patches are split by subsystem in-case, for some reason, a file in
>>> a directory does need to be built, I believe I have checked for all
>>> instances of this, but a quick review from some maintainers would be nice.
>>
>> I didn't see anything wrong with the patches, and made sure that there
>> are no tristate symbols controlling the subdirectory for anything that
>> requires a built-in driver (which would cause a link failure).
>>
>> I'm not sure about drivers/lguest, which has some special magic
>> in its Makefile, it's possible that this now fails with CONFIG_LGUEST=m.
>>
>
> lguest and mmc are the strange ones, so I put them last in the series in
> case they did need to be dropped.
>
> lguest was supposed to have been taken from v1:
> https://lkml.org/lkml/2016/6/20/1086
> but it looks like it didn't so I re-introduced it for v3.
>
> mmc caught some 0-day build warnings but I never got to the bottom of them.

Ah, I see now what happened to mmc:

obj-$(subst m,y,$(CONFIG_MMC))  += host/
tmio_mmc_core-$(subst m,y,$(CONFIG_MMC_SDHI))   += tmio_mmc_dma.o
obj-$(subst m,y,$(CONFIG_MMC_SDHCI_PCI))+= sdhci-pci-data.o

with CONFIG_MMC=m, this will fail to build the built-in files in
drivers/mmc/host. I suppose this could be expressed in a different
way these days, but dropping the patch would be easier.

 Arnd


Re: [PATCH v3 0/7] Remove unneeded build directory traversals

2017-03-15 Thread Arnd Bergmann
On Wed, Mar 15, 2017 at 5:37 PM, Andrew F. Davis  wrote:
> Hello all,
>
> I was building a kernel for x86 and noticed Make still descended into
> directories like drivers/gpu/drm/hisilicon, this seems kind of odd given
> nothing will be built here. It looks to be due to some directories being
> included in obj-y unconditionally instead of only when the relevant
> CONFIG_ is set.
>
> These patches are split by subsystem in-case, for some reason, a file in
> a directory does need to be built, I believe I have checked for all
> instances of this, but a quick review from some maintainers would be nice.

I didn't see anything wrong with the patches, and made sure that there
are no tristate symbols controlling the subdirectory for anything that
requires a built-in driver (which would cause a link failure).

I'm not sure about drivers/lguest, which has some special magic
in its Makefile, it's possible that this now fails with CONFIG_LGUEST=m.

  Arnd


Re: [GIT PULL 00/19] perf/core improvements and fixes

2017-03-15 Thread Ingo Molnar

* Arnaldo Carvalho de Melo  wrote:

> Hi Ingo,
> 
>   Please consider pulling,
> 
> - Arnaldo
> 
> Test results at the end of this message, as usual.
> 
> The following changes since commit 84e5b549214f2160c12318aac549de85f600c79a:
> 
>   Merge tag 'perf-core-for-mingo-4.11-20170306' of 
> git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/core 
> (2017-03-07 08:14:14 +0100)
> 
> are available in the git repository at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git 
> tags/perf-core-for-mingo-4.12-20170314
> 
> for you to fetch changes up to 5f6bee34707973ea7879a7857fd63ddccc92fff3:
> 
>   kprobes: Convert kprobe_exceptions_notify to use NOKPROBE_SYMBOL 
> (2017-03-14 15:17:40 -0300)
> 
> 
> perf/core improvements and fixes:
> 
> New features:
> 
> - Add PERF_RECORD_NAMESPACES so that the kernel can record information
>   required to associate samples to namespaces, helping in container
>   problem characterization.
> 
>   Now the 'perf record has a --namespace' option to ask for such info,
>   and when present, it can be used, initially, via a new sort order,
>   'cgroup_id', allowing histogram entry bucketization by a (device, inode)
>   based cgroup identifier (Hari Bathini)
> 
> - Add --next option to 'perf sched timehist', showing what is the next
>   thread to run (Brendan Gregg)
> 
> Fixes:
> 
> - Fix segfault with basic block 'cycles' sort dimension (Changbin Du)
> 
> - Add c2c to command-list.txt, making it appear in the 'perf help'
>   output (Changbin Du)
> 
> - Fix zeroing of 'abs_path' variable in the perf hists browser switch
>   file code (Changbin Du)
> 
> - Hide tips messages when -q/--quiet is given to 'perf report' (Namhyung Kim)
> 
> Infrastructure:
> 
> - Use ref_reloc_sym + offset to setup kretprobes (Naveen Rao)
> 
> - Ignore generated files pmu-events/{jevents,pmu-events.c} for git (Changbin 
> Du)
> 
> Documentation:
> 
> - Document +field style argument support for --field option (Changbin Du)
> 
> - Clarify 'perf c2c --stats' help message (Namhyung Kim)
> 
> Signed-off-by: Arnaldo Carvalho de Melo 
> 
> 
> Brendan Gregg (1):
>   perf sched timehist: Add --next option
> 
> Changbin Du (5):
>   perf tools: Missing c2c command in command-list
>   perf tools: Ignore generated files pmu-events/{jevents,pmu-events.c} 
> for git
>   perf sort: Fix segfault with basic block 'cycles' sort dimension
>   perf report: Document +field style argument support for --field option
>   perf hists browser: Fix typo in function switch_data_file
> 
> Hari Bathini (5):
>   perf: Add PERF_RECORD_NAMESPACES to include namespaces related info
>   perf tools: Add PERF_RECORD_NAMESPACES to include namespaces related 
> info
>   perf record: Synthesize namespace events for current processes
>   perf script: Add script print support for namespace events
>   perf tools: Add 'cgroup_id' sort order keyword
> 
> Namhyung Kim (3):
>   perf report: Hide tip message when -q option is given
>   perf c2c: Clarify help message of --stats option
>   perf c2c: Fix display bug when using pipe
> 
> Naveen N. Rao (5):
>   perf probe: Factor out the ftrace README scanning
>   perf kretprobes: Offset from reloc_sym if kernel supports it
>   perf powerpc: Choose local entry point with kretprobes
>   doc: trace/kprobes: add information about NOKPROBE_SYMBOL
>   kprobes: Convert kprobe_exceptions_notify to use NOKPROBE_SYMBOL
> 
>  Documentation/trace/kprobetrace.txt |   5 +-
>  include/linux/perf_event.h  |   2 +
>  include/uapi/linux/perf_event.h |  32 +-
>  kernel/events/core.c| 139 ++
>  kernel/fork.c   |   2 +
>  kernel/kprobes.c|   5 +-
>  kernel/nsproxy.c|   3 +
>  tools/include/uapi/linux/perf_event.h   |  32 +-
>  tools/perf/.gitignore   |   2 +
>  tools/perf/Documentation/perf-record.txt|   3 +
>  tools/perf/Documentation/perf-report.txt|   7 +-
>  tools/perf/Documentation/perf-sched.txt |   4 +
>  tools/perf/Documentation/perf-script.txt|   3 +
>  tools/perf/arch/powerpc/util/sym-handling.c |  14 ++-
>  tools/perf/builtin-annotate.c   |   1 +
>  tools/perf/builtin-c2c.c|   4 +-
>  tools/perf/builtin-diff.c   |   1 +
>  tools/perf/builtin-inject.c |  13 +++
>  tools/perf/builtin-kmem.c   |   1 +
>  tools/perf/builtin-kvm.c|   2 +
>  tools/perf/builtin-lock.c   |   1 +
>  tools/perf/builtin-mem.c|   1 +
>  tools/perf/builtin-record.c |  35 ++-
>  tools/perf/builtin-report.c   

Re: ioctl structs differ from x86_64?

2017-03-15 Thread Reza Arbab

On Wed, Mar 15, 2017 at 01:11:19PM -0500, Reza Arbab wrote:

https://groups.google.com/forum/#!topic/golang-nuts/K5NoG8slez0


Oops.

https://groups.google.com/d/msg/golang-nuts/K5NoG8slez0/mixUse17iaMJ

--
Reza Arbab



Re: [PATCH] drivers/pcmcia: NO_IRQ removal for electra_cf.c

2017-03-15 Thread Scott Wood
On Wed, 2017-03-15 at 16:35 +1100, Michael Ellerman wrote:
> Arnd Bergmann  writes:
> 
> > 
> > On Tue, Mar 14, 2017 at 11:51 AM, Michael Ellerman 
> > wrote:
> > > 
> > > Michael Ellerman  writes:
> > > 
> > > > 
> > > > We'd like to eventually remove NO_IRQ on powerpc, so remove usages of
> > > > it
> > > > from electra_cf.c which is a powerpc-only driver.
> > > > 
> > > > Signed-off-by: Michael Ellerman 
> > > > ---
> > > >  drivers/pcmcia/electra_cf.c | 4 ++--
> > > >  1 file changed, 2 insertions(+), 2 deletions(-)
> > > Ping anyone?
> > > 
> > > Or should I merge this via the powerpc tree?
> > That's what I would recommend for a powerpc specific pcmcia driver, yes.
> Suits me.
> 
> > 
> > Looking at the bigger picture of powerpc drivers using NO_IRQ, I also
> > see these others:
> > 
> > drivers/ata/pata_mpc52xx.c: if (ata_irq == NO_IRQ) {
> > drivers/ata/sata_dwc_460ex.c:#ifndef NO_IRQ
> > drivers/ata/sata_dwc_460ex.c:#define NO_IRQ 0
> > drivers/ata/sata_dwc_460ex.c:   if (hsdev->dma->irq == NO_IRQ) {
> > drivers/ata/sata_dwc_460ex.c:   if (irq == NO_IRQ) {
> > drivers/iommu/fsl_pamu.c:   if (irq == NO_IRQ) {
> > drivers/iommu/fsl_pamu.c:   if (irq != NO_IRQ)
> > drivers/media/platform/fsl-viu.c:   if (viu_irq == NO_IRQ) {
> > drivers/mtd/nand/mpc5121_nfc.c: if (prv->irq == NO_IRQ) {
> > drivers/pcmcia/electra_cf.c:cf->irq = NO_IRQ;
> > drivers/pcmcia/electra_cf.c:if (cf->irq != NO_IRQ)
> > drivers/soc/fsl/qe/qe_ic.c:/* Return an interrupt vector or NO_IRQ if
> > no interrupt is pending. */
> > drivers/soc/fsl/qe/qe_ic.c: return NO_IRQ;
> > drivers/soc/fsl/qe/qe_ic.c:/* Return an interrupt vector or NO_IRQ if
> > no interrupt is pending. */
> > drivers/soc/fsl/qe/qe_ic.c: return NO_IRQ;
> > drivers/soc/fsl/qe/qe_ic.c: if (qe_ic->virq_low == NO_IRQ) {
> > drivers/soc/fsl/qe/qe_ic.c: if (qe_ic->virq_high != NO_IRQ &&
> > drivers/spi/spi-mpc52xx.c:  if (status && (irq != NO_IRQ))
> > drivers/tty/ehv_bytechan.c: if (stdout_irq == NO_IRQ) {
> > drivers/tty/ehv_bytechan.c: if ((bc->rx_irq == NO_IRQ) ||
> > (bc->tx_irq == NO_IRQ)) {
> > drivers/tty/serial/cpm_uart/cpm_uart_core.c:if (pinfo->port.irq ==
> > NO_IRQ) {
> > drivers/uio/uio_fsl_elbc_gpcm.c:if (irq != NO_IRQ) {
> > drivers/uio/uio_fsl_elbc_gpcm.c:irq = NO_IRQ;
> > drivers/uio/uio_fsl_elbc_gpcm.c: irq != NO_IRQ ? irq :
> > -1);
> > drivers/usb/host/ehci-grlib.c:  if (irq == NO_IRQ) {
> > drivers/usb/host/ehci-ppc-of.c: if (irq == NO_IRQ) {
> > drivers/usb/host/fhci-hcd.c:if (usb_irq == NO_IRQ) {
> > drivers/usb/host/ohci-ppc-of.c: if (irq == NO_IRQ) {
> > drivers/usb/host/uhci-grlib.c:  if (irq == NO_IRQ) {
> > drivers/video/fbdev/mb862xx/mb862xxfbdrv.c: if (par->irq == NO_IRQ) {
> > drivers/virt/fsl_hypervisor.c:  if (!handle || (irq == NO_IRQ)) {
> > include/soc/fsl/qe/qe_ic.h: if (cascade_irq != NO_IRQ)
> > include/soc/fsl/qe/qe_ic.h: if (cascade_irq != NO_IRQ)
> > include/soc/fsl/qe/qe_ic.h: if (cascade_irq != NO_IRQ)
> > include/soc/fsl/qe/qe_ic.h: if (cascade_irq != NO_IRQ)
> > include/soc/fsl/qe/qe_ic.h: if (cascade_irq == NO_IRQ)
> > include/soc/fsl/qe/qe_ic.h: if (cascade_irq != NO_IRQ)
> > 
> > Did you have other pending patches for those?
> No. I stayed away from anything FSL related as I was under the
> impression some of them were being ported to arch/arm, which uses -1 for
> NO_IRQ IIUIC.
> 
> eg. all of include/soc/fsl and drivers/soc/fsl was moved from
> arch/powerpc in commit 7aa1aa6ecec2, which said:
> 
> QE: Move QE from arch/powerpc to drivers/soc
> 
> ls1 has qe and ls1 has arm cpu.
> move qe from arch/powerpc to drivers/soc/fsl
> to adapt to powerpc and arm
> 
> But looking at the Kconfigs it looks like they're still only selectable
> on PPC. So that's a bit annoying.
> 
> I'll do patches for everything above that's not drivers/soc or
> include/soc and hopefully we can hear from someone at NXP on the plans
> for getting the soc parts enabled on arm.

qe_ic is handled by https://lkml.org/lkml/2017/3/13/1234

-Scott



Applied "ASoC: fsl: constify snd_soc_ops structures" to the asoc tree

2017-03-15 Thread Mark Brown
The patch

   ASoC: fsl: constify snd_soc_ops structures

has been applied to the asoc tree at

   git://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound.git 

All being well this means that it will be integrated into the linux-next
tree (usually sometime in the next 24 hours) and sent to Linus during
the next merge window (or sooner if it is a bug fix), however if
problems are discovered then the patch may be dropped or reverted.  

You may get further e-mails resulting from automated or manual testing
and review of the tree, please engage with people reporting problems and
send followup patches addressing any issues that are reported if needed.

If any updates are required or you are submitting further changes they
should be sent as incremental updates against current git, existing
patches will not be replaced.

Please add any relevant lists and maintainers to the CCs when replying
to this mail.

Thanks,
Mark

>From 5ace37bd7947e28dec5559a57ddc6e1d997dbec5 Mon Sep 17 00:00:00 2001
From: Bhumika Goyal 
Date: Tue, 14 Mar 2017 00:42:22 +0530
Subject: [PATCH] ASoC: fsl: constify snd_soc_ops structures

Declare snd_soc_ops structures as const as they are only stored
in the ops field of a snd_soc_dai_link structure. This field is
of type const, so snd_soc_ops structures having this property
can be made const too.

The following .o files did not compile:
sound/soc/fsl/{p1022_rdk.c/p1022_ds.c/mpc8610_hpcd.c}

Signed-off-by: Bhumika Goyal 
Signed-off-by: Mark Brown 
---
 sound/soc/fsl/eukrea-tlv320.c   | 2 +-
 sound/soc/fsl/imx-mc13783.c | 2 +-
 sound/soc/fsl/mpc8610_hpcd.c| 2 +-
 sound/soc/fsl/mx27vis-aic32x4.c | 2 +-
 sound/soc/fsl/p1022_ds.c| 2 +-
 sound/soc/fsl/p1022_rdk.c   | 2 +-
 sound/soc/fsl/phycore-ac97.c| 2 +-
 sound/soc/fsl/wm1133-ev1.c  | 2 +-
 8 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/sound/soc/fsl/eukrea-tlv320.c b/sound/soc/fsl/eukrea-tlv320.c
index 883087f2b092..84ef6385736c 100644
--- a/sound/soc/fsl/eukrea-tlv320.c
+++ b/sound/soc/fsl/eukrea-tlv320.c
@@ -64,7 +64,7 @@ static int eukrea_tlv320_hw_params(struct snd_pcm_substream 
*substream,
return 0;
 }
 
-static struct snd_soc_ops eukrea_tlv320_snd_ops = {
+static const struct snd_soc_ops eukrea_tlv320_snd_ops = {
.hw_params  = eukrea_tlv320_hw_params,
 };
 
diff --git a/sound/soc/fsl/imx-mc13783.c b/sound/soc/fsl/imx-mc13783.c
index bb0459018b45..9d19b808f634 100644
--- a/sound/soc/fsl/imx-mc13783.c
+++ b/sound/soc/fsl/imx-mc13783.c
@@ -48,7 +48,7 @@ static int imx_mc13783_hifi_hw_params(struct 
snd_pcm_substream *substream,
return snd_soc_dai_set_tdm_slot(cpu_dai, 0x3, 0x3, 2, 16);
 }
 
-static struct snd_soc_ops imx_mc13783_hifi_ops = {
+static const struct snd_soc_ops imx_mc13783_hifi_ops = {
.hw_params = imx_mc13783_hifi_hw_params,
 };
 
diff --git a/sound/soc/fsl/mpc8610_hpcd.c b/sound/soc/fsl/mpc8610_hpcd.c
index ddf49f30b23f..a639b52c16f6 100644
--- a/sound/soc/fsl/mpc8610_hpcd.c
+++ b/sound/soc/fsl/mpc8610_hpcd.c
@@ -174,7 +174,7 @@ static int mpc8610_hpcd_machine_remove(struct snd_soc_card 
*card)
 /**
  * mpc8610_hpcd_ops: ASoC machine driver operations
  */
-static struct snd_soc_ops mpc8610_hpcd_ops = {
+static const struct snd_soc_ops mpc8610_hpcd_ops = {
.startup = mpc8610_hpcd_startup,
 };
 
diff --git a/sound/soc/fsl/mx27vis-aic32x4.c b/sound/soc/fsl/mx27vis-aic32x4.c
index 198eeb3f3f7a..d7ec3d20065c 100644
--- a/sound/soc/fsl/mx27vis-aic32x4.c
+++ b/sound/soc/fsl/mx27vis-aic32x4.c
@@ -73,7 +73,7 @@ static int mx27vis_aic32x4_hw_params(struct snd_pcm_substream 
*substream,
return 0;
 }
 
-static struct snd_soc_ops mx27vis_aic32x4_snd_ops = {
+static const struct snd_soc_ops mx27vis_aic32x4_snd_ops = {
.hw_params  = mx27vis_aic32x4_hw_params,
 };
 
diff --git a/sound/soc/fsl/p1022_ds.c b/sound/soc/fsl/p1022_ds.c
index a1f780ecadf5..41c623c55c16 100644
--- a/sound/soc/fsl/p1022_ds.c
+++ b/sound/soc/fsl/p1022_ds.c
@@ -184,7 +184,7 @@ static int p1022_ds_machine_remove(struct snd_soc_card 
*card)
 /**
  * p1022_ds_ops: ASoC machine driver operations
  */
-static struct snd_soc_ops p1022_ds_ops = {
+static const struct snd_soc_ops p1022_ds_ops = {
.startup = p1022_ds_startup,
 };
 
diff --git a/sound/soc/fsl/p1022_rdk.c b/sound/soc/fsl/p1022_rdk.c
index d4d88a8cb9c0..4afbdd610bfa 100644
--- a/sound/soc/fsl/p1022_rdk.c
+++ b/sound/soc/fsl/p1022_rdk.c
@@ -188,7 +188,7 @@ static int p1022_rdk_machine_remove(struct snd_soc_card 
*card)
 /**
  * p1022_rdk_ops: ASoC machine driver operations
  */
-static struct snd_soc_ops p1022_rdk_ops = {
+static const struct snd_soc_ops p1022_rdk_ops = {
.startup = p1022_rdk_startup,
 };
 
diff --git a/sound/soc/fsl/phycore-ac97.c b/sound/soc/fsl/phycore-ac97.c
index ae403c29688f..66fb6c4614d2 100644
--- a/sound/soc/fsl/phycore-ac97.c
+++ b/sound/soc/fsl/phycore-ac97.c
@@ -23,7 +23,7 @@
 
 static struct 

Re: ioctl structs differ from x86_64?

2017-03-15 Thread Reza Arbab

On Tue, Mar 14, 2017 at 10:37:44AM +, Harshal Patil wrote:
Our guess is the ioctls in ppc64le differ than x86_64, and thats why 
the code which is clearing onclr bit 
([4]https://github.com/opencontainers/runc/commit/eea28f480db435dbef4a275de9776b9934818b8c#diff-5f5c07d0cab3ce2086437d3d43c0d25fR164)

is failing on ppc64le but works fine on x86_64. 
 
Any pointers on the possible solution would be really helpful. 


This looks like a bug in Go.

The syscall.TCGETS and syscall.TCSETS constants have the wrong values. 
They were generated using the glibc termios struct instead of the kernel 
termios struct. It's the issue described here:


https://groups.google.com/forum/#!topic/golang-nuts/K5NoG8slez0

Things work if you replace syscall.TCGETS with 0x402c7413 and 
syscall.TCSETS with 0x802c7414, the correct values on ppc64le.


--
Reza Arbab



[PATCH v3 4/7] macintosh: Only descend into directory when CONFIG_MACINTOSH_DRIVERS is set

2017-03-15 Thread Andrew F. Davis
When CONFIG_MACINTOSH_DRIVERS is not set make will still descend into the
macintosh directory but nothing will be built. This produces unneeded
build artifacts and messages in addition to slowing the build.
Fix this here.

Signed-off-by: Andrew F. Davis 
---
 drivers/Makefile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/Makefile b/drivers/Makefile
index 9cf52524ecab..cd92491bd76b 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -78,7 +78,7 @@ obj-$(CONFIG_LIBNVDIMM)   += nvdimm/
 obj-$(CONFIG_DEV_DAX)  += dax/
 obj-$(CONFIG_DMA_SHARED_BUFFER) += dma-buf/
 obj-$(CONFIG_NUBUS)+= nubus/
-obj-y  += macintosh/
+obj-$(CONFIG_MACINTOSH_DRIVERS)+= macintosh/
 obj-$(CONFIG_IDE)  += ide/
 obj-$(CONFIG_SCSI) += scsi/
 obj-y  += nvme/
-- 
2.11.0



[PATCH v3 6/7] lguest: Only descend into lguest directory when CONFIG_LGUEST is set

2017-03-15 Thread Andrew F. Davis
When CONFIG_LGUEST is not set make will still descend into the lguest
directory but nothing will be built. This produces unneeded build
artifacts and messages in addition to slowing the build. Fix this here.

Signed-off-by: Andrew F. Davis 
---
 drivers/Makefile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/Makefile b/drivers/Makefile
index 8a9ed5c59778..3d758020d248 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -126,7 +126,7 @@ obj-$(CONFIG_ACCESSIBILITY) += accessibility/
 obj-$(CONFIG_ISDN) += isdn/
 obj-$(CONFIG_EDAC) += edac/
 obj-$(CONFIG_EISA) += eisa/
-obj-y  += lguest/
+obj-$(CONFIG_LGUEST)   += lguest/
 obj-$(CONFIG_CPU_FREQ) += cpufreq/
 obj-$(CONFIG_CPU_IDLE) += cpuidle/
 obj-y  += mmc/
-- 
2.11.0



[PATCH v3 0/7] Remove unneeded build directory traversals

2017-03-15 Thread Andrew F. Davis
Hello all,

I was building a kernel for x86 and noticed Make still descended into
directories like drivers/gpu/drm/hisilicon, this seems kind of odd given
nothing will be built here. It looks to be due to some directories being
included in obj-y unconditionally instead of only when the relevant
CONFIG_ is set.

These patches are split by subsystem in-case, for some reason, a file in
a directory does need to be built, I believe I have checked for all
instances of this, but a quick review from some maintainers would be nice.

Thanks,
Andrew

Changes from v2:
 - Removed patches that would not work
 - Rebased on v4.11-rc1 (no changes needed)

Changes from v1:
 - Removed patches already taken by maintainers
 - Rebased on v4.10-rc1 (no changes needed)

Andrew F. Davis (7):
  pwm: Only descend into pwm directory when CONFIG_PWM is set
  amba: Only descend into amba directory when CONFIG_ARM_AMBA is set
  NFC: Only descend into nfc directory when CONFIG_NFC is set
  macintosh: Only descend into directory when CONFIG_MACINTOSH_DRIVERS
is set
  auxdisplay: Only descend into directory when CONFIG_AUXDISPLAY is set
  lguest: Only descend into lguest directory when CONFIG_LGUEST is set
  mmc: Only descend into mmc directory when CONFIG_MMC is set

 drivers/Makefile | 18 +++---
 1 file changed, 11 insertions(+), 7 deletions(-)

-- 
2.11.0



[PATCH v3 7/7] mmc: Only descend into mmc directory when CONFIG_MMC is set

2017-03-15 Thread Andrew F. Davis
When CONFIG_MMC is not set make will still descend into the mmc
directory but nothing will be built. This produces unneeded build
artifacts and messages in addition to slowing the build. Fix this here.

Signed-off-by: Andrew F. Davis 
---
 drivers/Makefile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/Makefile b/drivers/Makefile
index 3d758020d248..03da7a38e989 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -129,7 +129,7 @@ obj-$(CONFIG_EISA)  += eisa/
 obj-$(CONFIG_LGUEST)   += lguest/
 obj-$(CONFIG_CPU_FREQ) += cpufreq/
 obj-$(CONFIG_CPU_IDLE) += cpuidle/
-obj-y  += mmc/
+obj-$(CONFIG_MMC)  += mmc/
 obj-$(CONFIG_MEMSTICK) += memstick/
 obj-$(CONFIG_NEW_LEDS) += leds/
 obj-$(CONFIG_INFINIBAND)   += infiniband/
-- 
2.11.0



[PATCH v3 3/7] NFC: Only descend into nfc directory when CONFIG_NFC is set

2017-03-15 Thread Andrew F. Davis
When CONFIG_NFC is not set make will still descend into the nfc
directory but nothing will be built. This produces unneeded build
artifacts and messages in addition to slowing the build. Fix this here.

Signed-off-by: Andrew F. Davis 
---
 drivers/Makefile | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/Makefile b/drivers/Makefile
index 23712a92a89a..9cf52524ecab 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -69,7 +69,11 @@ obj-$(CONFIG_FB_INTEL)  += video/fbdev/intelfb/
 
 obj-$(CONFIG_PARPORT)  += parport/
 obj-$(CONFIG_NVM)  += lightnvm/
-obj-y  += base/ block/ misc/ mfd/ nfc/
+obj-y  += base/
+obj-y  += block/
+obj-y  += misc/
+obj-y  += mfd/
+obj-$(CONFIG_NFC)  += nfc/
 obj-$(CONFIG_LIBNVDIMM)+= nvdimm/
 obj-$(CONFIG_DEV_DAX)  += dax/
 obj-$(CONFIG_DMA_SHARED_BUFFER) += dma-buf/
-- 
2.11.0



[PATCH v3 2/7] amba: Only descend into amba directory when CONFIG_ARM_AMBA is set

2017-03-15 Thread Andrew F. Davis
When CONFIG_ARM_AMBA is not set make will still descend into the amba
directory but nothing will be built. This produces unneeded build
artifacts and messages in addition to slowing the build. Fix this here.

Signed-off-by: Andrew F. Davis 
---
 drivers/Makefile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/Makefile b/drivers/Makefile
index 00d86749a843..23712a92a89a 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -31,7 +31,7 @@ obj-$(CONFIG_SFI) += sfi/
 # PnP must come after ACPI since it will eventually need to check if acpi
 # was used and do nothing if so
 obj-$(CONFIG_PNP)  += pnp/
-obj-y  += amba/
+obj-$(CONFIG_ARM_AMBA) += amba/
 
 obj-y  += clk/
 # Many drivers will want to use DMA so this has to be made available
-- 
2.11.0



[PATCH v3 1/7] pwm: Only descend into pwm directory when CONFIG_PWM is set

2017-03-15 Thread Andrew F. Davis
When CONFIG_PWM is not set make will still descend into the pwm
directory but nothing will be built. This produces unneeded build
artifacts and messages in addition to slowing the build. Fix this here.

Signed-off-by: Andrew F. Davis 
---
 drivers/Makefile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/Makefile b/drivers/Makefile
index 2eced9afba53..00d86749a843 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -13,7 +13,7 @@ obj-$(CONFIG_GENERIC_PHY) += phy/
 # GPIO must come after pinctrl as gpios may need to mux pins etc
 obj-$(CONFIG_PINCTRL)  += pinctrl/
 obj-$(CONFIG_GPIOLIB)  += gpio/
-obj-y  += pwm/
+obj-$(CONFIG_PWM)  += pwm/
 obj-$(CONFIG_PCI)  += pci/
 # PCI dwc controller drivers
 obj-y  += pci/dwc/
-- 
2.11.0



[PATCH v3 5/7] auxdisplay: Only descend into directory when CONFIG_AUXDISPLAY is set

2017-03-15 Thread Andrew F. Davis
When CONFIG_AUXDISPLAY is not set make will still descend into the
auxdisplay directory but nothing will be built. This produces unneeded
build artifacts and messages in addition to slowing the build.
Fix this here.

Signed-off-by: Andrew F. Davis 
---
 drivers/Makefile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/Makefile b/drivers/Makefile
index cd92491bd76b..8a9ed5c59778 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -95,7 +95,7 @@ obj-y += firewire/
 obj-$(CONFIG_UIO)  += uio/
 obj-$(CONFIG_VFIO) += vfio/
 obj-y  += cdrom/
-obj-y  += auxdisplay/
+obj-$(CONFIG_AUXDISPLAY)   += auxdisplay/
 obj-$(CONFIG_PCCARD)   += pcmcia/
 obj-$(CONFIG_DIO)  += dio/
 obj-$(CONFIG_SBUS) += sbus/
-- 
2.11.0



Re: [PATCH kernel v8 10/10] KVM: PPC: VFIO: Add in-kernel acceleration for VFIO

2017-03-15 Thread Alex Williamson
On Thu, 16 Mar 2017 00:21:07 +1100
Alexey Kardashevskiy  wrote:

> On 15/03/17 08:05, Alex Williamson wrote:
> > On Fri, 10 Mar 2017 14:53:37 +1100
> > Alexey Kardashevskiy  wrote:
> >   
> >> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
> >> and H_STUFF_TCE requests targeted an IOMMU TCE table used for VFIO
> >> without passing them to user space which saves time on switching
> >> to user space and back.
> >>
> >> This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
> >> KVM tries to handle a TCE request in the real mode, if failed
> >> it passes the request to the virtual mode to complete the operation.
> >> If it a virtual mode handler fails, the request is passed to
> >> the user space; this is not expected to happen though.
> >>
> >> To avoid dealing with page use counters (which is tricky in real mode),
> >> this only accelerates SPAPR TCE IOMMU v2 clients which are required
> >> to pre-register the userspace memory. The very first TCE request will
> >> be handled in the VFIO SPAPR TCE driver anyway as the userspace view
> >> of the TCE table (iommu_table::it_userspace) is not allocated till
> >> the very first mapping happens and we cannot call vmalloc in real mode.
> >>
> >> If we fail to update a hardware IOMMU table unexpected reason, we just
> >> clear it and move on as there is nothing really we can do about it -
> >> for example, if we hot plug a VFIO device to a guest, existing TCE tables
> >> will be mirrored automatically to the hardware and there is no interface
> >> to report to the guest about possible failures.
> >>
> >> This adds new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to
> >> the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd
> >> and associates a physical IOMMU table with the SPAPR TCE table (which
> >> is a guest view of the hardware IOMMU table). The iommu_table object
> >> is cached and referenced so we do not have to look up for it in real mode.
> >>
> >> This does not implement the UNSET counterpart as there is no use for it -
> >> once the acceleration is enabled, the existing userspace won't
> >> disable it unless a VFIO container is destroyed; this adds necessary
> >> cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.
> >>
> >> As this creates a descriptor per IOMMU table-LIOBN couple (called
> >> kvmppc_spapr_tce_iommu_table), it is possible to have several
> >> descriptors with the same iommu_table (hardware IOMMU table) attached
> >> to the same LIOBN; we do not remove duplicates though as
> >> iommu_table_ops::exchange not just update a TCE entry (which is
> >> shared among IOMMU groups) but also invalidates the TCE cache
> >> (one per IOMMU group).
> >>
> >> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
> >> space.
> >>
> >> This adds real mode version of WARN_ON_ONCE() as the generic version
> >> causes problems with rcu_sched. Since we testing what vmalloc_to_phys()
> >> returns in the code, this also adds a check for already existing
> >> vmalloc_to_phys() call in kvmppc_rm_h_put_tce_indirect().
> >>
> >> This finally makes use of vfio_external_user_iommu_id() which was
> >> introduced quite some time ago and was considered for removal.
> >>
> >> Tests show that this patch increases transmission speed from 220MB/s
> >> to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card).
> >>
> >> Signed-off-by: Alexey Kardashevskiy 
> >> ---
> >> Changes:
> >> v8:
> >> * changed all (!pua) checks to return H_TOO_HARD as ioctl() is supposed
> >> to handle them
> >> * changed vmalloc_to_phys() callers to return H_HARDWARE
> >> * changed real mode iommu_tce_xchg_rm() callers to return H_TOO_HARD
> >> and added a comment about this in the code
> >> * changed virtual mode iommu_tce_xchg() callers to return H_HARDWARE
> >> and do WARN_ON
> >> * added WARN_ON_ONCE_RM(!rmap) in kvmppc_rm_h_put_tce_indirect() to
> >> have all vmalloc_to_phys() callsites covered
> >>
> >> v7:
> >> * added realmode-friendly WARN_ON_ONCE_RM
> >>
> >> v6:
> >> * changed handling of errors returned by kvmppc_(rm_)tce_iommu_(un)map()
> >> * moved kvmppc_gpa_to_ua() to TCE validation
> >>
> >> v5:
> >> * changed error codes in multiple places
> >> * added bunch of WARN_ON() in places which should not really happen
> >> * adde a check that an iommu table is not attached already to LIOBN
> >> * dropped explicit calls to iommu_tce_clear_param_check/
> >> iommu_tce_put_param_check as kvmppc_tce_validate/kvmppc_ioba_validate
> >> call them anyway (since the previous patch)
> >> * if we fail to update a hardware IOMMU table for unexpected reason,
> >> this just clears the entry
> >>
> >> v4:
> >> * added note to the commit log about allowing multiple updates of
> >> the same IOMMU table;
> >> * instead of checking for if any memory was preregistered, this
> >> returns H_TOO_HARD if a specific page was not;
> >> * fixed comments from v3 about error handling in many places;

Re: [PATCH kernel v8 10/10] KVM: PPC: VFIO: Add in-kernel acceleration for VFIO

2017-03-15 Thread Alex Williamson
On Wed, 15 Mar 2017 15:40:14 +1100
David Gibson  wrote:
> > > diff --git a/arch/powerpc/kvm/book3s_64_vio.c 
> > > b/arch/powerpc/kvm/book3s_64_vio.c
> > > index e96a4590464c..be18cda01e1b 100644
> > > --- a/arch/powerpc/kvm/book3s_64_vio.c
> > > +++ b/arch/powerpc/kvm/book3s_64_vio.c
> > > @@ -28,6 +28,10 @@
> > >  #include 
> > >  #include 
> > >  #include 
> > > +#include 
> > > +#include 
> > > +#include 
> > > +#include 
> > >  
> > >  #include 
> > >  #include 
> > > @@ -40,6 +44,36 @@
> > >  #include 
> > >  #include 
> > >  #include 
> > > +#include 
> > > +
> > > +static void kvm_vfio_group_put_external_user(struct vfio_group 
> > > *vfio_group)
> > > +{
> > > + void (*fn)(struct vfio_group *);
> > > +
> > > + fn = symbol_get(vfio_group_put_external_user);
> > > + if (WARN_ON(!fn))
> > > + return;
> > > +
> > > + fn(vfio_group);
> > > +
> > > + symbol_put(vfio_group_put_external_user);
> > > +}
> > > +
> > > +static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group)
> > > +{
> > > + int (*fn)(struct vfio_group *);
> > > + int ret = -1;
> > > +
> > > + fn = symbol_get(vfio_external_user_iommu_id);
> > > + if (!fn)
> > > + return ret;
> > > +
> > > + ret = fn(vfio_group);
> > > +
> > > + symbol_put(vfio_external_user_iommu_id);
> > > +
> > > + return ret;
> > > +}  
> > 
> > 
> > Ugh.  This feels so wrong.  Why can't you have kvm-vfio pass the
> > iommu_group?  Why do you need to hold this additional vfio_group
> > reference?  
> 
> Keeping the vfio_group reference makes sense to me, since we don't
> want the vfio context for the group to go away while it's attached to
> the LIOBN.

But there's already a reference for that, it's taken by
KVM_DEV_VFIO_GROUP_ADD and held until KVM_DEV_VFIO_GROUP_DEL.  Both the
DEL path and the cleanup path call kvm_spapr_tce_release_iommu_group()
before releasing that reference, so it seems entirely redundant.

> However, going via the iommu_id rather than just having an interface
> to directly grab the iommu group from the vfio_group seems bizarre to
> me.  I'm ok with cleaning that up later, however.

We have kvm_spapr_tce_attach_iommu_group() and
kvm_spapr_tce_release_iommu_group(), but both take a vfio_group, not an
iommu_group as a parameter.  I don't particularly have a problem with
the vfio_group -> iommu ID -> iommu_group, but if we drop the extra
vfio_group reference and pass the iommu_group itself to these functions
then we can keep all the symbol reference stuff in the kvm-vfio glue
layer.  Thanks,

Alex


RE: [PATCH 1/4] crypto: powerpc - Factor out the core CRC vpmsum algorithm

2017-03-15 Thread David Laight
From: Linuxppc-dev Daniel Axtens
> Sent: 15 March 2017 12:38
> The core nuts and bolts of the crc32c vpmsum algorithm will
> also work for a number of other CRC algorithms with different
> polynomials. Factor out the function into a new asm file.
> 
> To handle multiple users of the function, a user simply
> provides constants, defines the name of their CRC function,
> and then #includes the core algorithm file.
...

While not part of this change, the unrolled loops look as though
they just destroy the cpu cache.
I'd like be convinced that anything does CRC over long enough buffers
to make it a gain at all.

With modern (not that modern now) superscalar cpus you can often
get the loop instructions 'for free'.
Sometimes pipelining the loop is needed to get full throughput.
Unlike the IP checksum, you don't even have to 'loop carry' the
cpu carry flag.

David



[PATCH] powerpc/pasemi, cbe: Do not process decremeter or external wakeup from powersave

2017-03-15 Thread Nicholas Piggin
Hi,

I would like to start using a dedicated stack for system reset interrupt
and treat it as a Linux nmi, which makes it tricky to call complex
interrupt handlers directly from the system reset trap handler.

So I would like to remove the decrementer and external handler calls from
Cell and Pasemi platforms' system reset handler. I think we can just
remove them if they can be handled when they re-fire as normal interrupts?
At the moment I don't have environments set up to test if this works.

This will make the powersave wakeup a bit slower. We could add alternate
code to IDLETEST here to branch out to powersave wakeup handler like
powernv does. I haven't got a patch for that yet but I could help write
one if it is important and it can be tested.

Thanks,
Nick
---
 arch/powerpc/platforms/cell/pervasive.c | 11 +++
 arch/powerpc/platforms/pasemi/idle.c| 11 +++
 2 files changed, 14 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/platforms/cell/pervasive.c 
b/arch/powerpc/platforms/cell/pervasive.c
index e7d075077cb0..a88944db9fc3 100644
--- a/arch/powerpc/platforms/cell/pervasive.c
+++ b/arch/powerpc/platforms/cell/pervasive.c
@@ -88,11 +88,14 @@ static void cbe_power_save(void)
 static int cbe_system_reset_exception(struct pt_regs *regs)
 {
switch (regs->msr & SRR1_WAKEMASK) {
-   case SRR1_WAKEEE:
-   do_IRQ(regs);
-   break;
case SRR1_WAKEDEC:
-   timer_interrupt(regs);
+   set_dec(1);
+   case SRR1_WAKEEE:
+   /*
+* Handle these when interrupts get re-enabled and we take
+* them as regular exceptions. We are in an NMI context
+* and can't handle these here.
+*/
break;
case SRR1_WAKEMT:
return cbe_sysreset_hack();
diff --git a/arch/powerpc/platforms/pasemi/idle.c 
b/arch/powerpc/platforms/pasemi/idle.c
index 75b296bc51af..44e0d9226f0a 100644
--- a/arch/powerpc/platforms/pasemi/idle.c
+++ b/arch/powerpc/platforms/pasemi/idle.c
@@ -53,11 +53,14 @@ static int pasemi_system_reset_exception(struct pt_regs 
*regs)
regs->nip = regs->link;
 
switch (regs->msr & SRR1_WAKEMASK) {
-   case SRR1_WAKEEE:
-   do_IRQ(regs);
-   break;
case SRR1_WAKEDEC:
-   timer_interrupt(regs);
+   set_dec(1);
+   case SRR1_WAKEEE:
+   /*
+* Handle these when interrupts get re-enabled and we take
+* them as regular exceptions. We are in an NMI context
+* and can't handle these here.
+*/
break;
default:
/* do system reset */
-- 
2.11.0



Re: ioctl structs differ from x86_64?

2017-03-15 Thread Arnd Bergmann
On Tue, Mar 14, 2017 at 11:37 AM, Harshal Patil
 wrote:
> Hello,
>
> I am looking into a bug,
> https://bugzilla.linux.ibm.com/show_bug.cgi?id=152493 ( external mirror is
> at, https://github.com/opencontainers/runc/issues/1364)
>
> Recently in runc code, they added this code
> https://github.com/opencontainers/runc/commit/eea28f480db435dbef4a275de9776b9934818b8c#diff-5f5c07d0cab3ce2086437d3d43c0d25fR155.
> As you can see they set -onlcr to get rid of \r (line no. 164). Golang, in
> which runc is written, doesn't have any bindings for ioctls.  This means you
> have to invoke C code directly (that's what they are doing there).
>
> Our guess is the ioctls in ppc64le differ than x86_64, and thats why the
> code which is clearing onclr bit
> (https://github.com/opencontainers/runc/commit/eea28f480db435dbef4a275de9776b9934818b8c#diff-5f5c07d0cab3ce2086437d3d43c0d25fR164)
> is failing on ppc64le but works fine on x86_64.
>
> Any pointers on the possible solution would be really helpful.

There are a couple of reasons for ioctl numbers to differ:

- like this one, the command number may be defined in asm/ioctls.h.
Most architectures
  in the kernel are the same, some (arm, blackfin, frv, m68k, and
s390) only differ
  in FIOQSIZE, others (alpha, mips, parisc, powerpc, sh, sparc, and
xtensa) redefine
  all the traditional file (FIO*) and tty (TC*, TIO*) commands

- command numbers that are defined in terms of structure sizes depend on the
  architectures type definitions (e.g. long, off_t, uid_t, ...) that
can differ in both
  size and alignment

- alpha, mips, powerpc and sparc use _IOC_SIZEBITS==13, everything else uses
  _IOC_SIZEBITS==14, and that in turn means that the majority of ioctl commands
  are different between those four and the rest.

  Arnd


Re: [PATCH v3 1/2] powerpc: split ftrace bits into a separate file

2017-03-15 Thread Naveen N. Rao
On 2017/03/15 09:11AM, Steven Rostedt wrote:
> On Wed, 15 Mar 2017 14:35:16 +0530
> "Naveen N. Rao"  wrote:
> 
> > I don't have a strong opinion about this, but I feel that x86 can simply 
> > use ftrace_64.S, seeing as the current name is mcount_64.S.
> > 
> > Other architectures could do something similar too, or fall back to 
> > ftrace_hook.S. That way, all ftrace low-level code can simply be 
> > referred to as arch/*/ftrace_*.S
> 
> Just to clarify, I'm currently working on patches to clean up the
> ftrace.S code in x86, and I renamed mcount_64.S to ftrace_64.S. I'm
> also moving the ftrace code out of entry_32.S into a ftrace_32.S file
> as well. Patches will hopefully be posted soon.

Good to know, thanks for clarifying.

In light of that, I hope that you're ok with the changes proposed to the 
ftrace bits under arch/powerpc in this patchset?

- Naveen



Re: [PATCH kernel v8 10/10] KVM: PPC: VFIO: Add in-kernel acceleration for VFIO

2017-03-15 Thread Alexey Kardashevskiy
On 15/03/17 08:05, Alex Williamson wrote:
> On Fri, 10 Mar 2017 14:53:37 +1100
> Alexey Kardashevskiy  wrote:
> 
>> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
>> and H_STUFF_TCE requests targeted an IOMMU TCE table used for VFIO
>> without passing them to user space which saves time on switching
>> to user space and back.
>>
>> This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
>> KVM tries to handle a TCE request in the real mode, if failed
>> it passes the request to the virtual mode to complete the operation.
>> If it a virtual mode handler fails, the request is passed to
>> the user space; this is not expected to happen though.
>>
>> To avoid dealing with page use counters (which is tricky in real mode),
>> this only accelerates SPAPR TCE IOMMU v2 clients which are required
>> to pre-register the userspace memory. The very first TCE request will
>> be handled in the VFIO SPAPR TCE driver anyway as the userspace view
>> of the TCE table (iommu_table::it_userspace) is not allocated till
>> the very first mapping happens and we cannot call vmalloc in real mode.
>>
>> If we fail to update a hardware IOMMU table unexpected reason, we just
>> clear it and move on as there is nothing really we can do about it -
>> for example, if we hot plug a VFIO device to a guest, existing TCE tables
>> will be mirrored automatically to the hardware and there is no interface
>> to report to the guest about possible failures.
>>
>> This adds new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to
>> the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd
>> and associates a physical IOMMU table with the SPAPR TCE table (which
>> is a guest view of the hardware IOMMU table). The iommu_table object
>> is cached and referenced so we do not have to look up for it in real mode.
>>
>> This does not implement the UNSET counterpart as there is no use for it -
>> once the acceleration is enabled, the existing userspace won't
>> disable it unless a VFIO container is destroyed; this adds necessary
>> cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.
>>
>> As this creates a descriptor per IOMMU table-LIOBN couple (called
>> kvmppc_spapr_tce_iommu_table), it is possible to have several
>> descriptors with the same iommu_table (hardware IOMMU table) attached
>> to the same LIOBN; we do not remove duplicates though as
>> iommu_table_ops::exchange not just update a TCE entry (which is
>> shared among IOMMU groups) but also invalidates the TCE cache
>> (one per IOMMU group).
>>
>> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
>> space.
>>
>> This adds real mode version of WARN_ON_ONCE() as the generic version
>> causes problems with rcu_sched. Since we testing what vmalloc_to_phys()
>> returns in the code, this also adds a check for already existing
>> vmalloc_to_phys() call in kvmppc_rm_h_put_tce_indirect().
>>
>> This finally makes use of vfio_external_user_iommu_id() which was
>> introduced quite some time ago and was considered for removal.
>>
>> Tests show that this patch increases transmission speed from 220MB/s
>> to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card).
>>
>> Signed-off-by: Alexey Kardashevskiy 
>> ---
>> Changes:
>> v8:
>> * changed all (!pua) checks to return H_TOO_HARD as ioctl() is supposed
>> to handle them
>> * changed vmalloc_to_phys() callers to return H_HARDWARE
>> * changed real mode iommu_tce_xchg_rm() callers to return H_TOO_HARD
>> and added a comment about this in the code
>> * changed virtual mode iommu_tce_xchg() callers to return H_HARDWARE
>> and do WARN_ON
>> * added WARN_ON_ONCE_RM(!rmap) in kvmppc_rm_h_put_tce_indirect() to
>> have all vmalloc_to_phys() callsites covered
>>
>> v7:
>> * added realmode-friendly WARN_ON_ONCE_RM
>>
>> v6:
>> * changed handling of errors returned by kvmppc_(rm_)tce_iommu_(un)map()
>> * moved kvmppc_gpa_to_ua() to TCE validation
>>
>> v5:
>> * changed error codes in multiple places
>> * added bunch of WARN_ON() in places which should not really happen
>> * adde a check that an iommu table is not attached already to LIOBN
>> * dropped explicit calls to iommu_tce_clear_param_check/
>> iommu_tce_put_param_check as kvmppc_tce_validate/kvmppc_ioba_validate
>> call them anyway (since the previous patch)
>> * if we fail to update a hardware IOMMU table for unexpected reason,
>> this just clears the entry
>>
>> v4:
>> * added note to the commit log about allowing multiple updates of
>> the same IOMMU table;
>> * instead of checking for if any memory was preregistered, this
>> returns H_TOO_HARD if a specific page was not;
>> * fixed comments from v3 about error handling in many places;
>> * simplified TCE handlers and merged IOMMU parts inline - for example,
>> there used to be kvmppc_h_put_tce_iommu(), now it is merged into
>> kvmppc_h_put_tce(); this allows to check IOBA boundaries against
>> the first attached table only (makes the code simpler);
>>
>> v3:

Re: [PATCH v3 1/2] powerpc: split ftrace bits into a separate file

2017-03-15 Thread Steven Rostedt
On Wed, 15 Mar 2017 14:35:16 +0530
"Naveen N. Rao"  wrote:

> I don't have a strong opinion about this, but I feel that x86 can simply 
> use ftrace_64.S, seeing as the current name is mcount_64.S.
> 
> Other architectures could do something similar too, or fall back to 
> ftrace_hook.S. That way, all ftrace low-level code can simply be 
> referred to as arch/*/ftrace_*.S

Just to clarify, I'm currently working on patches to clean up the
ftrace.S code in x86, and I renamed mcount_64.S to ftrace_64.S. I'm
also moving the ftrace code out of entry_32.S into a ftrace_32.S file
as well. Patches will hopefully be posted soon.

-- Steve


[PATCH 1/4] crypto: powerpc - Factor out the core CRC vpmsum algorithm

2017-03-15 Thread Daniel Axtens
The core nuts and bolts of the crc32c vpmsum algorithm will
also work for a number of other CRC algorithms with different
polynomials. Factor out the function into a new asm file.

To handle multiple users of the function, a user simply
provides constants, defines the name of their CRC function,
and then #includes the core algorithm file.

Cc: Anton Blanchard 
Signed-off-by: Daniel Axtens 

--

It's possible at this point to argue that the address
of the constant tables should be passed in to the function,
rather than doing this somewhat unconventional #include.

However, we're about to add further #ifdef's back into the core
that will be provided by the encapsulaing code, and which couldn't
be done as a variable without performance loss.
---
 arch/powerpc/crypto/crc32-vpmsum_core.S | 726 
 arch/powerpc/crypto/crc32c-vpmsum_asm.S | 714 +--
 2 files changed, 729 insertions(+), 711 deletions(-)
 create mode 100644 arch/powerpc/crypto/crc32-vpmsum_core.S

diff --git a/arch/powerpc/crypto/crc32-vpmsum_core.S 
b/arch/powerpc/crypto/crc32-vpmsum_core.S
new file mode 100644
index ..629244ef170e
--- /dev/null
+++ b/arch/powerpc/crypto/crc32-vpmsum_core.S
@@ -0,0 +1,726 @@
+/*
+ * Core of the accelerated CRC algorithm.
+ * In your file, define the constants and CRC_FUNCTION_NAME
+ * Then include this file.
+ *
+ * Calculate the checksum of data that is 16 byte aligned and a multiple of
+ * 16 bytes.
+ *
+ * The first step is to reduce it to 1024 bits. We do this in 8 parallel
+ * chunks in order to mask the latency of the vpmsum instructions. If we
+ * have more than 32 kB of data to checksum we repeat this step multiple
+ * times, passing in the previous 1024 bits.
+ *
+ * The next step is to reduce the 1024 bits to 64 bits. This step adds
+ * 32 bits of 0s to the end - this matches what a CRC does. We just
+ * calculate constants that land the data in this 32 bits.
+ *
+ * We then use fixed point Barrett reduction to compute a mod n over GF(2)
+ * for n = CRC using POWER8 instructions. We use x = 32.
+ *
+ * http://en.wikipedia.org/wiki/Barrett_reduction
+ *
+ * Copyright (C) 2015 Anton Blanchard , IBM
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+*/
+   
+#include 
+#include 
+
+#define MAX_SIZE   32768
+
+   .text
+
+#if defined(__BIG_ENDIAN__)
+#define BYTESWAP_DATA
+#else
+#undef BYTESWAP_DATA
+#endif
+
+#define off16  r25
+#define off32  r26
+#define off48  r27
+#define off64  r28
+#define off80  r29
+#define off96  r30
+#define off112 r31
+
+#define const1 v24
+#define const2 v25
+
+#define byteswap   v26
+#definemask_32bit  v27
+#definemask_64bit  v28
+#define zeroes v29
+
+#ifdef BYTESWAP_DATA
+#define VPERM(A, B, C, D) vpermA, B, C, D
+#else
+#define VPERM(A, B, C, D)
+#endif
+
+/* unsigned int CRC_FUNCTION_NAME(unsigned int crc, void *p, unsigned long 
len) */
+FUNC_START(CRC_FUNCTION_NAME)
+   std r31,-8(r1)
+   std r30,-16(r1)
+   std r29,-24(r1)
+   std r28,-32(r1)
+   std r27,-40(r1)
+   std r26,-48(r1)
+   std r25,-56(r1)
+
+   li  off16,16
+   li  off32,32
+   li  off48,48
+   li  off64,64
+   li  off80,80
+   li  off96,96
+   li  off112,112
+   li  r0,0
+
+   /* Enough room for saving 10 non volatile VMX registers */
+   subir6,r1,56+10*16
+   subir7,r1,56+2*16
+
+   stvxv20,0,r6
+   stvxv21,off16,r6
+   stvxv22,off32,r6
+   stvxv23,off48,r6
+   stvxv24,off64,r6
+   stvxv25,off80,r6
+   stvxv26,off96,r6
+   stvxv27,off112,r6
+   stvxv28,0,r7
+   stvxv29,off16,r7
+
+   mr  r10,r3
+
+   vxorzeroes,zeroes,zeroes
+   vspltisw v0,-1
+
+   vsldoi  mask_32bit,zeroes,v0,4
+   vsldoi  mask_64bit,zeroes,v0,8
+
+   /* Get the initial value into v8 */
+   vxorv8,v8,v8
+   MTVRD(v8, R3)
+   vsldoi  v8,zeroes,v8,8  /* shift into bottom 32 bits */
+
+#ifdef BYTESWAP_DATA
+   addis   r3,r2,.byteswap_constant@toc@ha
+   addir3,r3,.byteswap_constant@toc@l
+
+   lvx byteswap,0,r3
+   addir3,r3,16
+#endif
+
+   cmpdi   r5,256
+   blt .Lshort
+
+   rldicr  r6,r5,0,56
+
+   /* Checksum in blocks of MAX_SIZE */
+1: lis r7,MAX_SIZE@h
+   ori r7,r7,MAX_SIZE@l
+   mr  r9,r7
+   cmpdr6,r7
+   bgt 2f
+   mr  r7,r6
+2: subfr6,r7,r6
+
+   /* our main loop does 128 bytes at a time */
+   srdi

[PATCH 4/4] crypto: powerpc - Stress test for vpmsum implementations

2017-03-15 Thread Daniel Axtens
vpmsum implementations often don't kick in for short test vectors.
This is a simple test module that does a configurable number of
random tests, each up to 64kB and each with random offsets.

Both CRC-T10DIF and CRC32C are tested.

Cc: Anton Blanchard 
Signed-off-by: Daniel Axtens 

--

Not super fussy about the inclusion or otherwise of this - it was very
useful for debugging my code, and more tests are good :)

Also, I originally found the bug in Anton's CRC32c using this.

Tests pass on both BE 64 bit and LE 64 bit.
---
 arch/powerpc/crypto/Makefile  |   1 +
 arch/powerpc/crypto/crc-vpmsum_test.c | 137 ++
 crypto/Kconfig|   8 ++
 3 files changed, 146 insertions(+)
 create mode 100644 arch/powerpc/crypto/crc-vpmsum_test.c

diff --git a/arch/powerpc/crypto/Makefile b/arch/powerpc/crypto/Makefile
index e66aaf19764d..67eca3af9fc7 100644
--- a/arch/powerpc/crypto/Makefile
+++ b/arch/powerpc/crypto/Makefile
@@ -11,6 +11,7 @@ obj-$(CONFIG_CRYPTO_SHA1_PPC_SPE) += sha1-ppc-spe.o
 obj-$(CONFIG_CRYPTO_SHA256_PPC_SPE) += sha256-ppc-spe.o
 obj-$(CONFIG_CRYPTO_CRC32C_VPMSUM) += crc32c-vpmsum.o
 obj-$(CONFIG_CRYPTO_CRCT10DIF_VPMSUM) += crct10dif-vpmsum.o
+obj-$(CONFIG_CRYPTO_VPMSUM_TESTER) += crc-vpmsum_test.o
 
 aes-ppc-spe-y := aes-spe-core.o aes-spe-keys.o aes-tab-4k.o aes-spe-modes.o 
aes-spe-glue.o
 md5-ppc-y := md5-asm.o md5-glue.o
diff --git a/arch/powerpc/crypto/crc-vpmsum_test.c 
b/arch/powerpc/crypto/crc-vpmsum_test.c
new file mode 100644
index ..d58242557f33
--- /dev/null
+++ b/arch/powerpc/crypto/crc-vpmsum_test.c
@@ -0,0 +1,137 @@
+/*
+ * CRC vpmsum tester
+ * Copyright 2017 Daniel Axtens, IBM Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+static unsigned long iterations = 1;
+
+#define MAX_CRC_LENGTH 65535
+
+
+static int __init crc_test_init(void)
+{
+   u16 crc16 = 0, verify16 = 0;
+   u32 crc32 = 0, verify32 = 0;
+   __le32 verify32le = 0;
+   unsigned char *data;
+   unsigned long i;
+   int ret;
+
+   struct crypto_shash *crct10dif_tfm;
+   struct crypto_shash *crc32c_tfm;
+   
+   if (!cpu_has_feature(CPU_FTR_ARCH_207S))
+   return -ENODEV;
+   
+   data = kmalloc(MAX_CRC_LENGTH, GFP_KERNEL);
+   if (!data)
+   return -ENOMEM;
+
+   crct10dif_tfm = crypto_alloc_shash("crct10dif", 0, 0);
+
+   if (IS_ERR(crct10dif_tfm)) {
+   pr_err("Error allocating crc-t10dif\n");
+   goto free_buf;
+   }
+
+   crc32c_tfm = crypto_alloc_shash("crc32c", 0, 0);
+
+   if (IS_ERR(crc32c_tfm)) {
+   pr_err("Error allocating crc32c\n");
+   goto free_16;
+   }
+
+   do {
+   SHASH_DESC_ON_STACK(crct10dif_shash, crct10dif_tfm);
+   SHASH_DESC_ON_STACK(crc32c_shash, crc32c_tfm);
+
+   crct10dif_shash->tfm = crct10dif_tfm;
+   ret = crypto_shash_init(crct10dif_shash);
+
+   if (ret) {
+   pr_err("Error initing crc-t10dif\n");
+   goto free_32;
+   }
+   
+
+   crc32c_shash->tfm = crc32c_tfm;
+   ret = crypto_shash_init(crc32c_shash);
+
+   if (ret) {
+   pr_err("Error initing crc32c\n");
+   goto free_32;
+   }
+   
+   pr_info("crc-vpmsum_test begins, %lu iterations\n", iterations);
+   for (i=0; i

[PATCH 3/4] crypto: powerpc - Add CRC-T10DIF acceleration

2017-03-15 Thread Daniel Axtens
T10DIF is a CRC16 used heavily in NVMe.

It turns out we can accelerate it with a CRC32 library and a few
little tricks.

Provide the accelerator based the refactored CRC32 code.

Cc: Anton Blanchard 
Thanks-to: Hong Bo Peng 
Signed-off-by: Daniel Axtens 
---
 arch/powerpc/crypto/Makefile|   2 +
 arch/powerpc/crypto/crct10dif-vpmsum_asm.S  | 850 
 arch/powerpc/crypto/crct10dif-vpmsum_glue.c | 125 
 crypto/Kconfig  |   9 +
 4 files changed, 986 insertions(+)
 create mode 100644 arch/powerpc/crypto/crct10dif-vpmsum_asm.S
 create mode 100644 arch/powerpc/crypto/crct10dif-vpmsum_glue.c

diff --git a/arch/powerpc/crypto/Makefile b/arch/powerpc/crypto/Makefile
index 87f40454bad3..e66aaf19764d 100644
--- a/arch/powerpc/crypto/Makefile
+++ b/arch/powerpc/crypto/Makefile
@@ -10,6 +10,7 @@ obj-$(CONFIG_CRYPTO_SHA1_PPC) += sha1-powerpc.o
 obj-$(CONFIG_CRYPTO_SHA1_PPC_SPE) += sha1-ppc-spe.o
 obj-$(CONFIG_CRYPTO_SHA256_PPC_SPE) += sha256-ppc-spe.o
 obj-$(CONFIG_CRYPTO_CRC32C_VPMSUM) += crc32c-vpmsum.o
+obj-$(CONFIG_CRYPTO_CRCT10DIF_VPMSUM) += crct10dif-vpmsum.o
 
 aes-ppc-spe-y := aes-spe-core.o aes-spe-keys.o aes-tab-4k.o aes-spe-modes.o 
aes-spe-glue.o
 md5-ppc-y := md5-asm.o md5-glue.o
@@ -17,3 +18,4 @@ sha1-powerpc-y := sha1-powerpc-asm.o sha1.o
 sha1-ppc-spe-y := sha1-spe-asm.o sha1-spe-glue.o
 sha256-ppc-spe-y := sha256-spe-asm.o sha256-spe-glue.o
 crc32c-vpmsum-y := crc32c-vpmsum_asm.o crc32c-vpmsum_glue.o
+crct10dif-vpmsum-y := crct10dif-vpmsum_asm.o crct10dif-vpmsum_glue.o
diff --git a/arch/powerpc/crypto/crct10dif-vpmsum_asm.S 
b/arch/powerpc/crypto/crct10dif-vpmsum_asm.S
new file mode 100644
index ..5e3d81a0af1b
--- /dev/null
+++ b/arch/powerpc/crypto/crct10dif-vpmsum_asm.S
@@ -0,0 +1,850 @@
+/*
+ * Calculate a CRC T10DIF  with vpmsum acceleration
+ *
+ * Constants generated by crc32-vpmsum, available at
+ * https://github.com/antonblanchard/crc32-vpmsum
+ *
+ * crc32-vpmsum is
+ * Copyright (C) 2015 Anton Blanchard , IBM
+ * and is available under the GPL v2 or later.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+   .section.rodata
+.balign 16
+
+.byteswap_constant:
+   /* byte reverse permute constant */
+   .octa 0x0F0E0D0C0B0A09080706050403020100
+
+.constants:
+
+   /* Reduce 262144 kbits to 1024 bits */
+   /* x^261184 mod p(x), x^261120 mod p(x) */
+   .octa 0x56d35255
+
+   /* x^260160 mod p(x), x^260096 mod p(x) */
+   .octa 0xee67a1e4
+
+   /* x^259136 mod p(x), x^259072 mod p(x) */
+   .octa 0x60834ad1
+
+   /* x^258112 mod p(x), x^258048 mod p(x) */
+   .octa 0x8cfe9ab4
+
+   /* x^257088 mod p(x), x^257024 mod p(x) */
+   .octa 0x3e93fdb5
+
+   /* x^256064 mod p(x), x^256000 mod p(x) */
+   .octa 0x3c204548
+
+   /* x^255040 mod p(x), x^254976 mod p(x) */
+   .octa 0xb1fc8d69
+
+   /* x^254016 mod p(x), x^253952 mod p(x) */
+   .octa 0xf82b24ad
+
+   /* x^252992 mod p(x), x^252928 mod p(x) */
+   .octa 0x44429f1a
+
+   /* x^251968 mod p(x), x^251904 mod p(x) */
+   .octa 0xe88c66ec
+
+   /* x^250944 mod p(x), x^250880 mod p(x) */
+   .octa 0x385cc87d
+
+   /* x^249920 mod p(x), x^249856 mod p(x) */
+   .octa 0x3227c8ff
+
+   /* x^248896 mod p(x), x^248832 mod p(x) */
+   .octa 0xa9a93344
+
+   /* x^247872 mod p(x), x^247808 mod p(x) */
+   .octa 0xabaa66eb
+
+   /* x^246848 mod p(x), x^246784 mod p(x) */
+   .octa 0x1ac3c4ef
+
+   /* x^245824 mod p(x), x^245760 mod p(x) */
+   .octa 0x63f056f3
+
+   /* x^244800 mod p(x), x^244736 mod p(x) */
+   .octa 0x32cc0205
+
+   /* x^243776 mod p(x), x^243712 mod p(x) */
+   .octa 0xf8b5568e
+
+   /* x^242752 mod p(x), x^242688 mod p(x) */
+   .octa 0x8db16429
+
+   /* x^241728 mod p(x), x^241664 mod p(x) */
+   .octa 0x59ca6b66
+
+   /* x^240704 mod p(x), x^240640 mod p(x) */
+   .octa 0x5f5c18f8
+
+   /* x^239680 mod p(x), x^239616 mod p(x) */
+   .octa 0x61afb609
+
+   /* x^238656 mod p(x), x^238592 mod p(x) */
+   .octa 0xe29e099a
+
+  

[PATCH 2/4] crypto: powerpc - Re-enable non-REFLECTed CRCs

2017-03-15 Thread Daniel Axtens
When CRC32c was included in the kernel, Anton ripped out
the #ifdefs around reflected polynomials, because CRC32c
is always reflected. However, not all CRCs use reflection
so we'd like to make it optional.

Restore the REFLECT parts from Anton's original CRC32
implementation (https://github.com/antonblanchard/crc32-vpmsum)

That implementation is available under GPLv2+, so we're OK
from a licensing point of view:
https://github.com/antonblanchard/crc32-vpmsum/blob/master/LICENSE.TXT

As CRC32c requires REFLECT, add that #define.

Cc: Anton Blanchard 
Signed-off-by: Daniel Axtens 

---

I compared the disassembly of the CRC32c module on LE before and
after the change, and verified that they were the same.

I verified that the crypto self-tests still pass on LE and BE, and
my tests in patch 4 still pass as well.
---
 arch/powerpc/crypto/crc32-vpmsum_core.S | 31 ++-
 arch/powerpc/crypto/crc32c-vpmsum_asm.S |  1 +
 2 files changed, 31 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/crypto/crc32-vpmsum_core.S 
b/arch/powerpc/crypto/crc32-vpmsum_core.S
index 629244ef170e..87fabf4d391a 100644
--- a/arch/powerpc/crypto/crc32-vpmsum_core.S
+++ b/arch/powerpc/crypto/crc32-vpmsum_core.S
@@ -35,7 +35,9 @@
 
.text
 
-#if defined(__BIG_ENDIAN__)
+#if defined(__BIG_ENDIAN__) && defined(REFLECT)
+#define BYTESWAP_DATA
+#elif defined(__LITTLE_ENDIAN__) && !defined(REFLECT)
 #define BYTESWAP_DATA
 #else
 #undef BYTESWAP_DATA
@@ -108,7 +110,11 @@ FUNC_START(CRC_FUNCTION_NAME)
/* Get the initial value into v8 */
vxorv8,v8,v8
MTVRD(v8, R3)
+#ifdef REFLECT
vsldoi  v8,zeroes,v8,8  /* shift into bottom 32 bits */
+#else
+   vsldoi  v8,v8,zeroes,4  /* shift into top 32 bits */
+#endif
 
 #ifdef BYTESWAP_DATA
addis   r3,r2,.byteswap_constant@toc@ha
@@ -354,6 +360,7 @@ FUNC_START(CRC_FUNCTION_NAME)
vxorv6,v6,v14
vxorv7,v7,v15
 
+#ifdef REFLECT
/*
 * vpmsumd produces a 96 bit result in the least significant bits
 * of the register. Since we are bit reflected we have to shift it
@@ -368,6 +375,7 @@ FUNC_START(CRC_FUNCTION_NAME)
vsldoi  v5,v5,zeroes,4
vsldoi  v6,v6,zeroes,4
vsldoi  v7,v7,zeroes,4
+#endif
 
/* xor with last 1024 bits */
lvx v8,0,r4
@@ -511,13 +519,33 @@ FUNC_START(CRC_FUNCTION_NAME)
vsldoi  v1,v0,v0,8
vxorv0,v0,v1/* xor two 64 bit results together */
 
+#ifdef REFLECT
/* shift left one bit */
vspltisb v1,1
vsl v0,v0,v1
+#endif
 
vandv0,v0,mask_64bit
+#ifndef REFLECT
+   /*
+* Now for the Barrett reduction algorithm. The idea is to calculate q,
+* the multiple of our polynomial that we need to subtract. By
+* doing the computation 2x bits higher (ie 64 bits) and shifting the
+* result back down 2x bits, we round down to the nearest multiple.
+*/
+   VPMSUMD(v1,v0,const1)   /* ma */
+   vsldoi  v1,zeroes,v1,8  /* q = floor(ma/(2^64)) */
+   VPMSUMD(v1,v1,const2)   /* qn */
+   vxorv0,v0,v1/* a - qn, subtraction is xor in GF(2) */
 
/*
+* Get the result into r3. We need to shift it left 8 bytes:
+* V0 [ 0 1 2 X ]
+* V0 [ 0 X 2 3 ]
+*/
+   vsldoi  v0,v0,zeroes,8  /* shift result into top 64 bits */
+#else
+   /*
 * The reflected version of Barrett reduction. Instead of bit
 * reflecting our data (which is expensive to do), we bit reflect our
 * constants and our algorithm, which means the intermediate data in
@@ -537,6 +565,7 @@ FUNC_START(CRC_FUNCTION_NAME)
 * V0 [ 0 X 2 3 ]
 */
vsldoi  v0,v0,zeroes,4  /* shift result into top 64 bits of */
+#endif
 
/* Get it into r3 */
MFVRD(R3, v0)
diff --git a/arch/powerpc/crypto/crc32c-vpmsum_asm.S 
b/arch/powerpc/crypto/crc32c-vpmsum_asm.S
index c0d080caefc1..d2bea48051a0 100644
--- a/arch/powerpc/crypto/crc32c-vpmsum_asm.S
+++ b/arch/powerpc/crypto/crc32c-vpmsum_asm.S
@@ -842,4 +842,5 @@
.octa 0x000105ec76f1
 
 #define CRC_FUNCTION_NAME __crc32c_vpmsum
+#define REFLECT
 #include "crc32-vpmsum_core.S"
-- 
2.9.3



Re: [PATCH v2 1/6] powerpc/perf: Define big-endian version of perf_mem_data_src

2017-03-15 Thread Peter Zijlstra
On Wed, Mar 15, 2017 at 05:20:15PM +1100, Michael Ellerman wrote:

> > I see no implementation; so why are you poking at it.
> 
> Maddy has posted an implementation of the kernel part for powerpc in
> patch 2 of this series, but maybe you're not on Cc?

I am not indeed. That and a completely inadequate Changelog have lead to
great confusion.


Re: [PATCH] drivers/pcmcia: NO_IRQ removal for electra_cf.c

2017-03-15 Thread Arnd Bergmann
On Wed, Mar 15, 2017 at 6:35 AM, Michael Ellerman  wrote:
> Arnd Bergmann  writes:
>> On Tue, Mar 14, 2017 at 11:51 AM, Michael Ellerman  
>> wrote:
>>> Michael Ellerman  writes:
>>>
>>
>> drivers/ata/pata_mpc52xx.c: if (ata_irq == NO_IRQ) {
>> drivers/ata/sata_dwc_460ex.c:#ifndef NO_IRQ
>> drivers/ata/sata_dwc_460ex.c:#define NO_IRQ 0
>> drivers/ata/sata_dwc_460ex.c:   if (hsdev->dma->irq == NO_IRQ) {
>> drivers/ata/sata_dwc_460ex.c:   if (irq == NO_IRQ) {
>> drivers/iommu/fsl_pamu.c:   if (irq == NO_IRQ) {
>> drivers/iommu/fsl_pamu.c:   if (irq != NO_IRQ)
>> drivers/media/platform/fsl-viu.c:   if (viu_irq == NO_IRQ) {
>> drivers/mtd/nand/mpc5121_nfc.c: if (prv->irq == NO_IRQ) {
>> drivers/pcmcia/electra_cf.c:cf->irq = NO_IRQ;
>> drivers/pcmcia/electra_cf.c:if (cf->irq != NO_IRQ)
>> drivers/soc/fsl/qe/qe_ic.c:/* Return an interrupt vector or NO_IRQ if
>> no interrupt is pending. */
>> drivers/soc/fsl/qe/qe_ic.c: return NO_IRQ;
>> drivers/soc/fsl/qe/qe_ic.c:/* Return an interrupt vector or NO_IRQ if
>> no interrupt is pending. */
>> drivers/soc/fsl/qe/qe_ic.c: return NO_IRQ;
>> drivers/soc/fsl/qe/qe_ic.c: if (qe_ic->virq_low == NO_IRQ) {
>> drivers/soc/fsl/qe/qe_ic.c: if (qe_ic->virq_high != NO_IRQ &&
>> drivers/spi/spi-mpc52xx.c:  if (status && (irq != NO_IRQ))
>> drivers/tty/ehv_bytechan.c: if (stdout_irq == NO_IRQ) {
>> drivers/tty/ehv_bytechan.c: if ((bc->rx_irq == NO_IRQ) ||
>> (bc->tx_irq == NO_IRQ)) {
>> drivers/tty/serial/cpm_uart/cpm_uart_core.c:if (pinfo->port.irq == 
>> NO_IRQ) {
>> drivers/uio/uio_fsl_elbc_gpcm.c:if (irq != NO_IRQ) {
>> drivers/uio/uio_fsl_elbc_gpcm.c:irq = NO_IRQ;
>> drivers/uio/uio_fsl_elbc_gpcm.c: irq != NO_IRQ ? irq : -1);
>> drivers/usb/host/ehci-grlib.c:  if (irq == NO_IRQ) {
>> drivers/usb/host/ehci-ppc-of.c: if (irq == NO_IRQ) {
>> drivers/usb/host/fhci-hcd.c:if (usb_irq == NO_IRQ) {
>> drivers/usb/host/ohci-ppc-of.c: if (irq == NO_IRQ) {
>> drivers/usb/host/uhci-grlib.c:  if (irq == NO_IRQ) {
>> drivers/video/fbdev/mb862xx/mb862xxfbdrv.c: if (par->irq == NO_IRQ) {
>> drivers/virt/fsl_hypervisor.c:  if (!handle || (irq == NO_IRQ)) {
>> include/soc/fsl/qe/qe_ic.h: if (cascade_irq != NO_IRQ)
>> include/soc/fsl/qe/qe_ic.h: if (cascade_irq != NO_IRQ)
>> include/soc/fsl/qe/qe_ic.h: if (cascade_irq != NO_IRQ)
>> include/soc/fsl/qe/qe_ic.h: if (cascade_irq != NO_IRQ)
>> include/soc/fsl/qe/qe_ic.h: if (cascade_irq == NO_IRQ)
>> include/soc/fsl/qe/qe_ic.h: if (cascade_irq != NO_IRQ)
>>
>> Did you have other pending patches for those?
>
> No. I stayed away from anything FSL related as I was under the
> impression some of them were being ported to arch/arm, which uses -1 for
> NO_IRQ IIUIC.
>
> eg. all of include/soc/fsl and drivers/soc/fsl was moved from
> arch/powerpc in commit 7aa1aa6ecec2, which said:
>
> QE: Move QE from arch/powerpc to drivers/soc
>
> ls1 has qe and ls1 has arm cpu.
> move qe from arch/powerpc to drivers/soc/fsl
> to adapt to powerpc and arm
>
> But looking at the Kconfigs it looks like they're still only selectable
> on PPC. So that's a bit annoying.
>
> I'll do patches for everything above that's not drivers/soc or
> include/soc and hopefully we can hear from someone at NXP on the plans
> for getting the soc parts enabled on arm.

I think the removal of the NO_IRQ references is a requirement for
getting the drivers working properly on ARM, as the OF platform
code will use '0' for invalid IRQs.

Arnd


Re: [PATCH] powerpc/powernv: de-deuplicate OPAL call wrappers

2017-03-15 Thread Michael Ellerman
Oliver O'Halloran  writes:

> Currently the code to perform an OPAL call is duplicated between the
> normal path and path taken when tracepoints are enabled. There's no
> real need for this and combining them makes opal_tracepoint_entry
> considerably easier to understand.

Sorry I missed this when you sent it.

I don't think it applies any more, but if you rebased it I'd merge it.

cheers


Re: [PATCH 5/5] powerpc/smp: Add Power9 scheduler topology

2017-03-15 Thread Michael Ellerman
Balbir Singh  writes:

> On 02/03/17 11:49, Oliver O'Halloran wrote:
>> In previous generations of Power processors each core had a private L2
>> cache. The Power9 processor has a slightly different architecture where
>> the L2 cache is shared among pairs of cores rather than being completely
>> private.
>> 
>> Making the scheduler aware of this cache sharing allows the scheduler to
>> make more intelligent migration decisions. When one core in the pair is
>> overloaded tasks can be migrated to its paired core to improve throughput
>> without cache-refilling penality typically associated with task
>> migration.
>
> Could you please describe the changes to sched_domains w.r.t before and
> after for P9.

Yes please.

cheers


Re: [PATCH 5/5] powerpc/smp: Add Power9 scheduler topology

2017-03-15 Thread Michael Ellerman
Oliver O'Halloran  writes:

> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> index 5571f30ff72d..5e1811b24415 100644
> --- a/arch/powerpc/kernel/smp.c
> +++ b/arch/powerpc/kernel/smp.c
> @@ -724,10 +724,17 @@ static void update_cpu_masks(int cpu, bool onlining)
>  
>   update_thread_mask(cpu, onlining);
>  
> + /* we need the l2 cache mask for the power9 scheduler topology */
> + if (cpu_has_feature(CPU_FTR_ARCH_300))
> + update_mask_by_l2(cpu, onlining, cpu_cache_mask);

I guess I missed something somewhere, why do we need to check the CPU
feature?

...
> @@ -829,7 +862,10 @@ void __init smp_cpus_done(unsigned int max_cpus)
>  
>   dump_numa_cpu_topology();
>  
> - set_sched_topology(powerpc_topology);
> + if (cpu_has_feature(CPU_FTR_ARCH_300))
> + set_sched_topology(power9_topology);
> + else
> + set_sched_topology(powerpc_topology);
  
Ideally this would just all come from the device tree. If we detect that
the L2 is shared then we tell the scheduler that, we shouldn't need to
know that P9 is special. It certainly doesn't say anywhere in ISA 3.00
that the L2 is shared.

cheers


Re: [PATCH 4/5] powerpc/smp: add cpu_cache_mask

2017-03-15 Thread Michael Ellerman
Oliver O'Halloran  writes:

> Traditionally we have only ever tracked which CPUs are in the same core
> (cpu_sibling_mask) and on the same die (cpu_core_mask). For Power9 we
> need to be aware of which CPUs share cache with each other so this patch
> adds cpu_cache_mask and the underlying cpu_cache_map variable to track
> this.

But which cache?

Some CPUs on Power8 share L3, or L4.

I think just call it cpu_l2cache_map to make it explicit.

cheers


Re: [PATCH 3/5] powerpc/smp: Add update_cpu_masks()

2017-03-15 Thread Michael Ellerman
Oliver O'Halloran  writes:

> When adding and removing a CPU from the system the per-cpu masks that
> are used by the scheduler to construct scheduler domains need to be updated
> to account for the cpu entering or exiting the system. Currently logic this
> is open-coded for the thread sibling mask and shared for the core mask.

"logic this is"

> This patch moves all the logic for rebuilding these masks into a single
> function and simplifies the logic which determines which CPUs are within
> a "core".

The diff is hard to read but I think it's OK. Other than ...

> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> index 1c531887ca51..3922cace927e 100644
> --- a/arch/powerpc/kernel/smp.c
> +++ b/arch/powerpc/kernel/smp.c
> @@ -630,14 +630,20 @@ int cpu_first_thread_of_core(int core)
>  }
>  EXPORT_SYMBOL_GPL(cpu_first_thread_of_core);
>  
> -static void traverse_siblings_chip_id(int cpu, bool add, int chipid)
> +static bool update_core_mask_by_chip_id(int cpu, bool add)
 ^
 call it
 onlining like below

>  {
>   const struct cpumask *mask = add ? cpu_online_mask : cpu_present_mask;
> + int chipid = cpu_to_chip_id(cpu);
>   int i;
>  
> + if (chipid == -1)
> + return false;
> +
>   for_each_cpu(i, mask)
>   if (cpu_to_chip_id(i) == chipid)
>   set_cpus_related(cpu, i, add, cpu_core_mask);
> +
> + return true;
>  }
>  
>  /* Must be called when no change can occur to cpu_present_mask,
> @@ -662,42 +668,72 @@ static struct device_node *cpu_to_l2cache(int cpu)
>   return cache;
>  }
>  
> -static void traverse_core_siblings(int cpu, bool add)
> +static bool update_core_mask_by_l2(int cpu, bool onlining)
>  {
> + const struct cpumask *mask = onlining ? cpu_online_mask : 
> cpu_present_mask;
>   struct device_node *l2_cache, *np;
> - const struct cpumask *mask;
> - int chip_id;
>   int i;
>  
> - /* threads that share a chip-id are considered siblings (same die) */
> - chip_id = cpu_to_chip_id(cpu);
> -
> - if (chip_id >= 0) {
> - traverse_siblings_chip_id(cpu, add, chip_id);
> - return;
> - }
> -
> - /* if the chip-id fails then group siblings by the L2 cache */
>   l2_cache = cpu_to_l2cache(cpu);
> - mask = add ? cpu_online_mask : cpu_present_mask;
> + if (l2_cache == NULL)
> + return false;
> +
>   for_each_cpu(i, mask) {
>   np = cpu_to_l2cache(i);
>   if (!np)
>   continue;
>  
>   if (np == l2_cache)
> - set_cpus_related(cpu, i, add, cpu_core_mask);
> + set_cpus_related(cpu, i, onlining, cpu_core_mask);
>  
>   of_node_put(np);
>   }
>   of_node_put(l2_cache);
> +
> + return true;
> +}
> +
> +static void update_thread_mask(int cpu, bool onlining)
> +{
> + int base = cpu_first_thread_sibling(cpu);
> + int i;
> +
> + pr_info("CPUDEBUG: onlining cpu %d, base %d, thread_per_core %d",
> + cpu, base, threads_per_core);

That's too spammy for cpu hotplug. Make it pr_debug() at most.

cheers


Re: [PATCH 2/5] powerpc/smp: add set_cpus_related()

2017-03-15 Thread Michael Ellerman
Oliver O'Halloran  writes:
> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> index dfe0e1d9cd06..1c531887ca51 100644
> --- a/arch/powerpc/kernel/smp.c
> +++ b/arch/powerpc/kernel/smp.c
> @@ -377,6 +377,25 @@ static void smp_store_cpu_info(int id)
>  #endif
>  }
>  
> +/*
> + * Relationships between CPUs are maintained in a set of per-cpu cpumasks. We
> + * need to ensure that they are kept consistant between CPUs when they are
> + * changed.
> + *
> + * This is slightly tricky since the core mask must be a strict superset of
> + * the sibling mask.
> + */
> +static void set_cpus_related(int i, int j, bool related, struct cpumask 
> *(*relation_fn)(int))
> +{
> + if (related) {
> + cpumask_set_cpu(i, relation_fn(j));
> + cpumask_set_cpu(j, relation_fn(i));
> + } else {
> + cpumask_clear_cpu(i, relation_fn(j));
> + cpumask_clear_cpu(j, relation_fn(i));
> + }
> +}

I think you pushed the abstraction one notch too far on this one, or
perhaps not far enough.

We end up with a function called "set" that might clear, depending on a
bool you pass. Which is hard to parse, eg:

set_cpus_related(cpu, base + i, false, cpu_sibling_mask);

And I know there's two places where we pass an existing bool "add", but
there's four where we pass true or false.

If we want to push it in that direction I think we should just pass the
set/clear routine instead of the flag, so:

do_cpus_related(cpu, base + i, cpumask_clear_cpu, cpu_sibling_mask);

But that might be overdoing it.

So I think we should just do:

static void set_cpus_related(int i, int j, struct cpumask *(*mask_func)(int))
{
cpumask_set_cpu(i, mask_func(j));
cpumask_set_cpu(j, mask_func(i));
}

static void clear_cpus_related(int i, int j, struct cpumask *(*mask_func)(int))
{
cpumask_clear_cpu(i, mask_func(j));
cpumask_clear_cpu(j, mask_func(i));
}


So the cases with add become:

if (add)
set_cpus_related(cpu, i, cpu_core_mask(i));
else
clear_cpus_related(cpu, i, cpu_core_mask(i));

Which is not as pretty but more explicit.

And the other cases look much better, eg:

clear_cpus_related(cpu, base + i, cpu_sibling_mask);

??

cheers


Re: [PATCH 1/5] powerpc/smp: use cpu_to_chip_id() to find siblings

2017-03-15 Thread Michael Ellerman
Oliver O'Halloran  writes:

> To determine which logical CPUs are on the same core the kernel uses the
> ibm,chipid property from the device tree node associated with that cpu.
> The lookup for this this information is currently open coded in both
> traverse_siblings() and traverse_siblings_chip_id(). This patch replaces
> these manual lookups with the existing cpu_to_chip_id() function.

Some minor nits.

cpu_to_chip_id() actually searches recursively up the parents until it
finds a ibm,chip-id, so it's not a 1:1 replacement for the existing
logic, but it's probably still an OK conversion. It's still worth
mentioning in the change log thought.

> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> index 893bd7f79be6..dfe0e1d9cd06 100644
> --- a/arch/powerpc/kernel/smp.c
> +++ b/arch/powerpc/kernel/smp.c
> @@ -664,23 +655,19 @@ static void traverse_core_siblings(int cpu, bool add)
>  {
>   struct device_node *l2_cache, *np;
>   const struct cpumask *mask;
> - int i, chip, plen;
> - const __be32 *prop;
> + int chip_id;
> + int i;
>  
> - /* First see if we have ibm,chip-id properties in cpu nodes */
> - np = of_get_cpu_node(cpu, NULL);
> - if (np) {
> - chip = -1;
> - prop = of_get_property(np, "ibm,chip-id", );
> - if (prop && plen == sizeof(int))
> - chip = of_read_number(prop, 1);
> - of_node_put(np);
> - if (chip >= 0) {
> - traverse_siblings_chip_id(cpu, add, chip);
> - return;
> - }
> + /* threads that share a chip-id are considered siblings (same die) */

You might know it means the "same die", but AFAIK there's no actual
definition for what the chip-id means, so let's not write comments that
might be wrong in future. Just saying they're considered siblings is
sufficient.

Also "Threads" :)

cheers


Re: [PATCH 2/3] powernv:idle: Don't override default/deepest directly in kernel

2017-03-15 Thread Gautham R Shenoy
Hi Nick,

Thanks for reviewing the patch.

On Wed, Mar 15, 2017 at 12:05:43AM +1000, Nicholas Piggin wrote:
> On Mon, 13 Mar 2017 11:31:27 +0530
> "Gautham R. Shenoy"  wrote:
> 
> > From: "Gautham R. Shenoy" 
> > 
> > Currently during idle-init on power9, if we don't find suitable stop
> > states in the device tree that can be used as the
> > default_stop/deepest_stop, we set stop0 (ESL=1,EC=1) as the default
> > stop state psscr to be used by power9_idle and deepest stop state
> > which is used by CPU-Hotplug.
> > 
> > However, if the platform firmware has not configured or enabled a stop
> > state, the kernel should not make any assumptions and fallback to a
> > default choice.
> > 
> > If the kernel uses a stop state that is not configured by the platform
> > firmware, it may lead to further failures which should be avoided.
> > 
> > In this patch, we modify the init code to ensure that the kernel uses
> > only the stop states exposed by the firmware through the device
> > tree. When a suitable default stop state isn't found, we disable
> > ppc_md.power_save for power9. Similarly, when a suitable
> > deepest_stop_state is not found in the device tree exported by the
> > firmware, fall back to the default busy-wait loop in the CPU-Hotplug
> > code.
> 
> Seems reasonable. I have a few comments that you may consider. Nothing
> too major.
> 
> Btw., it would be nice to move this hotplug idling selection code to
> idle.c. Have the hotplug just ask to enter the best available idle mode
> and that's it. I'm not asking you to do that for this series, but perhaps
> consider it for the future.

That's not a bad idea. I will do it in the respin of the patchset.

> 
> 
> > diff --git a/arch/powerpc/platforms/powernv/idle.c 
> > b/arch/powerpc/platforms/powernv/idle.c
> > index 4ee837e..9fde6e4 100644
> > --- a/arch/powerpc/platforms/powernv/idle.c
> > +++ b/arch/powerpc/platforms/powernv/idle.c
> > @@ -147,7 +147,6 @@ u32 pnv_get_supported_cpuidle_states(void)
> >  }
> >  EXPORT_SYMBOL_GPL(pnv_get_supported_cpuidle_states);
> >  
> > -
> >  static void pnv_fastsleep_workaround_apply(void *info)
> >  
> >  {
> > @@ -243,6 +242,7 @@ static DEVICE_ATTR(fastsleep_workaround_applyonce, 0600,
> >   */
> >  u64 pnv_default_stop_val;
> >  u64 pnv_default_stop_mask;
> > +bool default_stop_found;
> >  
> >  /*
> >   * Used for ppc_md.power_save which needs a function with no parameters
> > @@ -264,6 +264,7 @@ static void power9_idle(void)
> >   */
> >  u64 pnv_deepest_stop_psscr_val;
> >  u64 pnv_deepest_stop_psscr_mask;
> > +bool deepest_stop_found;
> >  
> >  /*
> >   * Power ISA 3.0 idle initialization.
> 
> If the hotplug idle code was in idle.c, then all this deepest/default stop
> logic and register settings would be static to idle.c, which would be nice.
> 
> If you have a function to check if deepest stop is found, then you don't need
> a non-static variable here (or for default_stop_found AFAIKS).

Sure!

> 
> 
> > @@ -352,7 +353,6 @@ static int __init pnv_power9_idle_init(struct 
> > device_node *np, u32 *flags,
> > u32 *residency_ns = NULL;
> > u64 max_residency_ns = 0;
> > int rc = 0, i;
> > -   bool default_stop_found = false, deepest_stop_found = false;
> >  
> > psscr_val = kcalloc(dt_idle_states, sizeof(*psscr_val), GFP_KERNEL);
> > psscr_mask = kcalloc(dt_idle_states, sizeof(*psscr_mask), GFP_KERNEL);
> > @@ -433,20 +433,22 @@ static int __init pnv_power9_idle_init(struct 
> > device_node *np, u32 *flags,
> > }
> >  
> > if (!default_stop_found) {
> > -   pnv_default_stop_val = PSSCR_HV_DEFAULT_VAL;
> > -   pnv_default_stop_mask = PSSCR_HV_DEFAULT_MASK;
> > -   pr_warn("Setting default stop psscr 
> > val=0x%016llx,mask=0x%016llx\n",
> > +   pr_warn("powernv:idle: Default stop not found. Disabling 
> > ppcmd.powersave\n");
> > +   } else {
> > +   pr_info("powernv:idle: Default stop: psscr = 
> > 0x%016llx,mask=0x%016llx\n",
> > pnv_default_stop_val, pnv_default_stop_mask);
> > }
> >  
> > if (!deepest_stop_found) {
> > -   pnv_deepest_stop_psscr_val = PSSCR_HV_DEFAULT_VAL;
> > -   pnv_deepest_stop_psscr_mask = PSSCR_HV_DEFAULT_MASK;
> > -   pr_warn("Setting default stop psscr 
> > val=0x%016llx,mask=0x%016llx\n",
> > +   pr_warn("powernv:idle: Deepest stop not found. CPU-Hotplug is 
> > affected\n");
> 
> I guess these warnings are meant for developers, but it might be nice
> to make them slightly more meaningful? Prefix of choice seems to be
> "cpuidle-powernv:"


> 
> Then you could have "no suitable stop state provided by firmware. Disabling
> idle power saving" and "no suitable stop state provided by firmware. Offlined
> CPUs will busy-wait", perhaps?

How about
pr_warn("cpuidle-powernv: No suitable stop for CPU-Hotplug. 
Offlined CPUs will busy wait\n");
> 
> Just a suggestion.
> 
> > +   } else {
> > + 

Re: [FIX PATCH v1] powerpc/pseries: Fix reference count leak during CPU unplug

2017-03-15 Thread Sachin Sant

>> So you suggest that adding of_node_get() to __of_attach_node_sysfs()
>> is the right fix ?
> 
> If I understand that this only creates for hot-added cpus then no. Also
> for this to be the correct fix I would expect to see this recreate for
> all hot-remove operations such as memory and pci devices as well.
> 

So I can recreate this problem while removing a I/O adapter as well.

Here is a trace against 4.11.0-rc1-next20170310 while performing
a DLPAR remove operation on a I/O adapter.

[  549.815605] rpaphp: Slot [U78C7.001.RCH0042-P1-C1] registered
[  549.815608] rpadlpar_io: slot PHB 64 added
[  575.267302] iommu: Removing device 0040:01:00.0 from group 1
[  575.267401] iommu: Removing device 0040:01:00.1 from group 1
[  575.267842] refcount_t: underflow; use-after-free.
[  575.267855] [ cut here ]
[  575.267862] WARNING: CPU: 2 PID: 3837 at lib/refcount.c:128 
refcount_sub_and_test+0xf4/0x110
[  575.267865] Modules linked in: rpadlpar_io rpaphp dccp_diag dccp tcp_diag 
udp_diag inet_diag unix_diag af_packet_diag netlink_diag rpcrdma sunrpc 
ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt 
target_core_mod ib_srp ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm 
iw_cm iw_cxgb3 ib_core ghash_generic xts gf128mul tpm_ibmvtpm tpm vmx_crypto 
pseries_rng sg binfmt_misc ip_tables xfs libcrc32c sr_mod sd_mod cdrom cxgb3 
ibmvscsi ibmveth scsi_transport_srp mdio dm_mirror dm_region_hash dm_log dm_mod
[  575.267904] CPU: 2 PID: 3837 Comm: drmgr Not tainted 
4.11.0-rc1-next-20170310 #4
[  575.267907] task: c0076f041600 task.stack: c0076f084000
[  575.267910] NIP: c1aa69c4 LR: c1aa69c0 CTR: 006338e4
[  575.267913] REGS: c0076f0878a0 TRAP: 0700   Not tainted  
(4.11.0-rc1-next-20170310)
[  575.267915] MSR: 80029033 
[  575.267920]   CR: 42002422  XER: 0007
[  575.267923] CFAR: c1edb5e0 SOFTE: 1 
[  575.267923] GPR00: c1aa69c0 c0076f087b20 c2605f00 
0026 
[  575.267923] GPR04:  80100fe93ec0 00492b9a 
 
[  575.267923] GPR08: 0001 0007 0006 
3ff0 
[  575.267923] GPR12: 2200 ce801200  
 
[  575.267923] GPR16:    
 
[  575.267923] GPR20:    
 
[  575.267923] GPR24:  c001db50a78c 10016650 
 
[  575.267923] GPR28: c001dd1a7500 c001dd1a7200 c001db50a780 
c001dd1a7258 
[  575.267955] NIP [c1aa69c4] refcount_sub_and_test+0xf4/0x110
[  575.267958] LR [c1aa69c0] refcount_sub_and_test+0xf0/0x110
[  575.267960] Call Trace:
[  575.267962] [c0076f087b20] [c1aa69c0] 
refcount_sub_and_test+0xf0/0x110 (unreliable)
[  575.267967] [c0076f087b80] [c1a84f1c] kobject_put+0x3c/0xa0
[  575.267972] [c0076f087bf0] [c1d239b4] of_node_put+0x24/0x40
[  575.267976] [c0076f087c10] [c165ce74] ofdt_write+0x204/0x6b0
[  575.267980] [c0076f087cd0] [c197bde0] proc_reg_write+0x80/0xd0
[  575.267984] [c0076f087d00] [c18df040] __vfs_write+0x40/0x1c0
[  575.267987] [c0076f087d90] [c18e0998] vfs_write+0xc8/0x240
[  575.267991] [c0076f087de0] [c18e2600] SyS_write+0x60/0x110
[  575.267995] [c0076f087e30] [c15cb184] system_call+0x38/0xe0
[  575.267997] Instruction dump:
[  575.267999] 7863d182 4e800020 7c0802a6 3921 3d42fff8 3c62ffb1 386306a0 
992afd41 
[  575.268004] f8010010 f821ffa1 48434be5 6000 <0fe0> 38210060 3860 
e8010010 
[  575.268010] ---[ end trace e6c0a4371a0aa4e2 ]—

Thanks
-Sachin


Re: [PATCH v2 16/23] MAINTAINERS: Add file patterns for powerpc device tree bindings

2017-03-15 Thread Geert Uytterhoeven
Hi Michael,

On Wed, Mar 15, 2017 at 10:30 AM, Michael Ellerman  wrote:
> Geert Uytterhoeven  writes:
>> On Wed, Mar 15, 2017 at 1:19 AM, Michael Ellerman  
>> wrote:
>>> Geert Uytterhoeven  writes:
 Submitters of device tree binding documentation may forget to CC
 the subsystem maintainer if this is missing.

 Signed-off-by: Geert Uytterhoeven 
 Cc: Benjamin Herrenschmidt 
 Cc: Paul Mackerras 
 Cc: Michael Ellerman 
 Cc: linuxppc-dev@lists.ozlabs.org
 ---
 Please apply this patch directly if you want to be involved in device
 tree binding documentation for your subsystem.

 v2:
   - Rebased on top of commit a42715830d552d7c ("MAINTAINERS: Remove
 powerpc's "opal" pattern match), which just added "powerpc/opal",
 while obviously the whole "powerpc" hierarchy is of interest.

 Impact on next-20170310:
>>>
>>> Actual diff ignoring ordering etc:
>>>
>>>  +Benjamin Herrenschmidt  (supporter:LINUX FOR 
>>> POWERPC (32-BIT AND 64-BIT))
>>>  +Paul Mackerras  (supporter:LINUX FOR POWERPC (32-BIT 
>>> AND 64-BIT))
>>>  +linuxppc-dev@lists.ozlabs.org (open list:LINUX FOR POWERPC (32-BIT AND 
>>> 64-BIT))
>>>  -Scott Wood  (commit_signer:5/11=45%)
>>>  -Zhao Qiang  (commit_signer:4/11=36%,authored:4/11=36%)
>>>  -Christian Lamparter  (commit_signer:1/11=9%)
>>>  -yangbo lu  (authored:1/11=9%)
>>>  -"Otto Kekäläinen"  (authored:1/11=9%)
>>>  -Chris Packham  (authored:1/11=9%)
>>>  -York Sun  (authored:1/11=9%)
>>>
>>> Which looks bad as all the NXP folks get dropped, but they should be
>>> reading linuxppc-dev. So I think I'll merge this, unless anyone
>>> disagrees.
>>
>> They got dropped because in the absence of a maintainer entry, the
>> last committers are listed. If they need to be listed, I can send patches to 
>> add
>> more specific entries for 4xx and fsl DT bindings, like:
>
> This is one is probably worth doing:
>
>> diff --git a/MAINTAINERS b/MAINTAINERS
>> index 0f1c2f96c99aa936..40b392a4f399adbe 100644
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -5266,6 +5266,7 @@ M:Scott Wood 
>>  L: linuxppc-dev@lists.ozlabs.org
>>  L: linux-arm-ker...@lists.infradead.org
>>  S: Maintained
>> +F: Documentation/devicetree/bindings/powerpc/fsl/
>>  F: drivers/soc/fsl/
>>  F: include/linux/fsl/
>
> I'll fold it in before applying?

Fine (less work) for me! Thanks!

> But I wouldn't bother with 4xx, just falling back to linuxppc-dev is
> fine for that.

OK.

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds


Re: [PATCH v2 16/23] MAINTAINERS: Add file patterns for powerpc device tree bindings

2017-03-15 Thread Michael Ellerman
Geert Uytterhoeven  writes:
> On Wed, Mar 15, 2017 at 1:19 AM, Michael Ellerman  wrote:
>> Geert Uytterhoeven  writes:
>>> Submitters of device tree binding documentation may forget to CC
>>> the subsystem maintainer if this is missing.
>>>
>>> Signed-off-by: Geert Uytterhoeven 
>>> Cc: Benjamin Herrenschmidt 
>>> Cc: Paul Mackerras 
>>> Cc: Michael Ellerman 
>>> Cc: linuxppc-dev@lists.ozlabs.org
>>> ---
>>> Please apply this patch directly if you want to be involved in device
>>> tree binding documentation for your subsystem.
>>>
>>> v2:
>>>   - Rebased on top of commit a42715830d552d7c ("MAINTAINERS: Remove
>>> powerpc's "opal" pattern match), which just added "powerpc/opal",
>>> while obviously the whole "powerpc" hierarchy is of interest.
>>>
>>> Impact on next-20170310:
>>
>> Actual diff ignoring ordering etc:
>>
>>  +Benjamin Herrenschmidt  (supporter:LINUX FOR 
>> POWERPC (32-BIT AND 64-BIT))
>>  +Paul Mackerras  (supporter:LINUX FOR POWERPC (32-BIT AND 
>> 64-BIT))
>>  +linuxppc-dev@lists.ozlabs.org (open list:LINUX FOR POWERPC (32-BIT AND 
>> 64-BIT))
>>  -Scott Wood  (commit_signer:5/11=45%)
>>  -Zhao Qiang  (commit_signer:4/11=36%,authored:4/11=36%)
>>  -Christian Lamparter  (commit_signer:1/11=9%)
>>  -yangbo lu  (authored:1/11=9%)
>>  -"Otto Kekäläinen"  (authored:1/11=9%)
>>  -Chris Packham  (authored:1/11=9%)
>>  -York Sun  (authored:1/11=9%)
>>
>> Which looks bad as all the NXP folks get dropped, but they should be
>> reading linuxppc-dev. So I think I'll merge this, unless anyone
>> disagrees.
>
> They got dropped because in the absence of a maintainer entry, the
> last committers are listed. If they need to be listed, I can send patches to 
> add
> more specific entries for 4xx and fsl DT bindings, like:

This is one is probably worth doing:

> diff --git a/MAINTAINERS b/MAINTAINERS
> index 0f1c2f96c99aa936..40b392a4f399adbe 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -5266,6 +5266,7 @@ M:Scott Wood 
>  L: linuxppc-dev@lists.ozlabs.org
>  L: linux-arm-ker...@lists.infradead.org
>  S: Maintained
> +F: Documentation/devicetree/bindings/powerpc/fsl/
>  F: drivers/soc/fsl/
>  F: include/linux/fsl/

I'll fold it in before applying?

But I wouldn't bother with 4xx, just falling back to linuxppc-dev is
fine for that.

> @@ -7540,6 +7541,7 @@ M:Matt Porter 
>  W: http://www.penguinppc.org/
>  L: linuxppc-dev@lists.ozlabs.org
>  S: Maintained
> +F: Documentation/devicetree/bindings/powerpc/4xx/
>  F: arch/powerpc/platforms/40x/
>  F: arch/powerpc/platforms/44x/

cheers


Re: [PATCH v5 2/5] powerpc: kretprobes: override default function entry offset

2017-03-15 Thread Naveen N. Rao
On 2017/03/14 10:18AM, Arnaldo Carvalho de Melo wrote:
> Em Thu, Mar 09, 2017 at 05:37:38PM +1100, Michael Ellerman escreveu:
> > "Naveen N. Rao"  writes:
> > > On 2017/03/08 11:29AM, Arnaldo Carvalho de Melo wrote:
> > >> > I wasn't sure if you were planning on picking up KPROBES_ON_FTRACE for 
> > >> > v4.11. If so, it would be good to take this patch through the powerpc 
> > >> > tree. Otherwise, this can go via Ingo's tree.
> > >> 
> > >> If you guys convince Ingo that this should go _now_, then just cherry
> > >> pick what was merged into tip/perf/core that is needed for the arch
> > >> specific stuff and go from there.
> > >
> > > Ok, in hindsight, I think Michael's concern was actually for v4.12 
> > 
> > Yes I was talking about 4.12, sorry I thought that was implied :)
> > 
> > > itself, in which case this particular patch can go via powerpc tree, 
> > > while the rest of the patches in this series can go via your tree.
> > >
> > > Michael?
> > 
> > Yeah I think that's the easiest option. The function will be temporarily
> > unused until the two trees are merged, but I think that's fine.
> 
> Ok, done that, now compile testing building it in my
> multi-distro/x-build containers.

Thanks, Arnaldo!

I did however notice that you don't seem to have applied Patch 1/5 from 
this series:
https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1347858.html

That patch is needed to ensure perf continues to work when ftrace README 
advertises support for ref_reloc_sym+offset for kretprobes. Can you 
please apply that as well?

- Naveen



Re: [PATCH v3 1/2] powerpc: split ftrace bits into a separate file

2017-03-15 Thread Naveen N. Rao
On 2017/03/10 11:54AM, Steven Rostedt wrote:
> On Fri, 10 Mar 2017 21:38:53 +0530
> "Naveen N. Rao"  wrote:
> 
> > On 2017/03/10 10:45AM, Steven Rostedt wrote:
> > > On Thu, 02 Mar 2017 20:38:53 +1100
> > > Michael Ellerman  wrote:
> 
> > > > So if we drop that we're left with ftrace.S - which seems perfect to 
> > > > me.  
> > > 
> > > Yeah, I agree. But then there's the problem that ftrace.c and ftrace.S
> > > will get the same ftrace.o. Maybe make it ftrace-hook.S ?  
> > 
> > I've avoided that issue by naming the files ftrace_32.S and ftrace_64.S 
> > (which gets further split up).
> 
> That's what I looked at doing for x86 as well. But not all archs have
> 32 / 64 splits. Should we look to have something that all archs can be
> consistent with?

I don't have a strong opinion about this, but I feel that x86 can simply 
use ftrace_64.S, seeing as the current name is mcount_64.S.

Other architectures could do something similar too, or fall back to 
ftrace_hook.S. That way, all ftrace low-level code can simply be 
referred to as arch/*/ftrace_*.S

- Naveen



Re: [PATCH 1/2] powerpc/mm: Refactor page table allocation

2017-03-15 Thread Michael Ellerman
Balbir Singh  writes:

> Introduce a helper pgtable_get_gfp_flags() which

Can we just call it pgtable_gfp_flags() ?

> just returns the current gfp flags. In a future
> patch, we can enable __GFP_ACCOUNT based on the
> calling context.
>
> Signed-off-by: Balbir Singh 
> ---
>  arch/powerpc/include/asm/book3s/64/pgalloc.h | 22 --
>  arch/powerpc/mm/pgtable_64.c |  3 ++-

It looks like you've only updated the 64-bit Book3S sites. Can you
please do all of them.

I think this is the full list:

  arch/powerpc/include/asm/book3s/32/pgalloc.h:static inline pgd_t 
*pgd_alloc(struct mm_struct *mm)
  arch/powerpc/include/asm/book3s/64/pgalloc.h:static inline pgd_t 
*radix__pgd_alloc(struct mm_struct *mm)
  arch/powerpc/include/asm/book3s/64/pgalloc.h:static inline pgd_t 
*pgd_alloc(struct mm_struct *mm)
  arch/powerpc/include/asm/nohash/32/pgalloc.h:static inline pgd_t 
*pgd_alloc(struct mm_struct *mm)
  arch/powerpc/include/asm/nohash/64/pgalloc.h:static inline pgd_t 
*pgd_alloc(struct mm_struct *mm)
  
  arch/powerpc/include/asm/book3s/64/pgalloc.h:static inline pud_t 
*pud_alloc_one(struct mm_struct *mm, unsigned long addr)
  arch/powerpc/include/asm/nohash/64/pgalloc.h:static inline pud_t 
*pud_alloc_one(struct mm_struct *mm, unsigned long addr)
  
  arch/powerpc/include/asm/book3s/64/pgalloc.h:static inline pmd_t 
*pmd_alloc_one(struct mm_struct *mm, unsigned long addr)
  arch/powerpc/include/asm/nohash/64/pgalloc.h:static inline pmd_t 
*pmd_alloc_one(struct mm_struct *mm, unsigned long addr)
  
  arch/powerpc/include/asm/book3s/64/pgalloc.h:static inline pgtable_t 
pte_alloc_one(struct mm_struct *mm,
  arch/powerpc/include/asm/book3s/64/pgalloc.h:static inline pgtable_t 
pte_alloc_one(struct mm_struct *mm,
  arch/powerpc/include/asm/nohash/64/pgalloc.h:static inline pgtable_t 
pte_alloc_one(struct mm_struct *mm,
  arch/powerpc/include/asm/nohash/64/pgalloc.h:static inline pgtable_t 
pte_alloc_one(struct mm_struct *mm,
  arch/powerpc/mm/pgtable_32.c:pgtable_t pte_alloc_one(struct mm_struct *mm, 
unsigned long address)
  
  arch/powerpc/mm/hugetlbpage.c:pte_t *huge_pte_alloc(struct mm_struct *mm, 
unsigned long addr, unsigned long sz)

> diff --git a/arch/powerpc/include/asm/book3s/64/pgalloc.h 
> b/arch/powerpc/include/asm/book3s/64/pgalloc.h
> index cd5e7aa..d0a9ca6 100644
> --- a/arch/powerpc/include/asm/book3s/64/pgalloc.h
> +++ b/arch/powerpc/include/asm/book3s/64/pgalloc.h
> @@ -159,7 +168,8 @@ static inline pgtable_t pmd_pgtable(pmd_t pmd)
>  static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
> unsigned long address)
>  {
> - return (pte_t *)__get_free_page(GFP_KERNEL | __GFP_ZERO);
> + return (pte_t *)__get_free_page(
> + pgtable_get_gfp_flags(mm, PGALLOC_GFP));
>  }

There's no point doing pte_alloc_one_kernel(), it's explicitly for
kernel allocations IIUI.

cheers


[v2 PATCH] powernv-cpuidle: Validate DT property array size

2017-03-15 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

The various properties associated with powernv idle states such as
names, flags, residency-ns, latencies-ns, psscr, psscr-mask are
exposed in the device-tree as property arrays such the pointwise
entries in each of these arrays correspond to the properties of the
same idle state.

This patch validates that the lengths of the property arrays are the
same. If there is a mismatch, the patch will ensure that we bail out
and not expose the platform idle states via cpuidle.

Signed-off-by: Gautham R. Shenoy 
---
v1: https://lkml.org/lkml/2017/2/23/349
Changes from v1: Print the full property array name in warning message.

 drivers/cpuidle/cpuidle-powernv.c | 64 +--
 1 file changed, 61 insertions(+), 3 deletions(-)

diff --git a/drivers/cpuidle/cpuidle-powernv.c 
b/drivers/cpuidle/cpuidle-powernv.c
index 3705930..a06df51 100644
--- a/drivers/cpuidle/cpuidle-powernv.c
+++ b/drivers/cpuidle/cpuidle-powernv.c
@@ -197,11 +197,25 @@ static inline void add_powernv_state(int index, const 
char *name,
stop_psscr_table[index].mask = psscr_mask;
 }
 
+/*
+ * Returns 0 if prop1_len == prop2_len. Else returns -1
+ */
+static inline int validate_dt_prop_sizes(const char *prop1, int prop1_len,
+const char *prop2, int prop2_len)
+{
+   if (prop1_len == prop2_len)
+   return 0;
+
+   pr_warn("cpuidle-powernv: array sizes don't match for %s and %s\n",
+   prop1, prop2);
+   return -1;
+}
+
 static int powernv_add_idle_states(void)
 {
struct device_node *power_mgt;
int nr_idle_states = 1; /* Snooze */
-   int dt_idle_states;
+   int dt_idle_states, count;
u32 latency_ns[CPUIDLE_STATE_MAX];
u32 residency_ns[CPUIDLE_STATE_MAX];
u32 flags[CPUIDLE_STATE_MAX];
@@ -226,6 +240,21 @@ static int powernv_add_idle_states(void)
goto out;
}
 
+   count = of_property_count_u32_elems(power_mgt,
+   "ibm,cpu-idle-state-latencies-ns");
+
+   if (validate_dt_prop_sizes("ibm,cpu-idle-state-flags", dt_idle_states,
+  "ibm,cpu-idle-state-latencies-ns",
+  count) != 0)
+   goto out;
+
+   count = of_property_count_strings(power_mgt,
+ "ibm,cpu-idle-state-names");
+   if (validate_dt_prop_sizes("ibm,cpu-idle-state-flags", dt_idle_states,
+  "ibm,cpu-idle-state-names",
+  count) != 0)
+   goto out;
+
/*
 * Since snooze is used as first idle state, max idle states allowed is
 * CPUIDLE_STATE_MAX -1
@@ -260,6 +289,22 @@ static int powernv_add_idle_states(void)
has_stop_states = (flags[0] &
   (OPAL_PM_STOP_INST_FAST | OPAL_PM_STOP_INST_DEEP));
if (has_stop_states) {
+   count = of_property_count_u64_elems(power_mgt,
+   "ibm,cpu-idle-state-psscr");
+   if (validate_dt_prop_sizes("ibm,cpu-idle-state-flags",
+  dt_idle_states,
+  "ibm,cpu-idle-state-psscr",
+  count) != 0)
+   goto out;
+
+   count = of_property_count_u64_elems(power_mgt,
+   
"ibm,cpu-idle-state-psscr-mask");
+   if (validate_dt_prop_sizes("ibm,cpu-idle-state-flags",
+  dt_idle_states,
+  "ibm,cpu-idle-state-psscr-mask",
+  count) != 0)
+   goto out;
+
if (of_property_read_u64_array(power_mgt,
"ibm,cpu-idle-state-psscr", psscr_val, dt_idle_states)) {
pr_warn("cpuidle-powernv: missing 
ibm,cpu-idle-state-psscr in DT\n");
@@ -274,8 +319,21 @@ static int powernv_add_idle_states(void)
}
}
 
-   rc = of_property_read_u32_array(power_mgt,
-   "ibm,cpu-idle-state-residency-ns", residency_ns, 
dt_idle_states);
+   count = of_property_count_u32_elems(power_mgt,
+   "ibm,cpu-idle-state-residency-ns");
+
+   if (count < 0) {
+   rc = count;
+   } else if (validate_dt_prop_sizes("ibm,cpu-idle-state-flags",
+ dt_idle_states,
+ "ibm,cpu-idle-state-residency-ns",
+ count) != 0) {
+   goto out;
+   } else {
+   rc = of_property_read_u32_array(power_mgt,
+   

Re: [PATCH v2 16/23] MAINTAINERS: Add file patterns for powerpc device tree bindings

2017-03-15 Thread Geert Uytterhoeven
Hi Michael,

On Wed, Mar 15, 2017 at 1:19 AM, Michael Ellerman  wrote:
> Geert Uytterhoeven  writes:
>> Submitters of device tree binding documentation may forget to CC
>> the subsystem maintainer if this is missing.
>>
>> Signed-off-by: Geert Uytterhoeven 
>> Cc: Benjamin Herrenschmidt 
>> Cc: Paul Mackerras 
>> Cc: Michael Ellerman 
>> Cc: linuxppc-dev@lists.ozlabs.org
>> ---
>> Please apply this patch directly if you want to be involved in device
>> tree binding documentation for your subsystem.
>>
>> v2:
>>   - Rebased on top of commit a42715830d552d7c ("MAINTAINERS: Remove
>> powerpc's "opal" pattern match), which just added "powerpc/opal",
>> while obviously the whole "powerpc" hierarchy is of interest.
>>
>> Impact on next-20170310:
>
> Actual diff ignoring ordering etc:
>
>  +Benjamin Herrenschmidt  (supporter:LINUX FOR 
> POWERPC (32-BIT AND 64-BIT))
>  +Paul Mackerras  (supporter:LINUX FOR POWERPC (32-BIT AND 
> 64-BIT))
>  +linuxppc-dev@lists.ozlabs.org (open list:LINUX FOR POWERPC (32-BIT AND 
> 64-BIT))
>  -Scott Wood  (commit_signer:5/11=45%)
>  -Zhao Qiang  (commit_signer:4/11=36%,authored:4/11=36%)
>  -Christian Lamparter  (commit_signer:1/11=9%)
>  -yangbo lu  (authored:1/11=9%)
>  -"Otto Kekäläinen"  (authored:1/11=9%)
>  -Chris Packham  (authored:1/11=9%)
>  -York Sun  (authored:1/11=9%)
>
> Which looks bad as all the NXP folks get dropped, but they should be
> reading linuxppc-dev. So I think I'll merge this, unless anyone
> disagrees.

They got dropped because in the absence of a maintainer entry, the
last committers are listed. If they need to be listed, I can send patches to add
more specific entries for 4xx and fsl DT bindings, like:

diff --git a/MAINTAINERS b/MAINTAINERS
index 0f1c2f96c99aa936..40b392a4f399adbe 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5266,6 +5266,7 @@ M:Scott Wood 
 L: linuxppc-dev@lists.ozlabs.org
 L: linux-arm-ker...@lists.infradead.org
 S: Maintained
+F: Documentation/devicetree/bindings/powerpc/fsl/
 F: drivers/soc/fsl/
 F: include/linux/fsl/

@@ -7540,6 +7541,7 @@ M:Matt Porter 
 W: http://www.penguinppc.org/
 L: linuxppc-dev@lists.ozlabs.org
 S: Maintained
+F: Documentation/devicetree/bindings/powerpc/4xx/
 F: arch/powerpc/platforms/40x/
 F: arch/powerpc/platforms/44x/


OK?

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds


Re: [PATCH v2 1/6] powerpc/perf: Define big-endian version of perf_mem_data_src

2017-03-15 Thread Michael Ellerman
Hi Peter,

Peter Zijlstra  writes:
> On Tue, Mar 14, 2017 at 02:31:51PM +0530, Madhavan Srinivasan wrote:
>
>> >Huh? PPC hasn't yet implemented this? Then why are you fixing it?
>> 
>> yes, PPC hasn't implemented this (until now).
>
> until now where?

On powerpc there is currently no kernel support for filling the data_src
value with anything meaningful.

A user can still request PERF_SAMPLE_DATA_SRC (perf report -d), but they
just get the default value from perf_sample_data_init(), which is
PERF_MEM_NA.

Though even that is currently broken with a big endian perf tool.

>> And did not understand "Then why are you fixing it?"
>
> I see no implementation; so why are you poking at it.

Maddy has posted an implementation of the kernel part for powerpc in
patch 2 of this series, but maybe you're not on Cc?


Regardless of us wanting to do the kernel side on powerpc, the current
API is broken on big endian.

That's because in the kernel the PERF_MEM_NA value is constructed using
shifts:

  /* TLB access */
  #define PERF_MEM_TLB_NA   0x01 /* not available */
  ...
  #define PERF_MEM_TLB_SHIFT26
  
  #define PERF_MEM_S(a, s) \
(((__u64)PERF_MEM_##a##_##s) << PERF_MEM_##a##_SHIFT)
  
  #define PERF_MEM_NA (PERF_MEM_S(OP, NA)   |\
PERF_MEM_S(LVL, NA)   |\
PERF_MEM_S(SNOOP, NA) |\
PERF_MEM_S(LOCK, NA)  |\
PERF_MEM_S(TLB, NA))

Which works out as:

  ((0x01 << 0) | (0x01 << 5) | (0x01 << 19) | (0x01 << 24) | (0x01 << 26))


Which means the PERF_MEM_NA value comes out of the kernel as 0x5080021
in CPU endian.

But then in the perf tool, the code uses the bitfields to inspect the
value, and currently the bitfields are defined using little endian
ordering.

So eg. in perf_mem__tlb_scnprintf() we see:
  data_src->val = 0x5080021
 op = 0x0
lvl = 0x0
  snoop = 0x0
   lock = 0x0
   dtlb = 0x0
   rsvd = 0x5080021


So this patch does what I think is the minimal fix, of changing the
definition of the bitfields to match the values that are already
exported by the kernel on big endian. And it makes no change on little
endian.

cheers