Re: [PATCH v2 1/6] powerpc/perf: Define big-endian version of perf_mem_data_src
On Wednesday 15 March 2017 05:53 PM, Peter Zijlstra wrote: On Wed, Mar 15, 2017 at 05:20:15PM +1100, Michael Ellerman wrote: I see no implementation; so why are you poking at it. Maddy has posted an implementation of the kernel part for powerpc in patch 2 of this series, but maybe you're not on Cc? I am not indeed. That and a completely inadequate Changelog have lead to great confusion. Yes. my bad. I will send out a v3 today and will CC. Also will add ellerman's explanation to the commit message. Sorry for the confusion. Maddy
Re: [PATCH v2 1/6] powerpc/perf: Define big-endian version of perf_mem_data_src
On Wednesday 15 March 2017 11:50 AM, Michael Ellerman wrote: Hi Peter, Peter Zijlstrawrites: On Tue, Mar 14, 2017 at 02:31:51PM +0530, Madhavan Srinivasan wrote: Huh? PPC hasn't yet implemented this? Then why are you fixing it? yes, PPC hasn't implemented this (until now). until now where? On powerpc there is currently no kernel support for filling the data_src value with anything meaningful. A user can still request PERF_SAMPLE_DATA_SRC (perf report -d), but they just get the default value from perf_sample_data_init(), which is PERF_MEM_NA. Though even that is currently broken with a big endian perf tool. And did not understand "Then why are you fixing it?" I see no implementation; so why are you poking at it. Maddy has posted an implementation of the kernel part for powerpc in patch 2 of this series, but maybe you're not on Cc? Sorry, was out yesterday. Yes my bad. I CCed lkml and ppcdev and took the emails from get_maintainer script and added to each file. I will send out a v3 with peterz and others in all patch. Regardless of us wanting to do the kernel side on powerpc, the current API is broken on big endian. That's because in the kernel the PERF_MEM_NA value is constructed using shifts: /* TLB access */ #define PERF_MEM_TLB_NA 0x01 /* not available */ ... #define PERF_MEM_TLB_SHIFT 26 #define PERF_MEM_S(a, s) \ (((__u64)PERF_MEM_##a##_##s) << PERF_MEM_##a##_SHIFT) #define PERF_MEM_NA (PERF_MEM_S(OP, NA) |\ PERF_MEM_S(LVL, NA) |\ PERF_MEM_S(SNOOP, NA) |\ PERF_MEM_S(LOCK, NA) |\ PERF_MEM_S(TLB, NA)) Which works out as: ((0x01 << 0) | (0x01 << 5) | (0x01 << 19) | (0x01 << 24) | (0x01 << 26)) Which means the PERF_MEM_NA value comes out of the kernel as 0x5080021 in CPU endian. But then in the perf tool, the code uses the bitfields to inspect the value, and currently the bitfields are defined using little endian ordering. So eg. in perf_mem__tlb_scnprintf() we see: data_src->val = 0x5080021 op = 0x0 lvl = 0x0 snoop = 0x0 lock = 0x0 dtlb = 0x0 rsvd = 0x5080021 So this patch does what I think is the minimal fix, of changing the definition of the bitfields to match the values that are already exported by the kernel on big endian. And it makes no change on little endian. Thanks for the detailed explanation. I will add this to the patch commit message in the v3. Maddy cheers
Re: [PATCH v3 4/7] macintosh: Only descend into directory when CONFIG_MACINTOSH_DRIVERS is set
"Andrew F. Davis"writes: > When CONFIG_MACINTOSH_DRIVERS is not set make will still descend into the > macintosh directory but nothing will be built. This produces unneeded > build artifacts and messages in addition to slowing the build. > Fix this here. > > Signed-off-by: Andrew F. Davis > --- > drivers/Makefile | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) LGTM. Acked-by: Michael Ellerman cheersj
Re: [PATCH kernel v8 10/10] KVM: PPC: VFIO: Add in-kernel acceleration for VFIO
On Wed, Mar 15, 2017 at 10:18:18AM -0600, Alex Williamson wrote: > On Wed, 15 Mar 2017 15:40:14 +1100 > David Gibsonwrote: > > > > diff --git a/arch/powerpc/kvm/book3s_64_vio.c > > > > b/arch/powerpc/kvm/book3s_64_vio.c > > > > index e96a4590464c..be18cda01e1b 100644 > > > > --- a/arch/powerpc/kvm/book3s_64_vio.c > > > > +++ b/arch/powerpc/kvm/book3s_64_vio.c > > > > @@ -28,6 +28,10 @@ > > > > #include > > > > #include > > > > #include > > > > +#include > > > > +#include > > > > +#include > > > > +#include > > > > > > > > #include > > > > #include > > > > @@ -40,6 +44,36 @@ > > > > #include > > > > #include > > > > #include > > > > +#include > > > > + > > > > +static void kvm_vfio_group_put_external_user(struct vfio_group > > > > *vfio_group) > > > > +{ > > > > + void (*fn)(struct vfio_group *); > > > > + > > > > + fn = symbol_get(vfio_group_put_external_user); > > > > + if (WARN_ON(!fn)) > > > > + return; > > > > + > > > > + fn(vfio_group); > > > > + > > > > + symbol_put(vfio_group_put_external_user); > > > > +} > > > > + > > > > +static int kvm_vfio_external_user_iommu_id(struct vfio_group > > > > *vfio_group) > > > > +{ > > > > + int (*fn)(struct vfio_group *); > > > > + int ret = -1; > > > > + > > > > + fn = symbol_get(vfio_external_user_iommu_id); > > > > + if (!fn) > > > > + return ret; > > > > + > > > > + ret = fn(vfio_group); > > > > + > > > > + symbol_put(vfio_external_user_iommu_id); > > > > + > > > > + return ret; > > > > +} > > > > > > > > > Ugh. This feels so wrong. Why can't you have kvm-vfio pass the > > > iommu_group? Why do you need to hold this additional vfio_group > > > reference? > > > > Keeping the vfio_group reference makes sense to me, since we don't > > want the vfio context for the group to go away while it's attached to > > the LIOBN. > > But there's already a reference for that, it's taken by > KVM_DEV_VFIO_GROUP_ADD and held until KVM_DEV_VFIO_GROUP_DEL. Both the > DEL path and the cleanup path call kvm_spapr_tce_release_iommu_group() > before releasing that reference, so it seems entirely redundant. Oh, good point. And we already verify that the group has been ADDed before setting the LIOBN association. > > However, going via the iommu_id rather than just having an interface > > to directly grab the iommu group from the vfio_group seems bizarre to > > me. I'm ok with cleaning that up later, however. > > We have kvm_spapr_tce_attach_iommu_group() and > kvm_spapr_tce_release_iommu_group(), but both take a vfio_group, not an > iommu_group as a parameter. I don't particularly have a problem with > the vfio_group -> iommu ID -> iommu_group, but if we drop the extra > vfio_group reference and pass the iommu_group itself to these functions > then we can keep all the symbol reference stuff in the kvm-vfio glue > layer. Thanks, Makes sense. -- David Gibson| I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson signature.asc Description: PGP signature
[PATCH] powerpc: Teach oops about radix vectors
Currently if we oops caused by an 0x380 or 0x480 exception, we get a print which assumes SLB problems. With radix, these vectors have different meanings. This patch updates the oops message to reflect these different meanings. Signed-off-by: Michael Neuling--- arch/powerpc/xmon/xmon.c | 14 -- 1 file changed, 12 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c index 16321ad9e7..7781f78964 100644 --- a/arch/powerpc/xmon/xmon.c +++ b/arch/powerpc/xmon/xmon.c @@ -1347,9 +1347,19 @@ const char *getvecname(unsigned long vec) case 0x100: ret = "(System Reset)"; break; case 0x200: ret = "(Machine Check)"; break; case 0x300: ret = "(Data Access)"; break; - case 0x380: ret = "(Data SLB Access)"; break; + case 0x380: + if (radix_enabled()) + ret = "(Data Access Out of Range)"; + else + ret = "(Data SLB Access)"; + break; case 0x400: ret = "(Instruction Access)"; break; - case 0x480: ret = "(Instruction SLB Access)"; break; + case 0x480: + if (radix_enabled()) + ret = "(Instruction Access Out of Range)"; + else + ret = "(Instruction SLB Access)"; + break; case 0x500: ret = "(Hardware Interrupt)"; break; case 0x600: ret = "(Alignment)"; break; case 0x700: ret = "(Program Check)"; break; -- 2.9.3
Re: [PATCH kernel v8 10/10] KVM: PPC: VFIO: Add in-kernel acceleration for VFIO
On 16/03/17 03:39, Alex Williamson wrote: > On Thu, 16 Mar 2017 00:21:07 +1100 > Alexey Kardashevskiywrote: > >> On 15/03/17 08:05, Alex Williamson wrote: >>> On Fri, 10 Mar 2017 14:53:37 +1100 >>> Alexey Kardashevskiy wrote: >>> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT and H_STUFF_TCE requests targeted an IOMMU TCE table used for VFIO without passing them to user space which saves time on switching to user space and back. This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM. KVM tries to handle a TCE request in the real mode, if failed it passes the request to the virtual mode to complete the operation. If it a virtual mode handler fails, the request is passed to the user space; this is not expected to happen though. To avoid dealing with page use counters (which is tricky in real mode), this only accelerates SPAPR TCE IOMMU v2 clients which are required to pre-register the userspace memory. The very first TCE request will be handled in the VFIO SPAPR TCE driver anyway as the userspace view of the TCE table (iommu_table::it_userspace) is not allocated till the very first mapping happens and we cannot call vmalloc in real mode. If we fail to update a hardware IOMMU table unexpected reason, we just clear it and move on as there is nothing really we can do about it - for example, if we hot plug a VFIO device to a guest, existing TCE tables will be mirrored automatically to the hardware and there is no interface to report to the guest about possible failures. This adds new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd and associates a physical IOMMU table with the SPAPR TCE table (which is a guest view of the hardware IOMMU table). The iommu_table object is cached and referenced so we do not have to look up for it in real mode. This does not implement the UNSET counterpart as there is no use for it - once the acceleration is enabled, the existing userspace won't disable it unless a VFIO container is destroyed; this adds necessary cleanup to the KVM_DEV_VFIO_GROUP_DEL handler. As this creates a descriptor per IOMMU table-LIOBN couple (called kvmppc_spapr_tce_iommu_table), it is possible to have several descriptors with the same iommu_table (hardware IOMMU table) attached to the same LIOBN; we do not remove duplicates though as iommu_table_ops::exchange not just update a TCE entry (which is shared among IOMMU groups) but also invalidates the TCE cache (one per IOMMU group). This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user space. This adds real mode version of WARN_ON_ONCE() as the generic version causes problems with rcu_sched. Since we testing what vmalloc_to_phys() returns in the code, this also adds a check for already existing vmalloc_to_phys() call in kvmppc_rm_h_put_tce_indirect(). This finally makes use of vfio_external_user_iommu_id() which was introduced quite some time ago and was considered for removal. Tests show that this patch increases transmission speed from 220MB/s to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card). Signed-off-by: Alexey Kardashevskiy --- Changes: v8: * changed all (!pua) checks to return H_TOO_HARD as ioctl() is supposed to handle them * changed vmalloc_to_phys() callers to return H_HARDWARE * changed real mode iommu_tce_xchg_rm() callers to return H_TOO_HARD and added a comment about this in the code * changed virtual mode iommu_tce_xchg() callers to return H_HARDWARE and do WARN_ON * added WARN_ON_ONCE_RM(!rmap) in kvmppc_rm_h_put_tce_indirect() to have all vmalloc_to_phys() callsites covered v7: * added realmode-friendly WARN_ON_ONCE_RM v6: * changed handling of errors returned by kvmppc_(rm_)tce_iommu_(un)map() * moved kvmppc_gpa_to_ua() to TCE validation v5: * changed error codes in multiple places * added bunch of WARN_ON() in places which should not really happen * adde a check that an iommu table is not attached already to LIOBN * dropped explicit calls to iommu_tce_clear_param_check/ iommu_tce_put_param_check as kvmppc_tce_validate/kvmppc_ioba_validate call them anyway (since the previous patch) * if we fail to update a hardware IOMMU table for unexpected reason, this just clears the entry v4: * added note to the commit log about allowing multiple updates of the same IOMMU table; * instead of checking for if any memory was preregistered, this returns H_TOO_HARD if a specific page was not; * fixed
Re: [PATCH v3 0/7] Remove unneeded build directory traversals
On 03/15/2017 04:03 PM, Arnd Bergmann wrote: > On Wed, Mar 15, 2017 at 5:37 PM, Andrew F. Daviswrote: >> Hello all, >> >> I was building a kernel for x86 and noticed Make still descended into >> directories like drivers/gpu/drm/hisilicon, this seems kind of odd given >> nothing will be built here. It looks to be due to some directories being >> included in obj-y unconditionally instead of only when the relevant >> CONFIG_ is set. >> >> These patches are split by subsystem in-case, for some reason, a file in >> a directory does need to be built, I believe I have checked for all >> instances of this, but a quick review from some maintainers would be nice. > > I didn't see anything wrong with the patches, and made sure that there > are no tristate symbols controlling the subdirectory for anything that > requires a built-in driver (which would cause a link failure). > > I'm not sure about drivers/lguest, which has some special magic > in its Makefile, it's possible that this now fails with CONFIG_LGUEST=m. > lguest and mmc are the strange ones, so I put them last in the series in case they did need to be dropped. lguest was supposed to have been taken from v1: https://lkml.org/lkml/2016/6/20/1086 but it looks like it didn't so I re-introduced it for v3. mmc caught some 0-day build warnings but I never got to the bottom of them. Anyway, I have no problem with these two being held back until the magic in their Makefile is sorted out. Thanks, Andrew > Arnd >
RE: [PATCH 1/4] crypto: powerpc - Factor out the core CRC vpmsum algorithm
Hi David, > While not part of this change, the unrolled loops look as though > they just destroy the cpu cache. > I'd like be convinced that anything does CRC over long enough buffers > to make it a gain at all. > > With modern (not that modern now) superscalar cpus you can often > get the loop instructions 'for free'. > Sometimes pipelining the loop is needed to get full throughput. > Unlike the IP checksum, you don't even have to 'loop carry' the > cpu carry flag. Internal testing on a NVMe device with T10DIF enabled on 4k blocks shows a 20x - 30x improvement. Without these patches, crc_t10dif_generic uses over 60% of CPU time - with these patches CRC drops to single digits. I should probably have lead with that, sorry. FWIW, the original patch showed a 3.7x gain on btrfs as well - 6dd7a82cc54e ("crypto: powerpc - Add POWER8 optimised crc32c") When Anton wrote the original code he had access to IBM's internal tooling for looking at how instructions flow through the various stages of the CPU, so I trust it's pretty much optimal from that point of view. Regards, Daniel
Re: [RFC PATCH 00/13] Introduce first class virtual address spaces
On Wed, 15 Mar 2017, Rich Felker wrote: > On Wed, Mar 15, 2017 at 12:44:47PM -0700, Till Smejkal wrote: > > On Wed, 15 Mar 2017, Andy Lutomirski wrote: > > > > One advantage of VAS segments is that they can be globally queried by > > > > user programs > > > > which means that VAS segments can be shared by applications that not > > > > necessarily have > > > > to be related. If I am not mistaken, MAP_SHARED of pure in memory data > > > > will only work > > > > if the tasks that share the memory region are related (aka. have a > > > > common parent that > > > > initialized the shared mapping). Otherwise, the shared mapping have to > > > > be backed by a > > > > file. > > > > > > What's wrong with memfd_create()? > > > > > > > VAS segments on the other side allow sharing of pure in memory data by > > > > arbitrary related tasks without the need of a file. This becomes > > > > especially > > > > interesting if one combines VAS segments with non-volatile memory since > > > > one can keep > > > > data structures in the NVM and still be able to share them between > > > > multiple tasks. > > > > > > What's wrong with regular mmap? > > > > I never wanted to say that there is something wrong with regular mmap. We > > just > > figured that with VAS segments you could remove the need to mmap your > > shared data but > > instead can keep everything purely in memory. > > > > Unfortunately, I am not at full speed with memfds. Is my understanding > > correct that > > if the last user of such a file descriptor closes it, the corresponding > > memory is > > freed? Accordingly, memfd cannot be used to keep data in memory while no > > program is > > currently using it, can it? To be able to do this you need again some > > representation > > I have a name for application-allocated kernel resources that persist > without a process holding a reference to them or a node in the > filesystem: a bug. See: sysvipc. VAS segments are first class citizens of the OS similar to processes. Accordingly, I would not see this behavior as a bug. VAS segments are a kernel handle to "persistent" memory (in the sense that they are independent of the lifetime of the application that created them). That means the memory that is described by VAS segments can be reused by other applications even if the VAS segment was not used by any application in between. It is very much like a pure in-memory file. An application creates a VAS segment, fills it with content and if it does not delete it again, can reuse/open it again later. This also means, that if you know that you never want to use this memory again you have to remove it explicitly, like you have to remove a file, if you don't want to use it anymore. I think it really might be better to implement VAS segments (if I should keep this feature at all) with a special purpose filesystem. The way I've designed it seams to be very misleading. Till
Re: [RFC PATCH 00/13] Introduce first class virtual address spaces
On Wed, Mar 15, 2017 at 12:44 PM, Till Smejkalwrote: > On Wed, 15 Mar 2017, Andy Lutomirski wrote: >> > One advantage of VAS segments is that they can be globally queried by user >> > programs >> > which means that VAS segments can be shared by applications that not >> > necessarily have >> > to be related. If I am not mistaken, MAP_SHARED of pure in memory data >> > will only work >> > if the tasks that share the memory region are related (aka. have a common >> > parent that >> > initialized the shared mapping). Otherwise, the shared mapping have to be >> > backed by a >> > file. >> >> What's wrong with memfd_create()? >> >> > VAS segments on the other side allow sharing of pure in memory data by >> > arbitrary related tasks without the need of a file. This becomes especially >> > interesting if one combines VAS segments with non-volatile memory since >> > one can keep >> > data structures in the NVM and still be able to share them between >> > multiple tasks. >> >> What's wrong with regular mmap? > > I never wanted to say that there is something wrong with regular mmap. We just > figured that with VAS segments you could remove the need to mmap your shared > data but > instead can keep everything purely in memory. memfd does that. > > Unfortunately, I am not at full speed with memfds. Is my understanding > correct that > if the last user of such a file descriptor closes it, the corresponding > memory is > freed? Accordingly, memfd cannot be used to keep data in memory while no > program is > currently using it, can it? No, stop right here. If you want to have a bunch of memory that outlives the program that allocates it, use a filesystem (tmpfs, hugetlbfs, ext4, whatever). Don't create new persistent kernel things. > VAS segments on the other side would provide a functionality to > achieve the same without the need of any mounted filesystem. However, I > agree, that > this is just a small advantage compared to what can already be achieved with > the > existing functionality provided by the Linux kernel. I see this "small advantage" as "resource leak and security problem". >> This sounds complicated and fragile. What happens if a heuristically >> shared region coincides with a region in the "first class address >> space" being selected? > > If such a conflict happens, the task cannot use the first class address space > and the > corresponding system call will return an error. However, with the current > available > virtual address space size that programs can use, such conflicts are probably > rare. A bug that hits 1% of the time is often worse than one that hits 100% of the time because debugging it is miserable. --Andy
Re: [RFC PATCH 00/13] Introduce first class virtual address spaces
On Wed, Mar 15, 2017 at 12:44:47PM -0700, Till Smejkal wrote: > On Wed, 15 Mar 2017, Andy Lutomirski wrote: > > > One advantage of VAS segments is that they can be globally queried by > > > user programs > > > which means that VAS segments can be shared by applications that not > > > necessarily have > > > to be related. If I am not mistaken, MAP_SHARED of pure in memory data > > > will only work > > > if the tasks that share the memory region are related (aka. have a common > > > parent that > > > initialized the shared mapping). Otherwise, the shared mapping have to be > > > backed by a > > > file. > > > > What's wrong with memfd_create()? > > > > > VAS segments on the other side allow sharing of pure in memory data by > > > arbitrary related tasks without the need of a file. This becomes > > > especially > > > interesting if one combines VAS segments with non-volatile memory since > > > one can keep > > > data structures in the NVM and still be able to share them between > > > multiple tasks. > > > > What's wrong with regular mmap? > > I never wanted to say that there is something wrong with regular mmap. We just > figured that with VAS segments you could remove the need to mmap your shared > data but > instead can keep everything purely in memory. > > Unfortunately, I am not at full speed with memfds. Is my understanding > correct that > if the last user of such a file descriptor closes it, the corresponding > memory is > freed? Accordingly, memfd cannot be used to keep data in memory while no > program is > currently using it, can it? To be able to do this you need again some > representation I have a name for application-allocated kernel resources that persist without a process holding a reference to them or a node in the filesystem: a bug. See: sysvipc. Rich
Re: [RFC PATCH 00/13] Introduce first class virtual address spaces
On Wed, 15 Mar 2017, Andy Lutomirski wrote: > > One advantage of VAS segments is that they can be globally queried by user > > programs > > which means that VAS segments can be shared by applications that not > > necessarily have > > to be related. If I am not mistaken, MAP_SHARED of pure in memory data will > > only work > > if the tasks that share the memory region are related (aka. have a common > > parent that > > initialized the shared mapping). Otherwise, the shared mapping have to be > > backed by a > > file. > > What's wrong with memfd_create()? > > > VAS segments on the other side allow sharing of pure in memory data by > > arbitrary related tasks without the need of a file. This becomes especially > > interesting if one combines VAS segments with non-volatile memory since one > > can keep > > data structures in the NVM and still be able to share them between multiple > > tasks. > > What's wrong with regular mmap? I never wanted to say that there is something wrong with regular mmap. We just figured that with VAS segments you could remove the need to mmap your shared data but instead can keep everything purely in memory. Unfortunately, I am not at full speed with memfds. Is my understanding correct that if the last user of such a file descriptor closes it, the corresponding memory is freed? Accordingly, memfd cannot be used to keep data in memory while no program is currently using it, can it? To be able to do this you need again some representation of the data in a file? Yes, you can use a tmpfs to keep the file content in memory as well, or some DAX filesystem to keep the file content in NVM, but this always requires that such filesystems are mounted in the system that the application is currently running on. VAS segments on the other side would provide a functionality to achieve the same without the need of any mounted filesystem. However, I agree, that this is just a small advantage compared to what can already be achieved with the existing functionality provided by the Linux kernel. I probably need to revisit the whole idea of first class virtual address space segments before continuing with this pacthset. Thank you very much for the great feedback. > >> >> Ick. Please don't do this. Can we please keep an mm as just an mm > >> >> and not make it look magically different depending on which process > >> >> maps it? If you need a trampoline (which you do, of course), just > >> >> write a trampoline in regular user code and map it manually. > >> > > >> > Did I understand you correctly that you are proposing that the switching > >> > thread > >> > should make sure by itself that its code, stack, … memory regions are > >> > properly setup > >> > in the new AS before/after switching into it? I think, this would make > >> > using first > >> > class virtual address spaces much more difficult for user applications > >> > to the extend > >> > that I am not even sure if they can be used at all. At the moment, > >> > switching into a > >> > VAS is a very simple operation for an application because the kernel > >> > will just simply > >> > do the right thing. > >> > >> Yes. I think that having the same mm_struct look different from > >> different tasks is problematic. Getting it right in the arch code is > >> going to be nasty. The heuristics of what to share are also tough -- > >> why would text + data + stack or whatever you're doing be adequate? > >> What if you're in a thread? What if two tasks have their stacks in > >> the same place? > > > > The different ASes that a task now can have when it uses first class > > virtual address > > spaces are not realized in the kernel by using only one mm_struct per task > > that just > > looks differently but by using multiple mm_structs - one for each AS that > > the task > > can execute in. When a task attaches a first class virtual address space to > > itself to > > be able to use another AS, the kernel adds a temporary mm_struct to this > > task that > > contains the mappings of the first class virtual address space and the one > > shared > > with the task's original AS. If a thread now wants to switch into this > > attached first > > class virtual address space the kernel only changes the 'mm' and > > 'active_mm' pointers > > in the task_struct of the thread to the temporary mm_struct and performs the > > corresponding mm_switch operation. The original mm_struct of the thread > > will not be > > changed. > > > > Accordingly, I do not magically make mm_structs look differently depending > > on the > > task that uses it, but create temporary mm_structs that only contain > > mappings to the > > same memory regions. > > This sounds complicated and fragile. What happens if a heuristically > shared region coincides with a region in the "first class address > space" being selected? If such a conflict happens, the task cannot use the first class address space and the corresponding system call will return an error.
Re: [RFC PATCH 00/13] Introduce first class virtual address spaces
On Wed, Mar 15, 2017 at 09:51:31AM -0700, Andy Lutomirski wrote: > > VAS segments on the other side allow sharing of pure in memory data by > > arbitrary related tasks without the need of a file. This becomes especially > > interesting if one combines VAS segments with non-volatile memory since one > > can keep > > data structures in the NVM and still be able to share them between multiple > > tasks. > > What's wrong with regular mmap? I think it's the usual misunderstandings about how to use mmap. >From the paper: Memory-centric computing demands careful organization of the virtual address space, but interfaces such as mmap only give limited control. Some systems do not support creation of address regions at specific offsets. In Linux, for example, mmap does not safely abort if a request is made to open a region of memory over an existing region; it simply writes over it. The correct answer of course, is "Don't specify MAP_FIXED". Specify the 'hint' address, and if you don't get it, either fix up your data structure pointers, or just abort and complain noisily.
Re: [RFC PATCH 00/13] Introduce first class virtual address spaces
On Tue, Mar 14, 2017 at 9:12 AM, Till Smejkalwrote: > On Mon, 13 Mar 2017, Andy Lutomirski wrote: >> On Mon, Mar 13, 2017 at 7:07 PM, Till Smejkal >> wrote: >> > On Mon, 13 Mar 2017, Andy Lutomirski wrote: >> >> This sounds rather complicated. Getting TLB flushing right seems >> >> tricky. Why not just map the same thing into multiple mms? >> > >> > This is exactly what happens at the end. The memory region that is >> > described by the >> > VAS segment will be mapped in the ASes that use the segment. >> >> So why is this kernel feature better than just doing MAP_SHARED >> manually in userspace? > > One advantage of VAS segments is that they can be globally queried by user > programs > which means that VAS segments can be shared by applications that not > necessarily have > to be related. If I am not mistaken, MAP_SHARED of pure in memory data will > only work > if the tasks that share the memory region are related (aka. have a common > parent that > initialized the shared mapping). Otherwise, the shared mapping have to be > backed by a > file. What's wrong with memfd_create()? > VAS segments on the other side allow sharing of pure in memory data by > arbitrary related tasks without the need of a file. This becomes especially > interesting if one combines VAS segments with non-volatile memory since one > can keep > data structures in the NVM and still be able to share them between multiple > tasks. What's wrong with regular mmap? > >> >> Ick. Please don't do this. Can we please keep an mm as just an mm >> >> and not make it look magically different depending on which process >> >> maps it? If you need a trampoline (which you do, of course), just >> >> write a trampoline in regular user code and map it manually. >> > >> > Did I understand you correctly that you are proposing that the switching >> > thread >> > should make sure by itself that its code, stack, … memory regions are >> > properly setup >> > in the new AS before/after switching into it? I think, this would make >> > using first >> > class virtual address spaces much more difficult for user applications to >> > the extend >> > that I am not even sure if they can be used at all. At the moment, >> > switching into a >> > VAS is a very simple operation for an application because the kernel will >> > just simply >> > do the right thing. >> >> Yes. I think that having the same mm_struct look different from >> different tasks is problematic. Getting it right in the arch code is >> going to be nasty. The heuristics of what to share are also tough -- >> why would text + data + stack or whatever you're doing be adequate? >> What if you're in a thread? What if two tasks have their stacks in >> the same place? > > The different ASes that a task now can have when it uses first class virtual > address > spaces are not realized in the kernel by using only one mm_struct per task > that just > looks differently but by using multiple mm_structs - one for each AS that the > task > can execute in. When a task attaches a first class virtual address space to > itself to > be able to use another AS, the kernel adds a temporary mm_struct to this task > that > contains the mappings of the first class virtual address space and the one > shared > with the task's original AS. If a thread now wants to switch into this > attached first > class virtual address space the kernel only changes the 'mm' and 'active_mm' > pointers > in the task_struct of the thread to the temporary mm_struct and performs the > corresponding mm_switch operation. The original mm_struct of the thread will > not be > changed. > > Accordingly, I do not magically make mm_structs look differently depending on > the > task that uses it, but create temporary mm_structs that only contain mappings > to the > same memory regions. This sounds complicated and fragile. What happens if a heuristically shared region coincides with a region in the "first class address space" being selected? I think the right solution is "you're a user program playing virtual address games -- make sure you do it right". --Andy
Re: [PATCH v3 0/7] Remove unneeded build directory traversals
On Wed, Mar 15, 2017 at 10:15 PM, Andrew F. Daviswrote: > On 03/15/2017 04:03 PM, Arnd Bergmann wrote: >> On Wed, Mar 15, 2017 at 5:37 PM, Andrew F. Davis wrote: >>> Hello all, >>> >>> I was building a kernel for x86 and noticed Make still descended into >>> directories like drivers/gpu/drm/hisilicon, this seems kind of odd given >>> nothing will be built here. It looks to be due to some directories being >>> included in obj-y unconditionally instead of only when the relevant >>> CONFIG_ is set. >>> >>> These patches are split by subsystem in-case, for some reason, a file in >>> a directory does need to be built, I believe I have checked for all >>> instances of this, but a quick review from some maintainers would be nice. >> >> I didn't see anything wrong with the patches, and made sure that there >> are no tristate symbols controlling the subdirectory for anything that >> requires a built-in driver (which would cause a link failure). >> >> I'm not sure about drivers/lguest, which has some special magic >> in its Makefile, it's possible that this now fails with CONFIG_LGUEST=m. >> > > lguest and mmc are the strange ones, so I put them last in the series in > case they did need to be dropped. > > lguest was supposed to have been taken from v1: > https://lkml.org/lkml/2016/6/20/1086 > but it looks like it didn't so I re-introduced it for v3. > > mmc caught some 0-day build warnings but I never got to the bottom of them. Ah, I see now what happened to mmc: obj-$(subst m,y,$(CONFIG_MMC)) += host/ tmio_mmc_core-$(subst m,y,$(CONFIG_MMC_SDHI)) += tmio_mmc_dma.o obj-$(subst m,y,$(CONFIG_MMC_SDHCI_PCI))+= sdhci-pci-data.o with CONFIG_MMC=m, this will fail to build the built-in files in drivers/mmc/host. I suppose this could be expressed in a different way these days, but dropping the patch would be easier. Arnd
Re: [PATCH v3 0/7] Remove unneeded build directory traversals
On Wed, Mar 15, 2017 at 5:37 PM, Andrew F. Daviswrote: > Hello all, > > I was building a kernel for x86 and noticed Make still descended into > directories like drivers/gpu/drm/hisilicon, this seems kind of odd given > nothing will be built here. It looks to be due to some directories being > included in obj-y unconditionally instead of only when the relevant > CONFIG_ is set. > > These patches are split by subsystem in-case, for some reason, a file in > a directory does need to be built, I believe I have checked for all > instances of this, but a quick review from some maintainers would be nice. I didn't see anything wrong with the patches, and made sure that there are no tristate symbols controlling the subdirectory for anything that requires a built-in driver (which would cause a link failure). I'm not sure about drivers/lguest, which has some special magic in its Makefile, it's possible that this now fails with CONFIG_LGUEST=m. Arnd
Re: [GIT PULL 00/19] perf/core improvements and fixes
* Arnaldo Carvalho de Melowrote: > Hi Ingo, > > Please consider pulling, > > - Arnaldo > > Test results at the end of this message, as usual. > > The following changes since commit 84e5b549214f2160c12318aac549de85f600c79a: > > Merge tag 'perf-core-for-mingo-4.11-20170306' of > git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/core > (2017-03-07 08:14:14 +0100) > > are available in the git repository at: > > git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git > tags/perf-core-for-mingo-4.12-20170314 > > for you to fetch changes up to 5f6bee34707973ea7879a7857fd63ddccc92fff3: > > kprobes: Convert kprobe_exceptions_notify to use NOKPROBE_SYMBOL > (2017-03-14 15:17:40 -0300) > > > perf/core improvements and fixes: > > New features: > > - Add PERF_RECORD_NAMESPACES so that the kernel can record information > required to associate samples to namespaces, helping in container > problem characterization. > > Now the 'perf record has a --namespace' option to ask for such info, > and when present, it can be used, initially, via a new sort order, > 'cgroup_id', allowing histogram entry bucketization by a (device, inode) > based cgroup identifier (Hari Bathini) > > - Add --next option to 'perf sched timehist', showing what is the next > thread to run (Brendan Gregg) > > Fixes: > > - Fix segfault with basic block 'cycles' sort dimension (Changbin Du) > > - Add c2c to command-list.txt, making it appear in the 'perf help' > output (Changbin Du) > > - Fix zeroing of 'abs_path' variable in the perf hists browser switch > file code (Changbin Du) > > - Hide tips messages when -q/--quiet is given to 'perf report' (Namhyung Kim) > > Infrastructure: > > - Use ref_reloc_sym + offset to setup kretprobes (Naveen Rao) > > - Ignore generated files pmu-events/{jevents,pmu-events.c} for git (Changbin > Du) > > Documentation: > > - Document +field style argument support for --field option (Changbin Du) > > - Clarify 'perf c2c --stats' help message (Namhyung Kim) > > Signed-off-by: Arnaldo Carvalho de Melo > > > Brendan Gregg (1): > perf sched timehist: Add --next option > > Changbin Du (5): > perf tools: Missing c2c command in command-list > perf tools: Ignore generated files pmu-events/{jevents,pmu-events.c} > for git > perf sort: Fix segfault with basic block 'cycles' sort dimension > perf report: Document +field style argument support for --field option > perf hists browser: Fix typo in function switch_data_file > > Hari Bathini (5): > perf: Add PERF_RECORD_NAMESPACES to include namespaces related info > perf tools: Add PERF_RECORD_NAMESPACES to include namespaces related > info > perf record: Synthesize namespace events for current processes > perf script: Add script print support for namespace events > perf tools: Add 'cgroup_id' sort order keyword > > Namhyung Kim (3): > perf report: Hide tip message when -q option is given > perf c2c: Clarify help message of --stats option > perf c2c: Fix display bug when using pipe > > Naveen N. Rao (5): > perf probe: Factor out the ftrace README scanning > perf kretprobes: Offset from reloc_sym if kernel supports it > perf powerpc: Choose local entry point with kretprobes > doc: trace/kprobes: add information about NOKPROBE_SYMBOL > kprobes: Convert kprobe_exceptions_notify to use NOKPROBE_SYMBOL > > Documentation/trace/kprobetrace.txt | 5 +- > include/linux/perf_event.h | 2 + > include/uapi/linux/perf_event.h | 32 +- > kernel/events/core.c| 139 ++ > kernel/fork.c | 2 + > kernel/kprobes.c| 5 +- > kernel/nsproxy.c| 3 + > tools/include/uapi/linux/perf_event.h | 32 +- > tools/perf/.gitignore | 2 + > tools/perf/Documentation/perf-record.txt| 3 + > tools/perf/Documentation/perf-report.txt| 7 +- > tools/perf/Documentation/perf-sched.txt | 4 + > tools/perf/Documentation/perf-script.txt| 3 + > tools/perf/arch/powerpc/util/sym-handling.c | 14 ++- > tools/perf/builtin-annotate.c | 1 + > tools/perf/builtin-c2c.c| 4 +- > tools/perf/builtin-diff.c | 1 + > tools/perf/builtin-inject.c | 13 +++ > tools/perf/builtin-kmem.c | 1 + > tools/perf/builtin-kvm.c| 2 + > tools/perf/builtin-lock.c | 1 + > tools/perf/builtin-mem.c| 1 + > tools/perf/builtin-record.c | 35 ++- > tools/perf/builtin-report.c
Re: ioctl structs differ from x86_64?
On Wed, Mar 15, 2017 at 01:11:19PM -0500, Reza Arbab wrote: https://groups.google.com/forum/#!topic/golang-nuts/K5NoG8slez0 Oops. https://groups.google.com/d/msg/golang-nuts/K5NoG8slez0/mixUse17iaMJ -- Reza Arbab
Re: [PATCH] drivers/pcmcia: NO_IRQ removal for electra_cf.c
On Wed, 2017-03-15 at 16:35 +1100, Michael Ellerman wrote: > Arnd Bergmannwrites: > > > > > On Tue, Mar 14, 2017 at 11:51 AM, Michael Ellerman > > wrote: > > > > > > Michael Ellerman writes: > > > > > > > > > > > We'd like to eventually remove NO_IRQ on powerpc, so remove usages of > > > > it > > > > from electra_cf.c which is a powerpc-only driver. > > > > > > > > Signed-off-by: Michael Ellerman > > > > --- > > > > drivers/pcmcia/electra_cf.c | 4 ++-- > > > > 1 file changed, 2 insertions(+), 2 deletions(-) > > > Ping anyone? > > > > > > Or should I merge this via the powerpc tree? > > That's what I would recommend for a powerpc specific pcmcia driver, yes. > Suits me. > > > > > Looking at the bigger picture of powerpc drivers using NO_IRQ, I also > > see these others: > > > > drivers/ata/pata_mpc52xx.c: if (ata_irq == NO_IRQ) { > > drivers/ata/sata_dwc_460ex.c:#ifndef NO_IRQ > > drivers/ata/sata_dwc_460ex.c:#define NO_IRQ 0 > > drivers/ata/sata_dwc_460ex.c: if (hsdev->dma->irq == NO_IRQ) { > > drivers/ata/sata_dwc_460ex.c: if (irq == NO_IRQ) { > > drivers/iommu/fsl_pamu.c: if (irq == NO_IRQ) { > > drivers/iommu/fsl_pamu.c: if (irq != NO_IRQ) > > drivers/media/platform/fsl-viu.c: if (viu_irq == NO_IRQ) { > > drivers/mtd/nand/mpc5121_nfc.c: if (prv->irq == NO_IRQ) { > > drivers/pcmcia/electra_cf.c:cf->irq = NO_IRQ; > > drivers/pcmcia/electra_cf.c:if (cf->irq != NO_IRQ) > > drivers/soc/fsl/qe/qe_ic.c:/* Return an interrupt vector or NO_IRQ if > > no interrupt is pending. */ > > drivers/soc/fsl/qe/qe_ic.c: return NO_IRQ; > > drivers/soc/fsl/qe/qe_ic.c:/* Return an interrupt vector or NO_IRQ if > > no interrupt is pending. */ > > drivers/soc/fsl/qe/qe_ic.c: return NO_IRQ; > > drivers/soc/fsl/qe/qe_ic.c: if (qe_ic->virq_low == NO_IRQ) { > > drivers/soc/fsl/qe/qe_ic.c: if (qe_ic->virq_high != NO_IRQ && > > drivers/spi/spi-mpc52xx.c: if (status && (irq != NO_IRQ)) > > drivers/tty/ehv_bytechan.c: if (stdout_irq == NO_IRQ) { > > drivers/tty/ehv_bytechan.c: if ((bc->rx_irq == NO_IRQ) || > > (bc->tx_irq == NO_IRQ)) { > > drivers/tty/serial/cpm_uart/cpm_uart_core.c:if (pinfo->port.irq == > > NO_IRQ) { > > drivers/uio/uio_fsl_elbc_gpcm.c:if (irq != NO_IRQ) { > > drivers/uio/uio_fsl_elbc_gpcm.c:irq = NO_IRQ; > > drivers/uio/uio_fsl_elbc_gpcm.c: irq != NO_IRQ ? irq : > > -1); > > drivers/usb/host/ehci-grlib.c: if (irq == NO_IRQ) { > > drivers/usb/host/ehci-ppc-of.c: if (irq == NO_IRQ) { > > drivers/usb/host/fhci-hcd.c:if (usb_irq == NO_IRQ) { > > drivers/usb/host/ohci-ppc-of.c: if (irq == NO_IRQ) { > > drivers/usb/host/uhci-grlib.c: if (irq == NO_IRQ) { > > drivers/video/fbdev/mb862xx/mb862xxfbdrv.c: if (par->irq == NO_IRQ) { > > drivers/virt/fsl_hypervisor.c: if (!handle || (irq == NO_IRQ)) { > > include/soc/fsl/qe/qe_ic.h: if (cascade_irq != NO_IRQ) > > include/soc/fsl/qe/qe_ic.h: if (cascade_irq != NO_IRQ) > > include/soc/fsl/qe/qe_ic.h: if (cascade_irq != NO_IRQ) > > include/soc/fsl/qe/qe_ic.h: if (cascade_irq != NO_IRQ) > > include/soc/fsl/qe/qe_ic.h: if (cascade_irq == NO_IRQ) > > include/soc/fsl/qe/qe_ic.h: if (cascade_irq != NO_IRQ) > > > > Did you have other pending patches for those? > No. I stayed away from anything FSL related as I was under the > impression some of them were being ported to arch/arm, which uses -1 for > NO_IRQ IIUIC. > > eg. all of include/soc/fsl and drivers/soc/fsl was moved from > arch/powerpc in commit 7aa1aa6ecec2, which said: > > QE: Move QE from arch/powerpc to drivers/soc > > ls1 has qe and ls1 has arm cpu. > move qe from arch/powerpc to drivers/soc/fsl > to adapt to powerpc and arm > > But looking at the Kconfigs it looks like they're still only selectable > on PPC. So that's a bit annoying. > > I'll do patches for everything above that's not drivers/soc or > include/soc and hopefully we can hear from someone at NXP on the plans > for getting the soc parts enabled on arm. qe_ic is handled by https://lkml.org/lkml/2017/3/13/1234 -Scott
Applied "ASoC: fsl: constify snd_soc_ops structures" to the asoc tree
The patch ASoC: fsl: constify snd_soc_ops structures has been applied to the asoc tree at git://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound.git All being well this means that it will be integrated into the linux-next tree (usually sometime in the next 24 hours) and sent to Linus during the next merge window (or sooner if it is a bug fix), however if problems are discovered then the patch may be dropped or reverted. You may get further e-mails resulting from automated or manual testing and review of the tree, please engage with people reporting problems and send followup patches addressing any issues that are reported if needed. If any updates are required or you are submitting further changes they should be sent as incremental updates against current git, existing patches will not be replaced. Please add any relevant lists and maintainers to the CCs when replying to this mail. Thanks, Mark >From 5ace37bd7947e28dec5559a57ddc6e1d997dbec5 Mon Sep 17 00:00:00 2001 From: Bhumika GoyalDate: Tue, 14 Mar 2017 00:42:22 +0530 Subject: [PATCH] ASoC: fsl: constify snd_soc_ops structures Declare snd_soc_ops structures as const as they are only stored in the ops field of a snd_soc_dai_link structure. This field is of type const, so snd_soc_ops structures having this property can be made const too. The following .o files did not compile: sound/soc/fsl/{p1022_rdk.c/p1022_ds.c/mpc8610_hpcd.c} Signed-off-by: Bhumika Goyal Signed-off-by: Mark Brown --- sound/soc/fsl/eukrea-tlv320.c | 2 +- sound/soc/fsl/imx-mc13783.c | 2 +- sound/soc/fsl/mpc8610_hpcd.c| 2 +- sound/soc/fsl/mx27vis-aic32x4.c | 2 +- sound/soc/fsl/p1022_ds.c| 2 +- sound/soc/fsl/p1022_rdk.c | 2 +- sound/soc/fsl/phycore-ac97.c| 2 +- sound/soc/fsl/wm1133-ev1.c | 2 +- 8 files changed, 8 insertions(+), 8 deletions(-) diff --git a/sound/soc/fsl/eukrea-tlv320.c b/sound/soc/fsl/eukrea-tlv320.c index 883087f2b092..84ef6385736c 100644 --- a/sound/soc/fsl/eukrea-tlv320.c +++ b/sound/soc/fsl/eukrea-tlv320.c @@ -64,7 +64,7 @@ static int eukrea_tlv320_hw_params(struct snd_pcm_substream *substream, return 0; } -static struct snd_soc_ops eukrea_tlv320_snd_ops = { +static const struct snd_soc_ops eukrea_tlv320_snd_ops = { .hw_params = eukrea_tlv320_hw_params, }; diff --git a/sound/soc/fsl/imx-mc13783.c b/sound/soc/fsl/imx-mc13783.c index bb0459018b45..9d19b808f634 100644 --- a/sound/soc/fsl/imx-mc13783.c +++ b/sound/soc/fsl/imx-mc13783.c @@ -48,7 +48,7 @@ static int imx_mc13783_hifi_hw_params(struct snd_pcm_substream *substream, return snd_soc_dai_set_tdm_slot(cpu_dai, 0x3, 0x3, 2, 16); } -static struct snd_soc_ops imx_mc13783_hifi_ops = { +static const struct snd_soc_ops imx_mc13783_hifi_ops = { .hw_params = imx_mc13783_hifi_hw_params, }; diff --git a/sound/soc/fsl/mpc8610_hpcd.c b/sound/soc/fsl/mpc8610_hpcd.c index ddf49f30b23f..a639b52c16f6 100644 --- a/sound/soc/fsl/mpc8610_hpcd.c +++ b/sound/soc/fsl/mpc8610_hpcd.c @@ -174,7 +174,7 @@ static int mpc8610_hpcd_machine_remove(struct snd_soc_card *card) /** * mpc8610_hpcd_ops: ASoC machine driver operations */ -static struct snd_soc_ops mpc8610_hpcd_ops = { +static const struct snd_soc_ops mpc8610_hpcd_ops = { .startup = mpc8610_hpcd_startup, }; diff --git a/sound/soc/fsl/mx27vis-aic32x4.c b/sound/soc/fsl/mx27vis-aic32x4.c index 198eeb3f3f7a..d7ec3d20065c 100644 --- a/sound/soc/fsl/mx27vis-aic32x4.c +++ b/sound/soc/fsl/mx27vis-aic32x4.c @@ -73,7 +73,7 @@ static int mx27vis_aic32x4_hw_params(struct snd_pcm_substream *substream, return 0; } -static struct snd_soc_ops mx27vis_aic32x4_snd_ops = { +static const struct snd_soc_ops mx27vis_aic32x4_snd_ops = { .hw_params = mx27vis_aic32x4_hw_params, }; diff --git a/sound/soc/fsl/p1022_ds.c b/sound/soc/fsl/p1022_ds.c index a1f780ecadf5..41c623c55c16 100644 --- a/sound/soc/fsl/p1022_ds.c +++ b/sound/soc/fsl/p1022_ds.c @@ -184,7 +184,7 @@ static int p1022_ds_machine_remove(struct snd_soc_card *card) /** * p1022_ds_ops: ASoC machine driver operations */ -static struct snd_soc_ops p1022_ds_ops = { +static const struct snd_soc_ops p1022_ds_ops = { .startup = p1022_ds_startup, }; diff --git a/sound/soc/fsl/p1022_rdk.c b/sound/soc/fsl/p1022_rdk.c index d4d88a8cb9c0..4afbdd610bfa 100644 --- a/sound/soc/fsl/p1022_rdk.c +++ b/sound/soc/fsl/p1022_rdk.c @@ -188,7 +188,7 @@ static int p1022_rdk_machine_remove(struct snd_soc_card *card) /** * p1022_rdk_ops: ASoC machine driver operations */ -static struct snd_soc_ops p1022_rdk_ops = { +static const struct snd_soc_ops p1022_rdk_ops = { .startup = p1022_rdk_startup, }; diff --git a/sound/soc/fsl/phycore-ac97.c b/sound/soc/fsl/phycore-ac97.c index ae403c29688f..66fb6c4614d2 100644 --- a/sound/soc/fsl/phycore-ac97.c +++ b/sound/soc/fsl/phycore-ac97.c @@ -23,7 +23,7 @@ static struct
Re: ioctl structs differ from x86_64?
On Tue, Mar 14, 2017 at 10:37:44AM +, Harshal Patil wrote: Our guess is the ioctls in ppc64le differ than x86_64, and thats why the code which is clearing onclr bit ([4]https://github.com/opencontainers/runc/commit/eea28f480db435dbef4a275de9776b9934818b8c#diff-5f5c07d0cab3ce2086437d3d43c0d25fR164) is failing on ppc64le but works fine on x86_64. Any pointers on the possible solution would be really helpful. This looks like a bug in Go. The syscall.TCGETS and syscall.TCSETS constants have the wrong values. They were generated using the glibc termios struct instead of the kernel termios struct. It's the issue described here: https://groups.google.com/forum/#!topic/golang-nuts/K5NoG8slez0 Things work if you replace syscall.TCGETS with 0x402c7413 and syscall.TCSETS with 0x802c7414, the correct values on ppc64le. -- Reza Arbab
[PATCH v3 4/7] macintosh: Only descend into directory when CONFIG_MACINTOSH_DRIVERS is set
When CONFIG_MACINTOSH_DRIVERS is not set make will still descend into the macintosh directory but nothing will be built. This produces unneeded build artifacts and messages in addition to slowing the build. Fix this here. Signed-off-by: Andrew F. Davis--- drivers/Makefile | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/Makefile b/drivers/Makefile index 9cf52524ecab..cd92491bd76b 100644 --- a/drivers/Makefile +++ b/drivers/Makefile @@ -78,7 +78,7 @@ obj-$(CONFIG_LIBNVDIMM) += nvdimm/ obj-$(CONFIG_DEV_DAX) += dax/ obj-$(CONFIG_DMA_SHARED_BUFFER) += dma-buf/ obj-$(CONFIG_NUBUS)+= nubus/ -obj-y += macintosh/ +obj-$(CONFIG_MACINTOSH_DRIVERS)+= macintosh/ obj-$(CONFIG_IDE) += ide/ obj-$(CONFIG_SCSI) += scsi/ obj-y += nvme/ -- 2.11.0
[PATCH v3 6/7] lguest: Only descend into lguest directory when CONFIG_LGUEST is set
When CONFIG_LGUEST is not set make will still descend into the lguest directory but nothing will be built. This produces unneeded build artifacts and messages in addition to slowing the build. Fix this here. Signed-off-by: Andrew F. Davis--- drivers/Makefile | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/Makefile b/drivers/Makefile index 8a9ed5c59778..3d758020d248 100644 --- a/drivers/Makefile +++ b/drivers/Makefile @@ -126,7 +126,7 @@ obj-$(CONFIG_ACCESSIBILITY) += accessibility/ obj-$(CONFIG_ISDN) += isdn/ obj-$(CONFIG_EDAC) += edac/ obj-$(CONFIG_EISA) += eisa/ -obj-y += lguest/ +obj-$(CONFIG_LGUEST) += lguest/ obj-$(CONFIG_CPU_FREQ) += cpufreq/ obj-$(CONFIG_CPU_IDLE) += cpuidle/ obj-y += mmc/ -- 2.11.0
[PATCH v3 0/7] Remove unneeded build directory traversals
Hello all, I was building a kernel for x86 and noticed Make still descended into directories like drivers/gpu/drm/hisilicon, this seems kind of odd given nothing will be built here. It looks to be due to some directories being included in obj-y unconditionally instead of only when the relevant CONFIG_ is set. These patches are split by subsystem in-case, for some reason, a file in a directory does need to be built, I believe I have checked for all instances of this, but a quick review from some maintainers would be nice. Thanks, Andrew Changes from v2: - Removed patches that would not work - Rebased on v4.11-rc1 (no changes needed) Changes from v1: - Removed patches already taken by maintainers - Rebased on v4.10-rc1 (no changes needed) Andrew F. Davis (7): pwm: Only descend into pwm directory when CONFIG_PWM is set amba: Only descend into amba directory when CONFIG_ARM_AMBA is set NFC: Only descend into nfc directory when CONFIG_NFC is set macintosh: Only descend into directory when CONFIG_MACINTOSH_DRIVERS is set auxdisplay: Only descend into directory when CONFIG_AUXDISPLAY is set lguest: Only descend into lguest directory when CONFIG_LGUEST is set mmc: Only descend into mmc directory when CONFIG_MMC is set drivers/Makefile | 18 +++--- 1 file changed, 11 insertions(+), 7 deletions(-) -- 2.11.0
[PATCH v3 7/7] mmc: Only descend into mmc directory when CONFIG_MMC is set
When CONFIG_MMC is not set make will still descend into the mmc directory but nothing will be built. This produces unneeded build artifacts and messages in addition to slowing the build. Fix this here. Signed-off-by: Andrew F. Davis--- drivers/Makefile | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/Makefile b/drivers/Makefile index 3d758020d248..03da7a38e989 100644 --- a/drivers/Makefile +++ b/drivers/Makefile @@ -129,7 +129,7 @@ obj-$(CONFIG_EISA) += eisa/ obj-$(CONFIG_LGUEST) += lguest/ obj-$(CONFIG_CPU_FREQ) += cpufreq/ obj-$(CONFIG_CPU_IDLE) += cpuidle/ -obj-y += mmc/ +obj-$(CONFIG_MMC) += mmc/ obj-$(CONFIG_MEMSTICK) += memstick/ obj-$(CONFIG_NEW_LEDS) += leds/ obj-$(CONFIG_INFINIBAND) += infiniband/ -- 2.11.0
[PATCH v3 3/7] NFC: Only descend into nfc directory when CONFIG_NFC is set
When CONFIG_NFC is not set make will still descend into the nfc directory but nothing will be built. This produces unneeded build artifacts and messages in addition to slowing the build. Fix this here. Signed-off-by: Andrew F. Davis--- drivers/Makefile | 6 +- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/drivers/Makefile b/drivers/Makefile index 23712a92a89a..9cf52524ecab 100644 --- a/drivers/Makefile +++ b/drivers/Makefile @@ -69,7 +69,11 @@ obj-$(CONFIG_FB_INTEL) += video/fbdev/intelfb/ obj-$(CONFIG_PARPORT) += parport/ obj-$(CONFIG_NVM) += lightnvm/ -obj-y += base/ block/ misc/ mfd/ nfc/ +obj-y += base/ +obj-y += block/ +obj-y += misc/ +obj-y += mfd/ +obj-$(CONFIG_NFC) += nfc/ obj-$(CONFIG_LIBNVDIMM)+= nvdimm/ obj-$(CONFIG_DEV_DAX) += dax/ obj-$(CONFIG_DMA_SHARED_BUFFER) += dma-buf/ -- 2.11.0
[PATCH v3 2/7] amba: Only descend into amba directory when CONFIG_ARM_AMBA is set
When CONFIG_ARM_AMBA is not set make will still descend into the amba directory but nothing will be built. This produces unneeded build artifacts and messages in addition to slowing the build. Fix this here. Signed-off-by: Andrew F. Davis--- drivers/Makefile | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/Makefile b/drivers/Makefile index 00d86749a843..23712a92a89a 100644 --- a/drivers/Makefile +++ b/drivers/Makefile @@ -31,7 +31,7 @@ obj-$(CONFIG_SFI) += sfi/ # PnP must come after ACPI since it will eventually need to check if acpi # was used and do nothing if so obj-$(CONFIG_PNP) += pnp/ -obj-y += amba/ +obj-$(CONFIG_ARM_AMBA) += amba/ obj-y += clk/ # Many drivers will want to use DMA so this has to be made available -- 2.11.0
[PATCH v3 1/7] pwm: Only descend into pwm directory when CONFIG_PWM is set
When CONFIG_PWM is not set make will still descend into the pwm directory but nothing will be built. This produces unneeded build artifacts and messages in addition to slowing the build. Fix this here. Signed-off-by: Andrew F. Davis--- drivers/Makefile | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/Makefile b/drivers/Makefile index 2eced9afba53..00d86749a843 100644 --- a/drivers/Makefile +++ b/drivers/Makefile @@ -13,7 +13,7 @@ obj-$(CONFIG_GENERIC_PHY) += phy/ # GPIO must come after pinctrl as gpios may need to mux pins etc obj-$(CONFIG_PINCTRL) += pinctrl/ obj-$(CONFIG_GPIOLIB) += gpio/ -obj-y += pwm/ +obj-$(CONFIG_PWM) += pwm/ obj-$(CONFIG_PCI) += pci/ # PCI dwc controller drivers obj-y += pci/dwc/ -- 2.11.0
[PATCH v3 5/7] auxdisplay: Only descend into directory when CONFIG_AUXDISPLAY is set
When CONFIG_AUXDISPLAY is not set make will still descend into the auxdisplay directory but nothing will be built. This produces unneeded build artifacts and messages in addition to slowing the build. Fix this here. Signed-off-by: Andrew F. Davis--- drivers/Makefile | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/Makefile b/drivers/Makefile index cd92491bd76b..8a9ed5c59778 100644 --- a/drivers/Makefile +++ b/drivers/Makefile @@ -95,7 +95,7 @@ obj-y += firewire/ obj-$(CONFIG_UIO) += uio/ obj-$(CONFIG_VFIO) += vfio/ obj-y += cdrom/ -obj-y += auxdisplay/ +obj-$(CONFIG_AUXDISPLAY) += auxdisplay/ obj-$(CONFIG_PCCARD) += pcmcia/ obj-$(CONFIG_DIO) += dio/ obj-$(CONFIG_SBUS) += sbus/ -- 2.11.0
Re: [PATCH kernel v8 10/10] KVM: PPC: VFIO: Add in-kernel acceleration for VFIO
On Thu, 16 Mar 2017 00:21:07 +1100 Alexey Kardashevskiywrote: > On 15/03/17 08:05, Alex Williamson wrote: > > On Fri, 10 Mar 2017 14:53:37 +1100 > > Alexey Kardashevskiy wrote: > > > >> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT > >> and H_STUFF_TCE requests targeted an IOMMU TCE table used for VFIO > >> without passing them to user space which saves time on switching > >> to user space and back. > >> > >> This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM. > >> KVM tries to handle a TCE request in the real mode, if failed > >> it passes the request to the virtual mode to complete the operation. > >> If it a virtual mode handler fails, the request is passed to > >> the user space; this is not expected to happen though. > >> > >> To avoid dealing with page use counters (which is tricky in real mode), > >> this only accelerates SPAPR TCE IOMMU v2 clients which are required > >> to pre-register the userspace memory. The very first TCE request will > >> be handled in the VFIO SPAPR TCE driver anyway as the userspace view > >> of the TCE table (iommu_table::it_userspace) is not allocated till > >> the very first mapping happens and we cannot call vmalloc in real mode. > >> > >> If we fail to update a hardware IOMMU table unexpected reason, we just > >> clear it and move on as there is nothing really we can do about it - > >> for example, if we hot plug a VFIO device to a guest, existing TCE tables > >> will be mirrored automatically to the hardware and there is no interface > >> to report to the guest about possible failures. > >> > >> This adds new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to > >> the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd > >> and associates a physical IOMMU table with the SPAPR TCE table (which > >> is a guest view of the hardware IOMMU table). The iommu_table object > >> is cached and referenced so we do not have to look up for it in real mode. > >> > >> This does not implement the UNSET counterpart as there is no use for it - > >> once the acceleration is enabled, the existing userspace won't > >> disable it unless a VFIO container is destroyed; this adds necessary > >> cleanup to the KVM_DEV_VFIO_GROUP_DEL handler. > >> > >> As this creates a descriptor per IOMMU table-LIOBN couple (called > >> kvmppc_spapr_tce_iommu_table), it is possible to have several > >> descriptors with the same iommu_table (hardware IOMMU table) attached > >> to the same LIOBN; we do not remove duplicates though as > >> iommu_table_ops::exchange not just update a TCE entry (which is > >> shared among IOMMU groups) but also invalidates the TCE cache > >> (one per IOMMU group). > >> > >> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user > >> space. > >> > >> This adds real mode version of WARN_ON_ONCE() as the generic version > >> causes problems with rcu_sched. Since we testing what vmalloc_to_phys() > >> returns in the code, this also adds a check for already existing > >> vmalloc_to_phys() call in kvmppc_rm_h_put_tce_indirect(). > >> > >> This finally makes use of vfio_external_user_iommu_id() which was > >> introduced quite some time ago and was considered for removal. > >> > >> Tests show that this patch increases transmission speed from 220MB/s > >> to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card). > >> > >> Signed-off-by: Alexey Kardashevskiy > >> --- > >> Changes: > >> v8: > >> * changed all (!pua) checks to return H_TOO_HARD as ioctl() is supposed > >> to handle them > >> * changed vmalloc_to_phys() callers to return H_HARDWARE > >> * changed real mode iommu_tce_xchg_rm() callers to return H_TOO_HARD > >> and added a comment about this in the code > >> * changed virtual mode iommu_tce_xchg() callers to return H_HARDWARE > >> and do WARN_ON > >> * added WARN_ON_ONCE_RM(!rmap) in kvmppc_rm_h_put_tce_indirect() to > >> have all vmalloc_to_phys() callsites covered > >> > >> v7: > >> * added realmode-friendly WARN_ON_ONCE_RM > >> > >> v6: > >> * changed handling of errors returned by kvmppc_(rm_)tce_iommu_(un)map() > >> * moved kvmppc_gpa_to_ua() to TCE validation > >> > >> v5: > >> * changed error codes in multiple places > >> * added bunch of WARN_ON() in places which should not really happen > >> * adde a check that an iommu table is not attached already to LIOBN > >> * dropped explicit calls to iommu_tce_clear_param_check/ > >> iommu_tce_put_param_check as kvmppc_tce_validate/kvmppc_ioba_validate > >> call them anyway (since the previous patch) > >> * if we fail to update a hardware IOMMU table for unexpected reason, > >> this just clears the entry > >> > >> v4: > >> * added note to the commit log about allowing multiple updates of > >> the same IOMMU table; > >> * instead of checking for if any memory was preregistered, this > >> returns H_TOO_HARD if a specific page was not; > >> * fixed comments from v3 about error handling in many places;
Re: [PATCH kernel v8 10/10] KVM: PPC: VFIO: Add in-kernel acceleration for VFIO
On Wed, 15 Mar 2017 15:40:14 +1100 David Gibsonwrote: > > > diff --git a/arch/powerpc/kvm/book3s_64_vio.c > > > b/arch/powerpc/kvm/book3s_64_vio.c > > > index e96a4590464c..be18cda01e1b 100644 > > > --- a/arch/powerpc/kvm/book3s_64_vio.c > > > +++ b/arch/powerpc/kvm/book3s_64_vio.c > > > @@ -28,6 +28,10 @@ > > > #include > > > #include > > > #include > > > +#include > > > +#include > > > +#include > > > +#include > > > > > > #include > > > #include > > > @@ -40,6 +44,36 @@ > > > #include > > > #include > > > #include > > > +#include > > > + > > > +static void kvm_vfio_group_put_external_user(struct vfio_group > > > *vfio_group) > > > +{ > > > + void (*fn)(struct vfio_group *); > > > + > > > + fn = symbol_get(vfio_group_put_external_user); > > > + if (WARN_ON(!fn)) > > > + return; > > > + > > > + fn(vfio_group); > > > + > > > + symbol_put(vfio_group_put_external_user); > > > +} > > > + > > > +static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group) > > > +{ > > > + int (*fn)(struct vfio_group *); > > > + int ret = -1; > > > + > > > + fn = symbol_get(vfio_external_user_iommu_id); > > > + if (!fn) > > > + return ret; > > > + > > > + ret = fn(vfio_group); > > > + > > > + symbol_put(vfio_external_user_iommu_id); > > > + > > > + return ret; > > > +} > > > > > > Ugh. This feels so wrong. Why can't you have kvm-vfio pass the > > iommu_group? Why do you need to hold this additional vfio_group > > reference? > > Keeping the vfio_group reference makes sense to me, since we don't > want the vfio context for the group to go away while it's attached to > the LIOBN. But there's already a reference for that, it's taken by KVM_DEV_VFIO_GROUP_ADD and held until KVM_DEV_VFIO_GROUP_DEL. Both the DEL path and the cleanup path call kvm_spapr_tce_release_iommu_group() before releasing that reference, so it seems entirely redundant. > However, going via the iommu_id rather than just having an interface > to directly grab the iommu group from the vfio_group seems bizarre to > me. I'm ok with cleaning that up later, however. We have kvm_spapr_tce_attach_iommu_group() and kvm_spapr_tce_release_iommu_group(), but both take a vfio_group, not an iommu_group as a parameter. I don't particularly have a problem with the vfio_group -> iommu ID -> iommu_group, but if we drop the extra vfio_group reference and pass the iommu_group itself to these functions then we can keep all the symbol reference stuff in the kvm-vfio glue layer. Thanks, Alex
RE: [PATCH 1/4] crypto: powerpc - Factor out the core CRC vpmsum algorithm
From: Linuxppc-dev Daniel Axtens > Sent: 15 March 2017 12:38 > The core nuts and bolts of the crc32c vpmsum algorithm will > also work for a number of other CRC algorithms with different > polynomials. Factor out the function into a new asm file. > > To handle multiple users of the function, a user simply > provides constants, defines the name of their CRC function, > and then #includes the core algorithm file. ... While not part of this change, the unrolled loops look as though they just destroy the cpu cache. I'd like be convinced that anything does CRC over long enough buffers to make it a gain at all. With modern (not that modern now) superscalar cpus you can often get the loop instructions 'for free'. Sometimes pipelining the loop is needed to get full throughput. Unlike the IP checksum, you don't even have to 'loop carry' the cpu carry flag. David
[PATCH] powerpc/pasemi, cbe: Do not process decremeter or external wakeup from powersave
Hi, I would like to start using a dedicated stack for system reset interrupt and treat it as a Linux nmi, which makes it tricky to call complex interrupt handlers directly from the system reset trap handler. So I would like to remove the decrementer and external handler calls from Cell and Pasemi platforms' system reset handler. I think we can just remove them if they can be handled when they re-fire as normal interrupts? At the moment I don't have environments set up to test if this works. This will make the powersave wakeup a bit slower. We could add alternate code to IDLETEST here to branch out to powersave wakeup handler like powernv does. I haven't got a patch for that yet but I could help write one if it is important and it can be tested. Thanks, Nick --- arch/powerpc/platforms/cell/pervasive.c | 11 +++ arch/powerpc/platforms/pasemi/idle.c| 11 +++ 2 files changed, 14 insertions(+), 8 deletions(-) diff --git a/arch/powerpc/platforms/cell/pervasive.c b/arch/powerpc/platforms/cell/pervasive.c index e7d075077cb0..a88944db9fc3 100644 --- a/arch/powerpc/platforms/cell/pervasive.c +++ b/arch/powerpc/platforms/cell/pervasive.c @@ -88,11 +88,14 @@ static void cbe_power_save(void) static int cbe_system_reset_exception(struct pt_regs *regs) { switch (regs->msr & SRR1_WAKEMASK) { - case SRR1_WAKEEE: - do_IRQ(regs); - break; case SRR1_WAKEDEC: - timer_interrupt(regs); + set_dec(1); + case SRR1_WAKEEE: + /* +* Handle these when interrupts get re-enabled and we take +* them as regular exceptions. We are in an NMI context +* and can't handle these here. +*/ break; case SRR1_WAKEMT: return cbe_sysreset_hack(); diff --git a/arch/powerpc/platforms/pasemi/idle.c b/arch/powerpc/platforms/pasemi/idle.c index 75b296bc51af..44e0d9226f0a 100644 --- a/arch/powerpc/platforms/pasemi/idle.c +++ b/arch/powerpc/platforms/pasemi/idle.c @@ -53,11 +53,14 @@ static int pasemi_system_reset_exception(struct pt_regs *regs) regs->nip = regs->link; switch (regs->msr & SRR1_WAKEMASK) { - case SRR1_WAKEEE: - do_IRQ(regs); - break; case SRR1_WAKEDEC: - timer_interrupt(regs); + set_dec(1); + case SRR1_WAKEEE: + /* +* Handle these when interrupts get re-enabled and we take +* them as regular exceptions. We are in an NMI context +* and can't handle these here. +*/ break; default: /* do system reset */ -- 2.11.0
Re: ioctl structs differ from x86_64?
On Tue, Mar 14, 2017 at 11:37 AM, Harshal Patilwrote: > Hello, > > I am looking into a bug, > https://bugzilla.linux.ibm.com/show_bug.cgi?id=152493 ( external mirror is > at, https://github.com/opencontainers/runc/issues/1364) > > Recently in runc code, they added this code > https://github.com/opencontainers/runc/commit/eea28f480db435dbef4a275de9776b9934818b8c#diff-5f5c07d0cab3ce2086437d3d43c0d25fR155. > As you can see they set -onlcr to get rid of \r (line no. 164). Golang, in > which runc is written, doesn't have any bindings for ioctls. This means you > have to invoke C code directly (that's what they are doing there). > > Our guess is the ioctls in ppc64le differ than x86_64, and thats why the > code which is clearing onclr bit > (https://github.com/opencontainers/runc/commit/eea28f480db435dbef4a275de9776b9934818b8c#diff-5f5c07d0cab3ce2086437d3d43c0d25fR164) > is failing on ppc64le but works fine on x86_64. > > Any pointers on the possible solution would be really helpful. There are a couple of reasons for ioctl numbers to differ: - like this one, the command number may be defined in asm/ioctls.h. Most architectures in the kernel are the same, some (arm, blackfin, frv, m68k, and s390) only differ in FIOQSIZE, others (alpha, mips, parisc, powerpc, sh, sparc, and xtensa) redefine all the traditional file (FIO*) and tty (TC*, TIO*) commands - command numbers that are defined in terms of structure sizes depend on the architectures type definitions (e.g. long, off_t, uid_t, ...) that can differ in both size and alignment - alpha, mips, powerpc and sparc use _IOC_SIZEBITS==13, everything else uses _IOC_SIZEBITS==14, and that in turn means that the majority of ioctl commands are different between those four and the rest. Arnd
Re: [PATCH v3 1/2] powerpc: split ftrace bits into a separate file
On 2017/03/15 09:11AM, Steven Rostedt wrote: > On Wed, 15 Mar 2017 14:35:16 +0530 > "Naveen N. Rao"wrote: > > > I don't have a strong opinion about this, but I feel that x86 can simply > > use ftrace_64.S, seeing as the current name is mcount_64.S. > > > > Other architectures could do something similar too, or fall back to > > ftrace_hook.S. That way, all ftrace low-level code can simply be > > referred to as arch/*/ftrace_*.S > > Just to clarify, I'm currently working on patches to clean up the > ftrace.S code in x86, and I renamed mcount_64.S to ftrace_64.S. I'm > also moving the ftrace code out of entry_32.S into a ftrace_32.S file > as well. Patches will hopefully be posted soon. Good to know, thanks for clarifying. In light of that, I hope that you're ok with the changes proposed to the ftrace bits under arch/powerpc in this patchset? - Naveen
Re: [PATCH kernel v8 10/10] KVM: PPC: VFIO: Add in-kernel acceleration for VFIO
On 15/03/17 08:05, Alex Williamson wrote: > On Fri, 10 Mar 2017 14:53:37 +1100 > Alexey Kardashevskiywrote: > >> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT >> and H_STUFF_TCE requests targeted an IOMMU TCE table used for VFIO >> without passing them to user space which saves time on switching >> to user space and back. >> >> This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM. >> KVM tries to handle a TCE request in the real mode, if failed >> it passes the request to the virtual mode to complete the operation. >> If it a virtual mode handler fails, the request is passed to >> the user space; this is not expected to happen though. >> >> To avoid dealing with page use counters (which is tricky in real mode), >> this only accelerates SPAPR TCE IOMMU v2 clients which are required >> to pre-register the userspace memory. The very first TCE request will >> be handled in the VFIO SPAPR TCE driver anyway as the userspace view >> of the TCE table (iommu_table::it_userspace) is not allocated till >> the very first mapping happens and we cannot call vmalloc in real mode. >> >> If we fail to update a hardware IOMMU table unexpected reason, we just >> clear it and move on as there is nothing really we can do about it - >> for example, if we hot plug a VFIO device to a guest, existing TCE tables >> will be mirrored automatically to the hardware and there is no interface >> to report to the guest about possible failures. >> >> This adds new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to >> the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd >> and associates a physical IOMMU table with the SPAPR TCE table (which >> is a guest view of the hardware IOMMU table). The iommu_table object >> is cached and referenced so we do not have to look up for it in real mode. >> >> This does not implement the UNSET counterpart as there is no use for it - >> once the acceleration is enabled, the existing userspace won't >> disable it unless a VFIO container is destroyed; this adds necessary >> cleanup to the KVM_DEV_VFIO_GROUP_DEL handler. >> >> As this creates a descriptor per IOMMU table-LIOBN couple (called >> kvmppc_spapr_tce_iommu_table), it is possible to have several >> descriptors with the same iommu_table (hardware IOMMU table) attached >> to the same LIOBN; we do not remove duplicates though as >> iommu_table_ops::exchange not just update a TCE entry (which is >> shared among IOMMU groups) but also invalidates the TCE cache >> (one per IOMMU group). >> >> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user >> space. >> >> This adds real mode version of WARN_ON_ONCE() as the generic version >> causes problems with rcu_sched. Since we testing what vmalloc_to_phys() >> returns in the code, this also adds a check for already existing >> vmalloc_to_phys() call in kvmppc_rm_h_put_tce_indirect(). >> >> This finally makes use of vfio_external_user_iommu_id() which was >> introduced quite some time ago and was considered for removal. >> >> Tests show that this patch increases transmission speed from 220MB/s >> to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card). >> >> Signed-off-by: Alexey Kardashevskiy >> --- >> Changes: >> v8: >> * changed all (!pua) checks to return H_TOO_HARD as ioctl() is supposed >> to handle them >> * changed vmalloc_to_phys() callers to return H_HARDWARE >> * changed real mode iommu_tce_xchg_rm() callers to return H_TOO_HARD >> and added a comment about this in the code >> * changed virtual mode iommu_tce_xchg() callers to return H_HARDWARE >> and do WARN_ON >> * added WARN_ON_ONCE_RM(!rmap) in kvmppc_rm_h_put_tce_indirect() to >> have all vmalloc_to_phys() callsites covered >> >> v7: >> * added realmode-friendly WARN_ON_ONCE_RM >> >> v6: >> * changed handling of errors returned by kvmppc_(rm_)tce_iommu_(un)map() >> * moved kvmppc_gpa_to_ua() to TCE validation >> >> v5: >> * changed error codes in multiple places >> * added bunch of WARN_ON() in places which should not really happen >> * adde a check that an iommu table is not attached already to LIOBN >> * dropped explicit calls to iommu_tce_clear_param_check/ >> iommu_tce_put_param_check as kvmppc_tce_validate/kvmppc_ioba_validate >> call them anyway (since the previous patch) >> * if we fail to update a hardware IOMMU table for unexpected reason, >> this just clears the entry >> >> v4: >> * added note to the commit log about allowing multiple updates of >> the same IOMMU table; >> * instead of checking for if any memory was preregistered, this >> returns H_TOO_HARD if a specific page was not; >> * fixed comments from v3 about error handling in many places; >> * simplified TCE handlers and merged IOMMU parts inline - for example, >> there used to be kvmppc_h_put_tce_iommu(), now it is merged into >> kvmppc_h_put_tce(); this allows to check IOBA boundaries against >> the first attached table only (makes the code simpler); >> >> v3:
Re: [PATCH v3 1/2] powerpc: split ftrace bits into a separate file
On Wed, 15 Mar 2017 14:35:16 +0530 "Naveen N. Rao"wrote: > I don't have a strong opinion about this, but I feel that x86 can simply > use ftrace_64.S, seeing as the current name is mcount_64.S. > > Other architectures could do something similar too, or fall back to > ftrace_hook.S. That way, all ftrace low-level code can simply be > referred to as arch/*/ftrace_*.S Just to clarify, I'm currently working on patches to clean up the ftrace.S code in x86, and I renamed mcount_64.S to ftrace_64.S. I'm also moving the ftrace code out of entry_32.S into a ftrace_32.S file as well. Patches will hopefully be posted soon. -- Steve
[PATCH 1/4] crypto: powerpc - Factor out the core CRC vpmsum algorithm
The core nuts and bolts of the crc32c vpmsum algorithm will also work for a number of other CRC algorithms with different polynomials. Factor out the function into a new asm file. To handle multiple users of the function, a user simply provides constants, defines the name of their CRC function, and then #includes the core algorithm file. Cc: Anton BlanchardSigned-off-by: Daniel Axtens -- It's possible at this point to argue that the address of the constant tables should be passed in to the function, rather than doing this somewhat unconventional #include. However, we're about to add further #ifdef's back into the core that will be provided by the encapsulaing code, and which couldn't be done as a variable without performance loss. --- arch/powerpc/crypto/crc32-vpmsum_core.S | 726 arch/powerpc/crypto/crc32c-vpmsum_asm.S | 714 +-- 2 files changed, 729 insertions(+), 711 deletions(-) create mode 100644 arch/powerpc/crypto/crc32-vpmsum_core.S diff --git a/arch/powerpc/crypto/crc32-vpmsum_core.S b/arch/powerpc/crypto/crc32-vpmsum_core.S new file mode 100644 index ..629244ef170e --- /dev/null +++ b/arch/powerpc/crypto/crc32-vpmsum_core.S @@ -0,0 +1,726 @@ +/* + * Core of the accelerated CRC algorithm. + * In your file, define the constants and CRC_FUNCTION_NAME + * Then include this file. + * + * Calculate the checksum of data that is 16 byte aligned and a multiple of + * 16 bytes. + * + * The first step is to reduce it to 1024 bits. We do this in 8 parallel + * chunks in order to mask the latency of the vpmsum instructions. If we + * have more than 32 kB of data to checksum we repeat this step multiple + * times, passing in the previous 1024 bits. + * + * The next step is to reduce the 1024 bits to 64 bits. This step adds + * 32 bits of 0s to the end - this matches what a CRC does. We just + * calculate constants that land the data in this 32 bits. + * + * We then use fixed point Barrett reduction to compute a mod n over GF(2) + * for n = CRC using POWER8 instructions. We use x = 32. + * + * http://en.wikipedia.org/wiki/Barrett_reduction + * + * Copyright (C) 2015 Anton Blanchard , IBM + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version + * 2 of the License, or (at your option) any later version. +*/ + +#include +#include + +#define MAX_SIZE 32768 + + .text + +#if defined(__BIG_ENDIAN__) +#define BYTESWAP_DATA +#else +#undef BYTESWAP_DATA +#endif + +#define off16 r25 +#define off32 r26 +#define off48 r27 +#define off64 r28 +#define off80 r29 +#define off96 r30 +#define off112 r31 + +#define const1 v24 +#define const2 v25 + +#define byteswap v26 +#definemask_32bit v27 +#definemask_64bit v28 +#define zeroes v29 + +#ifdef BYTESWAP_DATA +#define VPERM(A, B, C, D) vpermA, B, C, D +#else +#define VPERM(A, B, C, D) +#endif + +/* unsigned int CRC_FUNCTION_NAME(unsigned int crc, void *p, unsigned long len) */ +FUNC_START(CRC_FUNCTION_NAME) + std r31,-8(r1) + std r30,-16(r1) + std r29,-24(r1) + std r28,-32(r1) + std r27,-40(r1) + std r26,-48(r1) + std r25,-56(r1) + + li off16,16 + li off32,32 + li off48,48 + li off64,64 + li off80,80 + li off96,96 + li off112,112 + li r0,0 + + /* Enough room for saving 10 non volatile VMX registers */ + subir6,r1,56+10*16 + subir7,r1,56+2*16 + + stvxv20,0,r6 + stvxv21,off16,r6 + stvxv22,off32,r6 + stvxv23,off48,r6 + stvxv24,off64,r6 + stvxv25,off80,r6 + stvxv26,off96,r6 + stvxv27,off112,r6 + stvxv28,0,r7 + stvxv29,off16,r7 + + mr r10,r3 + + vxorzeroes,zeroes,zeroes + vspltisw v0,-1 + + vsldoi mask_32bit,zeroes,v0,4 + vsldoi mask_64bit,zeroes,v0,8 + + /* Get the initial value into v8 */ + vxorv8,v8,v8 + MTVRD(v8, R3) + vsldoi v8,zeroes,v8,8 /* shift into bottom 32 bits */ + +#ifdef BYTESWAP_DATA + addis r3,r2,.byteswap_constant@toc@ha + addir3,r3,.byteswap_constant@toc@l + + lvx byteswap,0,r3 + addir3,r3,16 +#endif + + cmpdi r5,256 + blt .Lshort + + rldicr r6,r5,0,56 + + /* Checksum in blocks of MAX_SIZE */ +1: lis r7,MAX_SIZE@h + ori r7,r7,MAX_SIZE@l + mr r9,r7 + cmpdr6,r7 + bgt 2f + mr r7,r6 +2: subfr6,r7,r6 + + /* our main loop does 128 bytes at a time */ + srdi
[PATCH 4/4] crypto: powerpc - Stress test for vpmsum implementations
vpmsum implementations often don't kick in for short test vectors. This is a simple test module that does a configurable number of random tests, each up to 64kB and each with random offsets. Both CRC-T10DIF and CRC32C are tested. Cc: Anton BlanchardSigned-off-by: Daniel Axtens -- Not super fussy about the inclusion or otherwise of this - it was very useful for debugging my code, and more tests are good :) Also, I originally found the bug in Anton's CRC32c using this. Tests pass on both BE 64 bit and LE 64 bit. --- arch/powerpc/crypto/Makefile | 1 + arch/powerpc/crypto/crc-vpmsum_test.c | 137 ++ crypto/Kconfig| 8 ++ 3 files changed, 146 insertions(+) create mode 100644 arch/powerpc/crypto/crc-vpmsum_test.c diff --git a/arch/powerpc/crypto/Makefile b/arch/powerpc/crypto/Makefile index e66aaf19764d..67eca3af9fc7 100644 --- a/arch/powerpc/crypto/Makefile +++ b/arch/powerpc/crypto/Makefile @@ -11,6 +11,7 @@ obj-$(CONFIG_CRYPTO_SHA1_PPC_SPE) += sha1-ppc-spe.o obj-$(CONFIG_CRYPTO_SHA256_PPC_SPE) += sha256-ppc-spe.o obj-$(CONFIG_CRYPTO_CRC32C_VPMSUM) += crc32c-vpmsum.o obj-$(CONFIG_CRYPTO_CRCT10DIF_VPMSUM) += crct10dif-vpmsum.o +obj-$(CONFIG_CRYPTO_VPMSUM_TESTER) += crc-vpmsum_test.o aes-ppc-spe-y := aes-spe-core.o aes-spe-keys.o aes-tab-4k.o aes-spe-modes.o aes-spe-glue.o md5-ppc-y := md5-asm.o md5-glue.o diff --git a/arch/powerpc/crypto/crc-vpmsum_test.c b/arch/powerpc/crypto/crc-vpmsum_test.c new file mode 100644 index ..d58242557f33 --- /dev/null +++ b/arch/powerpc/crypto/crc-vpmsum_test.c @@ -0,0 +1,137 @@ +/* + * CRC vpmsum tester + * Copyright 2017 Daniel Axtens, IBM Corporation. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 as + * published by the Free Software Foundation. + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include + +static unsigned long iterations = 1; + +#define MAX_CRC_LENGTH 65535 + + +static int __init crc_test_init(void) +{ + u16 crc16 = 0, verify16 = 0; + u32 crc32 = 0, verify32 = 0; + __le32 verify32le = 0; + unsigned char *data; + unsigned long i; + int ret; + + struct crypto_shash *crct10dif_tfm; + struct crypto_shash *crc32c_tfm; + + if (!cpu_has_feature(CPU_FTR_ARCH_207S)) + return -ENODEV; + + data = kmalloc(MAX_CRC_LENGTH, GFP_KERNEL); + if (!data) + return -ENOMEM; + + crct10dif_tfm = crypto_alloc_shash("crct10dif", 0, 0); + + if (IS_ERR(crct10dif_tfm)) { + pr_err("Error allocating crc-t10dif\n"); + goto free_buf; + } + + crc32c_tfm = crypto_alloc_shash("crc32c", 0, 0); + + if (IS_ERR(crc32c_tfm)) { + pr_err("Error allocating crc32c\n"); + goto free_16; + } + + do { + SHASH_DESC_ON_STACK(crct10dif_shash, crct10dif_tfm); + SHASH_DESC_ON_STACK(crc32c_shash, crc32c_tfm); + + crct10dif_shash->tfm = crct10dif_tfm; + ret = crypto_shash_init(crct10dif_shash); + + if (ret) { + pr_err("Error initing crc-t10dif\n"); + goto free_32; + } + + + crc32c_shash->tfm = crc32c_tfm; + ret = crypto_shash_init(crc32c_shash); + + if (ret) { + pr_err("Error initing crc32c\n"); + goto free_32; + } + + pr_info("crc-vpmsum_test begins, %lu iterations\n", iterations); + for (i=0; i
[PATCH 3/4] crypto: powerpc - Add CRC-T10DIF acceleration
T10DIF is a CRC16 used heavily in NVMe. It turns out we can accelerate it with a CRC32 library and a few little tricks. Provide the accelerator based the refactored CRC32 code. Cc: Anton BlanchardThanks-to: Hong Bo Peng Signed-off-by: Daniel Axtens --- arch/powerpc/crypto/Makefile| 2 + arch/powerpc/crypto/crct10dif-vpmsum_asm.S | 850 arch/powerpc/crypto/crct10dif-vpmsum_glue.c | 125 crypto/Kconfig | 9 + 4 files changed, 986 insertions(+) create mode 100644 arch/powerpc/crypto/crct10dif-vpmsum_asm.S create mode 100644 arch/powerpc/crypto/crct10dif-vpmsum_glue.c diff --git a/arch/powerpc/crypto/Makefile b/arch/powerpc/crypto/Makefile index 87f40454bad3..e66aaf19764d 100644 --- a/arch/powerpc/crypto/Makefile +++ b/arch/powerpc/crypto/Makefile @@ -10,6 +10,7 @@ obj-$(CONFIG_CRYPTO_SHA1_PPC) += sha1-powerpc.o obj-$(CONFIG_CRYPTO_SHA1_PPC_SPE) += sha1-ppc-spe.o obj-$(CONFIG_CRYPTO_SHA256_PPC_SPE) += sha256-ppc-spe.o obj-$(CONFIG_CRYPTO_CRC32C_VPMSUM) += crc32c-vpmsum.o +obj-$(CONFIG_CRYPTO_CRCT10DIF_VPMSUM) += crct10dif-vpmsum.o aes-ppc-spe-y := aes-spe-core.o aes-spe-keys.o aes-tab-4k.o aes-spe-modes.o aes-spe-glue.o md5-ppc-y := md5-asm.o md5-glue.o @@ -17,3 +18,4 @@ sha1-powerpc-y := sha1-powerpc-asm.o sha1.o sha1-ppc-spe-y := sha1-spe-asm.o sha1-spe-glue.o sha256-ppc-spe-y := sha256-spe-asm.o sha256-spe-glue.o crc32c-vpmsum-y := crc32c-vpmsum_asm.o crc32c-vpmsum_glue.o +crct10dif-vpmsum-y := crct10dif-vpmsum_asm.o crct10dif-vpmsum_glue.o diff --git a/arch/powerpc/crypto/crct10dif-vpmsum_asm.S b/arch/powerpc/crypto/crct10dif-vpmsum_asm.S new file mode 100644 index ..5e3d81a0af1b --- /dev/null +++ b/arch/powerpc/crypto/crct10dif-vpmsum_asm.S @@ -0,0 +1,850 @@ +/* + * Calculate a CRC T10DIF with vpmsum acceleration + * + * Constants generated by crc32-vpmsum, available at + * https://github.com/antonblanchard/crc32-vpmsum + * + * crc32-vpmsum is + * Copyright (C) 2015 Anton Blanchard , IBM + * and is available under the GPL v2 or later. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version + * 2 of the License, or (at your option) any later version. + */ + .section.rodata +.balign 16 + +.byteswap_constant: + /* byte reverse permute constant */ + .octa 0x0F0E0D0C0B0A09080706050403020100 + +.constants: + + /* Reduce 262144 kbits to 1024 bits */ + /* x^261184 mod p(x), x^261120 mod p(x) */ + .octa 0x56d35255 + + /* x^260160 mod p(x), x^260096 mod p(x) */ + .octa 0xee67a1e4 + + /* x^259136 mod p(x), x^259072 mod p(x) */ + .octa 0x60834ad1 + + /* x^258112 mod p(x), x^258048 mod p(x) */ + .octa 0x8cfe9ab4 + + /* x^257088 mod p(x), x^257024 mod p(x) */ + .octa 0x3e93fdb5 + + /* x^256064 mod p(x), x^256000 mod p(x) */ + .octa 0x3c204548 + + /* x^255040 mod p(x), x^254976 mod p(x) */ + .octa 0xb1fc8d69 + + /* x^254016 mod p(x), x^253952 mod p(x) */ + .octa 0xf82b24ad + + /* x^252992 mod p(x), x^252928 mod p(x) */ + .octa 0x44429f1a + + /* x^251968 mod p(x), x^251904 mod p(x) */ + .octa 0xe88c66ec + + /* x^250944 mod p(x), x^250880 mod p(x) */ + .octa 0x385cc87d + + /* x^249920 mod p(x), x^249856 mod p(x) */ + .octa 0x3227c8ff + + /* x^248896 mod p(x), x^248832 mod p(x) */ + .octa 0xa9a93344 + + /* x^247872 mod p(x), x^247808 mod p(x) */ + .octa 0xabaa66eb + + /* x^246848 mod p(x), x^246784 mod p(x) */ + .octa 0x1ac3c4ef + + /* x^245824 mod p(x), x^245760 mod p(x) */ + .octa 0x63f056f3 + + /* x^244800 mod p(x), x^244736 mod p(x) */ + .octa 0x32cc0205 + + /* x^243776 mod p(x), x^243712 mod p(x) */ + .octa 0xf8b5568e + + /* x^242752 mod p(x), x^242688 mod p(x) */ + .octa 0x8db16429 + + /* x^241728 mod p(x), x^241664 mod p(x) */ + .octa 0x59ca6b66 + + /* x^240704 mod p(x), x^240640 mod p(x) */ + .octa 0x5f5c18f8 + + /* x^239680 mod p(x), x^239616 mod p(x) */ + .octa 0x61afb609 + + /* x^238656 mod p(x), x^238592 mod p(x) */ + .octa 0xe29e099a + +
[PATCH 2/4] crypto: powerpc - Re-enable non-REFLECTed CRCs
When CRC32c was included in the kernel, Anton ripped out the #ifdefs around reflected polynomials, because CRC32c is always reflected. However, not all CRCs use reflection so we'd like to make it optional. Restore the REFLECT parts from Anton's original CRC32 implementation (https://github.com/antonblanchard/crc32-vpmsum) That implementation is available under GPLv2+, so we're OK from a licensing point of view: https://github.com/antonblanchard/crc32-vpmsum/blob/master/LICENSE.TXT As CRC32c requires REFLECT, add that #define. Cc: Anton BlanchardSigned-off-by: Daniel Axtens --- I compared the disassembly of the CRC32c module on LE before and after the change, and verified that they were the same. I verified that the crypto self-tests still pass on LE and BE, and my tests in patch 4 still pass as well. --- arch/powerpc/crypto/crc32-vpmsum_core.S | 31 ++- arch/powerpc/crypto/crc32c-vpmsum_asm.S | 1 + 2 files changed, 31 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/crypto/crc32-vpmsum_core.S b/arch/powerpc/crypto/crc32-vpmsum_core.S index 629244ef170e..87fabf4d391a 100644 --- a/arch/powerpc/crypto/crc32-vpmsum_core.S +++ b/arch/powerpc/crypto/crc32-vpmsum_core.S @@ -35,7 +35,9 @@ .text -#if defined(__BIG_ENDIAN__) +#if defined(__BIG_ENDIAN__) && defined(REFLECT) +#define BYTESWAP_DATA +#elif defined(__LITTLE_ENDIAN__) && !defined(REFLECT) #define BYTESWAP_DATA #else #undef BYTESWAP_DATA @@ -108,7 +110,11 @@ FUNC_START(CRC_FUNCTION_NAME) /* Get the initial value into v8 */ vxorv8,v8,v8 MTVRD(v8, R3) +#ifdef REFLECT vsldoi v8,zeroes,v8,8 /* shift into bottom 32 bits */ +#else + vsldoi v8,v8,zeroes,4 /* shift into top 32 bits */ +#endif #ifdef BYTESWAP_DATA addis r3,r2,.byteswap_constant@toc@ha @@ -354,6 +360,7 @@ FUNC_START(CRC_FUNCTION_NAME) vxorv6,v6,v14 vxorv7,v7,v15 +#ifdef REFLECT /* * vpmsumd produces a 96 bit result in the least significant bits * of the register. Since we are bit reflected we have to shift it @@ -368,6 +375,7 @@ FUNC_START(CRC_FUNCTION_NAME) vsldoi v5,v5,zeroes,4 vsldoi v6,v6,zeroes,4 vsldoi v7,v7,zeroes,4 +#endif /* xor with last 1024 bits */ lvx v8,0,r4 @@ -511,13 +519,33 @@ FUNC_START(CRC_FUNCTION_NAME) vsldoi v1,v0,v0,8 vxorv0,v0,v1/* xor two 64 bit results together */ +#ifdef REFLECT /* shift left one bit */ vspltisb v1,1 vsl v0,v0,v1 +#endif vandv0,v0,mask_64bit +#ifndef REFLECT + /* +* Now for the Barrett reduction algorithm. The idea is to calculate q, +* the multiple of our polynomial that we need to subtract. By +* doing the computation 2x bits higher (ie 64 bits) and shifting the +* result back down 2x bits, we round down to the nearest multiple. +*/ + VPMSUMD(v1,v0,const1) /* ma */ + vsldoi v1,zeroes,v1,8 /* q = floor(ma/(2^64)) */ + VPMSUMD(v1,v1,const2) /* qn */ + vxorv0,v0,v1/* a - qn, subtraction is xor in GF(2) */ /* +* Get the result into r3. We need to shift it left 8 bytes: +* V0 [ 0 1 2 X ] +* V0 [ 0 X 2 3 ] +*/ + vsldoi v0,v0,zeroes,8 /* shift result into top 64 bits */ +#else + /* * The reflected version of Barrett reduction. Instead of bit * reflecting our data (which is expensive to do), we bit reflect our * constants and our algorithm, which means the intermediate data in @@ -537,6 +565,7 @@ FUNC_START(CRC_FUNCTION_NAME) * V0 [ 0 X 2 3 ] */ vsldoi v0,v0,zeroes,4 /* shift result into top 64 bits of */ +#endif /* Get it into r3 */ MFVRD(R3, v0) diff --git a/arch/powerpc/crypto/crc32c-vpmsum_asm.S b/arch/powerpc/crypto/crc32c-vpmsum_asm.S index c0d080caefc1..d2bea48051a0 100644 --- a/arch/powerpc/crypto/crc32c-vpmsum_asm.S +++ b/arch/powerpc/crypto/crc32c-vpmsum_asm.S @@ -842,4 +842,5 @@ .octa 0x000105ec76f1 #define CRC_FUNCTION_NAME __crc32c_vpmsum +#define REFLECT #include "crc32-vpmsum_core.S" -- 2.9.3
Re: [PATCH v2 1/6] powerpc/perf: Define big-endian version of perf_mem_data_src
On Wed, Mar 15, 2017 at 05:20:15PM +1100, Michael Ellerman wrote: > > I see no implementation; so why are you poking at it. > > Maddy has posted an implementation of the kernel part for powerpc in > patch 2 of this series, but maybe you're not on Cc? I am not indeed. That and a completely inadequate Changelog have lead to great confusion.
Re: [PATCH] drivers/pcmcia: NO_IRQ removal for electra_cf.c
On Wed, Mar 15, 2017 at 6:35 AM, Michael Ellermanwrote: > Arnd Bergmann writes: >> On Tue, Mar 14, 2017 at 11:51 AM, Michael Ellerman >> wrote: >>> Michael Ellerman writes: >>> >> >> drivers/ata/pata_mpc52xx.c: if (ata_irq == NO_IRQ) { >> drivers/ata/sata_dwc_460ex.c:#ifndef NO_IRQ >> drivers/ata/sata_dwc_460ex.c:#define NO_IRQ 0 >> drivers/ata/sata_dwc_460ex.c: if (hsdev->dma->irq == NO_IRQ) { >> drivers/ata/sata_dwc_460ex.c: if (irq == NO_IRQ) { >> drivers/iommu/fsl_pamu.c: if (irq == NO_IRQ) { >> drivers/iommu/fsl_pamu.c: if (irq != NO_IRQ) >> drivers/media/platform/fsl-viu.c: if (viu_irq == NO_IRQ) { >> drivers/mtd/nand/mpc5121_nfc.c: if (prv->irq == NO_IRQ) { >> drivers/pcmcia/electra_cf.c:cf->irq = NO_IRQ; >> drivers/pcmcia/electra_cf.c:if (cf->irq != NO_IRQ) >> drivers/soc/fsl/qe/qe_ic.c:/* Return an interrupt vector or NO_IRQ if >> no interrupt is pending. */ >> drivers/soc/fsl/qe/qe_ic.c: return NO_IRQ; >> drivers/soc/fsl/qe/qe_ic.c:/* Return an interrupt vector or NO_IRQ if >> no interrupt is pending. */ >> drivers/soc/fsl/qe/qe_ic.c: return NO_IRQ; >> drivers/soc/fsl/qe/qe_ic.c: if (qe_ic->virq_low == NO_IRQ) { >> drivers/soc/fsl/qe/qe_ic.c: if (qe_ic->virq_high != NO_IRQ && >> drivers/spi/spi-mpc52xx.c: if (status && (irq != NO_IRQ)) >> drivers/tty/ehv_bytechan.c: if (stdout_irq == NO_IRQ) { >> drivers/tty/ehv_bytechan.c: if ((bc->rx_irq == NO_IRQ) || >> (bc->tx_irq == NO_IRQ)) { >> drivers/tty/serial/cpm_uart/cpm_uart_core.c:if (pinfo->port.irq == >> NO_IRQ) { >> drivers/uio/uio_fsl_elbc_gpcm.c:if (irq != NO_IRQ) { >> drivers/uio/uio_fsl_elbc_gpcm.c:irq = NO_IRQ; >> drivers/uio/uio_fsl_elbc_gpcm.c: irq != NO_IRQ ? irq : -1); >> drivers/usb/host/ehci-grlib.c: if (irq == NO_IRQ) { >> drivers/usb/host/ehci-ppc-of.c: if (irq == NO_IRQ) { >> drivers/usb/host/fhci-hcd.c:if (usb_irq == NO_IRQ) { >> drivers/usb/host/ohci-ppc-of.c: if (irq == NO_IRQ) { >> drivers/usb/host/uhci-grlib.c: if (irq == NO_IRQ) { >> drivers/video/fbdev/mb862xx/mb862xxfbdrv.c: if (par->irq == NO_IRQ) { >> drivers/virt/fsl_hypervisor.c: if (!handle || (irq == NO_IRQ)) { >> include/soc/fsl/qe/qe_ic.h: if (cascade_irq != NO_IRQ) >> include/soc/fsl/qe/qe_ic.h: if (cascade_irq != NO_IRQ) >> include/soc/fsl/qe/qe_ic.h: if (cascade_irq != NO_IRQ) >> include/soc/fsl/qe/qe_ic.h: if (cascade_irq != NO_IRQ) >> include/soc/fsl/qe/qe_ic.h: if (cascade_irq == NO_IRQ) >> include/soc/fsl/qe/qe_ic.h: if (cascade_irq != NO_IRQ) >> >> Did you have other pending patches for those? > > No. I stayed away from anything FSL related as I was under the > impression some of them were being ported to arch/arm, which uses -1 for > NO_IRQ IIUIC. > > eg. all of include/soc/fsl and drivers/soc/fsl was moved from > arch/powerpc in commit 7aa1aa6ecec2, which said: > > QE: Move QE from arch/powerpc to drivers/soc > > ls1 has qe and ls1 has arm cpu. > move qe from arch/powerpc to drivers/soc/fsl > to adapt to powerpc and arm > > But looking at the Kconfigs it looks like they're still only selectable > on PPC. So that's a bit annoying. > > I'll do patches for everything above that's not drivers/soc or > include/soc and hopefully we can hear from someone at NXP on the plans > for getting the soc parts enabled on arm. I think the removal of the NO_IRQ references is a requirement for getting the drivers working properly on ARM, as the OF platform code will use '0' for invalid IRQs. Arnd
Re: [PATCH] powerpc/powernv: de-deuplicate OPAL call wrappers
Oliver O'Halloranwrites: > Currently the code to perform an OPAL call is duplicated between the > normal path and path taken when tracepoints are enabled. There's no > real need for this and combining them makes opal_tracepoint_entry > considerably easier to understand. Sorry I missed this when you sent it. I don't think it applies any more, but if you rebased it I'd merge it. cheers
Re: [PATCH 5/5] powerpc/smp: Add Power9 scheduler topology
Balbir Singhwrites: > On 02/03/17 11:49, Oliver O'Halloran wrote: >> In previous generations of Power processors each core had a private L2 >> cache. The Power9 processor has a slightly different architecture where >> the L2 cache is shared among pairs of cores rather than being completely >> private. >> >> Making the scheduler aware of this cache sharing allows the scheduler to >> make more intelligent migration decisions. When one core in the pair is >> overloaded tasks can be migrated to its paired core to improve throughput >> without cache-refilling penality typically associated with task >> migration. > > Could you please describe the changes to sched_domains w.r.t before and > after for P9. Yes please. cheers
Re: [PATCH 5/5] powerpc/smp: Add Power9 scheduler topology
Oliver O'Halloranwrites: > diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c > index 5571f30ff72d..5e1811b24415 100644 > --- a/arch/powerpc/kernel/smp.c > +++ b/arch/powerpc/kernel/smp.c > @@ -724,10 +724,17 @@ static void update_cpu_masks(int cpu, bool onlining) > > update_thread_mask(cpu, onlining); > > + /* we need the l2 cache mask for the power9 scheduler topology */ > + if (cpu_has_feature(CPU_FTR_ARCH_300)) > + update_mask_by_l2(cpu, onlining, cpu_cache_mask); I guess I missed something somewhere, why do we need to check the CPU feature? ... > @@ -829,7 +862,10 @@ void __init smp_cpus_done(unsigned int max_cpus) > > dump_numa_cpu_topology(); > > - set_sched_topology(powerpc_topology); > + if (cpu_has_feature(CPU_FTR_ARCH_300)) > + set_sched_topology(power9_topology); > + else > + set_sched_topology(powerpc_topology); Ideally this would just all come from the device tree. If we detect that the L2 is shared then we tell the scheduler that, we shouldn't need to know that P9 is special. It certainly doesn't say anywhere in ISA 3.00 that the L2 is shared. cheers
Re: [PATCH 4/5] powerpc/smp: add cpu_cache_mask
Oliver O'Halloranwrites: > Traditionally we have only ever tracked which CPUs are in the same core > (cpu_sibling_mask) and on the same die (cpu_core_mask). For Power9 we > need to be aware of which CPUs share cache with each other so this patch > adds cpu_cache_mask and the underlying cpu_cache_map variable to track > this. But which cache? Some CPUs on Power8 share L3, or L4. I think just call it cpu_l2cache_map to make it explicit. cheers
Re: [PATCH 3/5] powerpc/smp: Add update_cpu_masks()
Oliver O'Halloranwrites: > When adding and removing a CPU from the system the per-cpu masks that > are used by the scheduler to construct scheduler domains need to be updated > to account for the cpu entering or exiting the system. Currently logic this > is open-coded for the thread sibling mask and shared for the core mask. "logic this is" > This patch moves all the logic for rebuilding these masks into a single > function and simplifies the logic which determines which CPUs are within > a "core". The diff is hard to read but I think it's OK. Other than ... > diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c > index 1c531887ca51..3922cace927e 100644 > --- a/arch/powerpc/kernel/smp.c > +++ b/arch/powerpc/kernel/smp.c > @@ -630,14 +630,20 @@ int cpu_first_thread_of_core(int core) > } > EXPORT_SYMBOL_GPL(cpu_first_thread_of_core); > > -static void traverse_siblings_chip_id(int cpu, bool add, int chipid) > +static bool update_core_mask_by_chip_id(int cpu, bool add) ^ call it onlining like below > { > const struct cpumask *mask = add ? cpu_online_mask : cpu_present_mask; > + int chipid = cpu_to_chip_id(cpu); > int i; > > + if (chipid == -1) > + return false; > + > for_each_cpu(i, mask) > if (cpu_to_chip_id(i) == chipid) > set_cpus_related(cpu, i, add, cpu_core_mask); > + > + return true; > } > > /* Must be called when no change can occur to cpu_present_mask, > @@ -662,42 +668,72 @@ static struct device_node *cpu_to_l2cache(int cpu) > return cache; > } > > -static void traverse_core_siblings(int cpu, bool add) > +static bool update_core_mask_by_l2(int cpu, bool onlining) > { > + const struct cpumask *mask = onlining ? cpu_online_mask : > cpu_present_mask; > struct device_node *l2_cache, *np; > - const struct cpumask *mask; > - int chip_id; > int i; > > - /* threads that share a chip-id are considered siblings (same die) */ > - chip_id = cpu_to_chip_id(cpu); > - > - if (chip_id >= 0) { > - traverse_siblings_chip_id(cpu, add, chip_id); > - return; > - } > - > - /* if the chip-id fails then group siblings by the L2 cache */ > l2_cache = cpu_to_l2cache(cpu); > - mask = add ? cpu_online_mask : cpu_present_mask; > + if (l2_cache == NULL) > + return false; > + > for_each_cpu(i, mask) { > np = cpu_to_l2cache(i); > if (!np) > continue; > > if (np == l2_cache) > - set_cpus_related(cpu, i, add, cpu_core_mask); > + set_cpus_related(cpu, i, onlining, cpu_core_mask); > > of_node_put(np); > } > of_node_put(l2_cache); > + > + return true; > +} > + > +static void update_thread_mask(int cpu, bool onlining) > +{ > + int base = cpu_first_thread_sibling(cpu); > + int i; > + > + pr_info("CPUDEBUG: onlining cpu %d, base %d, thread_per_core %d", > + cpu, base, threads_per_core); That's too spammy for cpu hotplug. Make it pr_debug() at most. cheers
Re: [PATCH 2/5] powerpc/smp: add set_cpus_related()
Oliver O'Halloranwrites: > diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c > index dfe0e1d9cd06..1c531887ca51 100644 > --- a/arch/powerpc/kernel/smp.c > +++ b/arch/powerpc/kernel/smp.c > @@ -377,6 +377,25 @@ static void smp_store_cpu_info(int id) > #endif > } > > +/* > + * Relationships between CPUs are maintained in a set of per-cpu cpumasks. We > + * need to ensure that they are kept consistant between CPUs when they are > + * changed. > + * > + * This is slightly tricky since the core mask must be a strict superset of > + * the sibling mask. > + */ > +static void set_cpus_related(int i, int j, bool related, struct cpumask > *(*relation_fn)(int)) > +{ > + if (related) { > + cpumask_set_cpu(i, relation_fn(j)); > + cpumask_set_cpu(j, relation_fn(i)); > + } else { > + cpumask_clear_cpu(i, relation_fn(j)); > + cpumask_clear_cpu(j, relation_fn(i)); > + } > +} I think you pushed the abstraction one notch too far on this one, or perhaps not far enough. We end up with a function called "set" that might clear, depending on a bool you pass. Which is hard to parse, eg: set_cpus_related(cpu, base + i, false, cpu_sibling_mask); And I know there's two places where we pass an existing bool "add", but there's four where we pass true or false. If we want to push it in that direction I think we should just pass the set/clear routine instead of the flag, so: do_cpus_related(cpu, base + i, cpumask_clear_cpu, cpu_sibling_mask); But that might be overdoing it. So I think we should just do: static void set_cpus_related(int i, int j, struct cpumask *(*mask_func)(int)) { cpumask_set_cpu(i, mask_func(j)); cpumask_set_cpu(j, mask_func(i)); } static void clear_cpus_related(int i, int j, struct cpumask *(*mask_func)(int)) { cpumask_clear_cpu(i, mask_func(j)); cpumask_clear_cpu(j, mask_func(i)); } So the cases with add become: if (add) set_cpus_related(cpu, i, cpu_core_mask(i)); else clear_cpus_related(cpu, i, cpu_core_mask(i)); Which is not as pretty but more explicit. And the other cases look much better, eg: clear_cpus_related(cpu, base + i, cpu_sibling_mask); ?? cheers
Re: [PATCH 1/5] powerpc/smp: use cpu_to_chip_id() to find siblings
Oliver O'Halloranwrites: > To determine which logical CPUs are on the same core the kernel uses the > ibm,chipid property from the device tree node associated with that cpu. > The lookup for this this information is currently open coded in both > traverse_siblings() and traverse_siblings_chip_id(). This patch replaces > these manual lookups with the existing cpu_to_chip_id() function. Some minor nits. cpu_to_chip_id() actually searches recursively up the parents until it finds a ibm,chip-id, so it's not a 1:1 replacement for the existing logic, but it's probably still an OK conversion. It's still worth mentioning in the change log thought. > diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c > index 893bd7f79be6..dfe0e1d9cd06 100644 > --- a/arch/powerpc/kernel/smp.c > +++ b/arch/powerpc/kernel/smp.c > @@ -664,23 +655,19 @@ static void traverse_core_siblings(int cpu, bool add) > { > struct device_node *l2_cache, *np; > const struct cpumask *mask; > - int i, chip, plen; > - const __be32 *prop; > + int chip_id; > + int i; > > - /* First see if we have ibm,chip-id properties in cpu nodes */ > - np = of_get_cpu_node(cpu, NULL); > - if (np) { > - chip = -1; > - prop = of_get_property(np, "ibm,chip-id", ); > - if (prop && plen == sizeof(int)) > - chip = of_read_number(prop, 1); > - of_node_put(np); > - if (chip >= 0) { > - traverse_siblings_chip_id(cpu, add, chip); > - return; > - } > + /* threads that share a chip-id are considered siblings (same die) */ You might know it means the "same die", but AFAIK there's no actual definition for what the chip-id means, so let's not write comments that might be wrong in future. Just saying they're considered siblings is sufficient. Also "Threads" :) cheers
Re: [PATCH 2/3] powernv:idle: Don't override default/deepest directly in kernel
Hi Nick, Thanks for reviewing the patch. On Wed, Mar 15, 2017 at 12:05:43AM +1000, Nicholas Piggin wrote: > On Mon, 13 Mar 2017 11:31:27 +0530 > "Gautham R. Shenoy"wrote: > > > From: "Gautham R. Shenoy" > > > > Currently during idle-init on power9, if we don't find suitable stop > > states in the device tree that can be used as the > > default_stop/deepest_stop, we set stop0 (ESL=1,EC=1) as the default > > stop state psscr to be used by power9_idle and deepest stop state > > which is used by CPU-Hotplug. > > > > However, if the platform firmware has not configured or enabled a stop > > state, the kernel should not make any assumptions and fallback to a > > default choice. > > > > If the kernel uses a stop state that is not configured by the platform > > firmware, it may lead to further failures which should be avoided. > > > > In this patch, we modify the init code to ensure that the kernel uses > > only the stop states exposed by the firmware through the device > > tree. When a suitable default stop state isn't found, we disable > > ppc_md.power_save for power9. Similarly, when a suitable > > deepest_stop_state is not found in the device tree exported by the > > firmware, fall back to the default busy-wait loop in the CPU-Hotplug > > code. > > Seems reasonable. I have a few comments that you may consider. Nothing > too major. > > Btw., it would be nice to move this hotplug idling selection code to > idle.c. Have the hotplug just ask to enter the best available idle mode > and that's it. I'm not asking you to do that for this series, but perhaps > consider it for the future. That's not a bad idea. I will do it in the respin of the patchset. > > > > diff --git a/arch/powerpc/platforms/powernv/idle.c > > b/arch/powerpc/platforms/powernv/idle.c > > index 4ee837e..9fde6e4 100644 > > --- a/arch/powerpc/platforms/powernv/idle.c > > +++ b/arch/powerpc/platforms/powernv/idle.c > > @@ -147,7 +147,6 @@ u32 pnv_get_supported_cpuidle_states(void) > > } > > EXPORT_SYMBOL_GPL(pnv_get_supported_cpuidle_states); > > > > - > > static void pnv_fastsleep_workaround_apply(void *info) > > > > { > > @@ -243,6 +242,7 @@ static DEVICE_ATTR(fastsleep_workaround_applyonce, 0600, > > */ > > u64 pnv_default_stop_val; > > u64 pnv_default_stop_mask; > > +bool default_stop_found; > > > > /* > > * Used for ppc_md.power_save which needs a function with no parameters > > @@ -264,6 +264,7 @@ static void power9_idle(void) > > */ > > u64 pnv_deepest_stop_psscr_val; > > u64 pnv_deepest_stop_psscr_mask; > > +bool deepest_stop_found; > > > > /* > > * Power ISA 3.0 idle initialization. > > If the hotplug idle code was in idle.c, then all this deepest/default stop > logic and register settings would be static to idle.c, which would be nice. > > If you have a function to check if deepest stop is found, then you don't need > a non-static variable here (or for default_stop_found AFAIKS). Sure! > > > > @@ -352,7 +353,6 @@ static int __init pnv_power9_idle_init(struct > > device_node *np, u32 *flags, > > u32 *residency_ns = NULL; > > u64 max_residency_ns = 0; > > int rc = 0, i; > > - bool default_stop_found = false, deepest_stop_found = false; > > > > psscr_val = kcalloc(dt_idle_states, sizeof(*psscr_val), GFP_KERNEL); > > psscr_mask = kcalloc(dt_idle_states, sizeof(*psscr_mask), GFP_KERNEL); > > @@ -433,20 +433,22 @@ static int __init pnv_power9_idle_init(struct > > device_node *np, u32 *flags, > > } > > > > if (!default_stop_found) { > > - pnv_default_stop_val = PSSCR_HV_DEFAULT_VAL; > > - pnv_default_stop_mask = PSSCR_HV_DEFAULT_MASK; > > - pr_warn("Setting default stop psscr > > val=0x%016llx,mask=0x%016llx\n", > > + pr_warn("powernv:idle: Default stop not found. Disabling > > ppcmd.powersave\n"); > > + } else { > > + pr_info("powernv:idle: Default stop: psscr = > > 0x%016llx,mask=0x%016llx\n", > > pnv_default_stop_val, pnv_default_stop_mask); > > } > > > > if (!deepest_stop_found) { > > - pnv_deepest_stop_psscr_val = PSSCR_HV_DEFAULT_VAL; > > - pnv_deepest_stop_psscr_mask = PSSCR_HV_DEFAULT_MASK; > > - pr_warn("Setting default stop psscr > > val=0x%016llx,mask=0x%016llx\n", > > + pr_warn("powernv:idle: Deepest stop not found. CPU-Hotplug is > > affected\n"); > > I guess these warnings are meant for developers, but it might be nice > to make them slightly more meaningful? Prefix of choice seems to be > "cpuidle-powernv:" > > Then you could have "no suitable stop state provided by firmware. Disabling > idle power saving" and "no suitable stop state provided by firmware. Offlined > CPUs will busy-wait", perhaps? How about pr_warn("cpuidle-powernv: No suitable stop for CPU-Hotplug. Offlined CPUs will busy wait\n"); > > Just a suggestion. > > > + } else { > > +
Re: [FIX PATCH v1] powerpc/pseries: Fix reference count leak during CPU unplug
>> So you suggest that adding of_node_get() to __of_attach_node_sysfs() >> is the right fix ? > > If I understand that this only creates for hot-added cpus then no. Also > for this to be the correct fix I would expect to see this recreate for > all hot-remove operations such as memory and pci devices as well. > So I can recreate this problem while removing a I/O adapter as well. Here is a trace against 4.11.0-rc1-next20170310 while performing a DLPAR remove operation on a I/O adapter. [ 549.815605] rpaphp: Slot [U78C7.001.RCH0042-P1-C1] registered [ 549.815608] rpadlpar_io: slot PHB 64 added [ 575.267302] iommu: Removing device 0040:01:00.0 from group 1 [ 575.267401] iommu: Removing device 0040:01:00.1 from group 1 [ 575.267842] refcount_t: underflow; use-after-free. [ 575.267855] [ cut here ] [ 575.267862] WARNING: CPU: 2 PID: 3837 at lib/refcount.c:128 refcount_sub_and_test+0xf4/0x110 [ 575.267865] Modules linked in: rpadlpar_io rpaphp dccp_diag dccp tcp_diag udp_diag inet_diag unix_diag af_packet_diag netlink_diag rpcrdma sunrpc ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm iw_cxgb3 ib_core ghash_generic xts gf128mul tpm_ibmvtpm tpm vmx_crypto pseries_rng sg binfmt_misc ip_tables xfs libcrc32c sr_mod sd_mod cdrom cxgb3 ibmvscsi ibmveth scsi_transport_srp mdio dm_mirror dm_region_hash dm_log dm_mod [ 575.267904] CPU: 2 PID: 3837 Comm: drmgr Not tainted 4.11.0-rc1-next-20170310 #4 [ 575.267907] task: c0076f041600 task.stack: c0076f084000 [ 575.267910] NIP: c1aa69c4 LR: c1aa69c0 CTR: 006338e4 [ 575.267913] REGS: c0076f0878a0 TRAP: 0700 Not tainted (4.11.0-rc1-next-20170310) [ 575.267915] MSR: 80029033[ 575.267920] CR: 42002422 XER: 0007 [ 575.267923] CFAR: c1edb5e0 SOFTE: 1 [ 575.267923] GPR00: c1aa69c0 c0076f087b20 c2605f00 0026 [ 575.267923] GPR04: 80100fe93ec0 00492b9a [ 575.267923] GPR08: 0001 0007 0006 3ff0 [ 575.267923] GPR12: 2200 ce801200 [ 575.267923] GPR16: [ 575.267923] GPR20: [ 575.267923] GPR24: c001db50a78c 10016650 [ 575.267923] GPR28: c001dd1a7500 c001dd1a7200 c001db50a780 c001dd1a7258 [ 575.267955] NIP [c1aa69c4] refcount_sub_and_test+0xf4/0x110 [ 575.267958] LR [c1aa69c0] refcount_sub_and_test+0xf0/0x110 [ 575.267960] Call Trace: [ 575.267962] [c0076f087b20] [c1aa69c0] refcount_sub_and_test+0xf0/0x110 (unreliable) [ 575.267967] [c0076f087b80] [c1a84f1c] kobject_put+0x3c/0xa0 [ 575.267972] [c0076f087bf0] [c1d239b4] of_node_put+0x24/0x40 [ 575.267976] [c0076f087c10] [c165ce74] ofdt_write+0x204/0x6b0 [ 575.267980] [c0076f087cd0] [c197bde0] proc_reg_write+0x80/0xd0 [ 575.267984] [c0076f087d00] [c18df040] __vfs_write+0x40/0x1c0 [ 575.267987] [c0076f087d90] [c18e0998] vfs_write+0xc8/0x240 [ 575.267991] [c0076f087de0] [c18e2600] SyS_write+0x60/0x110 [ 575.267995] [c0076f087e30] [c15cb184] system_call+0x38/0xe0 [ 575.267997] Instruction dump: [ 575.267999] 7863d182 4e800020 7c0802a6 3921 3d42fff8 3c62ffb1 386306a0 992afd41 [ 575.268004] f8010010 f821ffa1 48434be5 6000 <0fe0> 38210060 3860 e8010010 [ 575.268010] ---[ end trace e6c0a4371a0aa4e2 ]— Thanks -Sachin
Re: [PATCH v2 16/23] MAINTAINERS: Add file patterns for powerpc device tree bindings
Hi Michael, On Wed, Mar 15, 2017 at 10:30 AM, Michael Ellermanwrote: > Geert Uytterhoeven writes: >> On Wed, Mar 15, 2017 at 1:19 AM, Michael Ellerman >> wrote: >>> Geert Uytterhoeven writes: Submitters of device tree binding documentation may forget to CC the subsystem maintainer if this is missing. Signed-off-by: Geert Uytterhoeven Cc: Benjamin Herrenschmidt Cc: Paul Mackerras Cc: Michael Ellerman Cc: linuxppc-dev@lists.ozlabs.org --- Please apply this patch directly if you want to be involved in device tree binding documentation for your subsystem. v2: - Rebased on top of commit a42715830d552d7c ("MAINTAINERS: Remove powerpc's "opal" pattern match), which just added "powerpc/opal", while obviously the whole "powerpc" hierarchy is of interest. Impact on next-20170310: >>> >>> Actual diff ignoring ordering etc: >>> >>> +Benjamin Herrenschmidt (supporter:LINUX FOR >>> POWERPC (32-BIT AND 64-BIT)) >>> +Paul Mackerras (supporter:LINUX FOR POWERPC (32-BIT >>> AND 64-BIT)) >>> +linuxppc-dev@lists.ozlabs.org (open list:LINUX FOR POWERPC (32-BIT AND >>> 64-BIT)) >>> -Scott Wood (commit_signer:5/11=45%) >>> -Zhao Qiang (commit_signer:4/11=36%,authored:4/11=36%) >>> -Christian Lamparter (commit_signer:1/11=9%) >>> -yangbo lu (authored:1/11=9%) >>> -"Otto Kekäläinen" (authored:1/11=9%) >>> -Chris Packham (authored:1/11=9%) >>> -York Sun (authored:1/11=9%) >>> >>> Which looks bad as all the NXP folks get dropped, but they should be >>> reading linuxppc-dev. So I think I'll merge this, unless anyone >>> disagrees. >> >> They got dropped because in the absence of a maintainer entry, the >> last committers are listed. If they need to be listed, I can send patches to >> add >> more specific entries for 4xx and fsl DT bindings, like: > > This is one is probably worth doing: > >> diff --git a/MAINTAINERS b/MAINTAINERS >> index 0f1c2f96c99aa936..40b392a4f399adbe 100644 >> --- a/MAINTAINERS >> +++ b/MAINTAINERS >> @@ -5266,6 +5266,7 @@ M:Scott Wood >> L: linuxppc-dev@lists.ozlabs.org >> L: linux-arm-ker...@lists.infradead.org >> S: Maintained >> +F: Documentation/devicetree/bindings/powerpc/fsl/ >> F: drivers/soc/fsl/ >> F: include/linux/fsl/ > > I'll fold it in before applying? Fine (less work) for me! Thanks! > But I wouldn't bother with 4xx, just falling back to linuxppc-dev is > fine for that. OK. Gr{oetje,eeting}s, Geert -- Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that. -- Linus Torvalds
Re: [PATCH v2 16/23] MAINTAINERS: Add file patterns for powerpc device tree bindings
Geert Uytterhoevenwrites: > On Wed, Mar 15, 2017 at 1:19 AM, Michael Ellerman wrote: >> Geert Uytterhoeven writes: >>> Submitters of device tree binding documentation may forget to CC >>> the subsystem maintainer if this is missing. >>> >>> Signed-off-by: Geert Uytterhoeven >>> Cc: Benjamin Herrenschmidt >>> Cc: Paul Mackerras >>> Cc: Michael Ellerman >>> Cc: linuxppc-dev@lists.ozlabs.org >>> --- >>> Please apply this patch directly if you want to be involved in device >>> tree binding documentation for your subsystem. >>> >>> v2: >>> - Rebased on top of commit a42715830d552d7c ("MAINTAINERS: Remove >>> powerpc's "opal" pattern match), which just added "powerpc/opal", >>> while obviously the whole "powerpc" hierarchy is of interest. >>> >>> Impact on next-20170310: >> >> Actual diff ignoring ordering etc: >> >> +Benjamin Herrenschmidt (supporter:LINUX FOR >> POWERPC (32-BIT AND 64-BIT)) >> +Paul Mackerras (supporter:LINUX FOR POWERPC (32-BIT AND >> 64-BIT)) >> +linuxppc-dev@lists.ozlabs.org (open list:LINUX FOR POWERPC (32-BIT AND >> 64-BIT)) >> -Scott Wood (commit_signer:5/11=45%) >> -Zhao Qiang (commit_signer:4/11=36%,authored:4/11=36%) >> -Christian Lamparter (commit_signer:1/11=9%) >> -yangbo lu (authored:1/11=9%) >> -"Otto Kekäläinen" (authored:1/11=9%) >> -Chris Packham (authored:1/11=9%) >> -York Sun (authored:1/11=9%) >> >> Which looks bad as all the NXP folks get dropped, but they should be >> reading linuxppc-dev. So I think I'll merge this, unless anyone >> disagrees. > > They got dropped because in the absence of a maintainer entry, the > last committers are listed. If they need to be listed, I can send patches to > add > more specific entries for 4xx and fsl DT bindings, like: This is one is probably worth doing: > diff --git a/MAINTAINERS b/MAINTAINERS > index 0f1c2f96c99aa936..40b392a4f399adbe 100644 > --- a/MAINTAINERS > +++ b/MAINTAINERS > @@ -5266,6 +5266,7 @@ M:Scott Wood > L: linuxppc-dev@lists.ozlabs.org > L: linux-arm-ker...@lists.infradead.org > S: Maintained > +F: Documentation/devicetree/bindings/powerpc/fsl/ > F: drivers/soc/fsl/ > F: include/linux/fsl/ I'll fold it in before applying? But I wouldn't bother with 4xx, just falling back to linuxppc-dev is fine for that. > @@ -7540,6 +7541,7 @@ M:Matt Porter > W: http://www.penguinppc.org/ > L: linuxppc-dev@lists.ozlabs.org > S: Maintained > +F: Documentation/devicetree/bindings/powerpc/4xx/ > F: arch/powerpc/platforms/40x/ > F: arch/powerpc/platforms/44x/ cheers
Re: [PATCH v5 2/5] powerpc: kretprobes: override default function entry offset
On 2017/03/14 10:18AM, Arnaldo Carvalho de Melo wrote: > Em Thu, Mar 09, 2017 at 05:37:38PM +1100, Michael Ellerman escreveu: > > "Naveen N. Rao"writes: > > > On 2017/03/08 11:29AM, Arnaldo Carvalho de Melo wrote: > > >> > I wasn't sure if you were planning on picking up KPROBES_ON_FTRACE for > > >> > v4.11. If so, it would be good to take this patch through the powerpc > > >> > tree. Otherwise, this can go via Ingo's tree. > > >> > > >> If you guys convince Ingo that this should go _now_, then just cherry > > >> pick what was merged into tip/perf/core that is needed for the arch > > >> specific stuff and go from there. > > > > > > Ok, in hindsight, I think Michael's concern was actually for v4.12 > > > > Yes I was talking about 4.12, sorry I thought that was implied :) > > > > > itself, in which case this particular patch can go via powerpc tree, > > > while the rest of the patches in this series can go via your tree. > > > > > > Michael? > > > > Yeah I think that's the easiest option. The function will be temporarily > > unused until the two trees are merged, but I think that's fine. > > Ok, done that, now compile testing building it in my > multi-distro/x-build containers. Thanks, Arnaldo! I did however notice that you don't seem to have applied Patch 1/5 from this series: https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1347858.html That patch is needed to ensure perf continues to work when ftrace README advertises support for ref_reloc_sym+offset for kretprobes. Can you please apply that as well? - Naveen
Re: [PATCH v3 1/2] powerpc: split ftrace bits into a separate file
On 2017/03/10 11:54AM, Steven Rostedt wrote: > On Fri, 10 Mar 2017 21:38:53 +0530 > "Naveen N. Rao"wrote: > > > On 2017/03/10 10:45AM, Steven Rostedt wrote: > > > On Thu, 02 Mar 2017 20:38:53 +1100 > > > Michael Ellerman wrote: > > > > > So if we drop that we're left with ftrace.S - which seems perfect to > > > > me. > > > > > > Yeah, I agree. But then there's the problem that ftrace.c and ftrace.S > > > will get the same ftrace.o. Maybe make it ftrace-hook.S ? > > > > I've avoided that issue by naming the files ftrace_32.S and ftrace_64.S > > (which gets further split up). > > That's what I looked at doing for x86 as well. But not all archs have > 32 / 64 splits. Should we look to have something that all archs can be > consistent with? I don't have a strong opinion about this, but I feel that x86 can simply use ftrace_64.S, seeing as the current name is mcount_64.S. Other architectures could do something similar too, or fall back to ftrace_hook.S. That way, all ftrace low-level code can simply be referred to as arch/*/ftrace_*.S - Naveen
Re: [PATCH 1/2] powerpc/mm: Refactor page table allocation
Balbir Singhwrites: > Introduce a helper pgtable_get_gfp_flags() which Can we just call it pgtable_gfp_flags() ? > just returns the current gfp flags. In a future > patch, we can enable __GFP_ACCOUNT based on the > calling context. > > Signed-off-by: Balbir Singh > --- > arch/powerpc/include/asm/book3s/64/pgalloc.h | 22 -- > arch/powerpc/mm/pgtable_64.c | 3 ++- It looks like you've only updated the 64-bit Book3S sites. Can you please do all of them. I think this is the full list: arch/powerpc/include/asm/book3s/32/pgalloc.h:static inline pgd_t *pgd_alloc(struct mm_struct *mm) arch/powerpc/include/asm/book3s/64/pgalloc.h:static inline pgd_t *radix__pgd_alloc(struct mm_struct *mm) arch/powerpc/include/asm/book3s/64/pgalloc.h:static inline pgd_t *pgd_alloc(struct mm_struct *mm) arch/powerpc/include/asm/nohash/32/pgalloc.h:static inline pgd_t *pgd_alloc(struct mm_struct *mm) arch/powerpc/include/asm/nohash/64/pgalloc.h:static inline pgd_t *pgd_alloc(struct mm_struct *mm) arch/powerpc/include/asm/book3s/64/pgalloc.h:static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr) arch/powerpc/include/asm/nohash/64/pgalloc.h:static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr) arch/powerpc/include/asm/book3s/64/pgalloc.h:static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr) arch/powerpc/include/asm/nohash/64/pgalloc.h:static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr) arch/powerpc/include/asm/book3s/64/pgalloc.h:static inline pgtable_t pte_alloc_one(struct mm_struct *mm, arch/powerpc/include/asm/book3s/64/pgalloc.h:static inline pgtable_t pte_alloc_one(struct mm_struct *mm, arch/powerpc/include/asm/nohash/64/pgalloc.h:static inline pgtable_t pte_alloc_one(struct mm_struct *mm, arch/powerpc/include/asm/nohash/64/pgalloc.h:static inline pgtable_t pte_alloc_one(struct mm_struct *mm, arch/powerpc/mm/pgtable_32.c:pgtable_t pte_alloc_one(struct mm_struct *mm, unsigned long address) arch/powerpc/mm/hugetlbpage.c:pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, unsigned long sz) > diff --git a/arch/powerpc/include/asm/book3s/64/pgalloc.h > b/arch/powerpc/include/asm/book3s/64/pgalloc.h > index cd5e7aa..d0a9ca6 100644 > --- a/arch/powerpc/include/asm/book3s/64/pgalloc.h > +++ b/arch/powerpc/include/asm/book3s/64/pgalloc.h > @@ -159,7 +168,8 @@ static inline pgtable_t pmd_pgtable(pmd_t pmd) > static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm, > unsigned long address) > { > - return (pte_t *)__get_free_page(GFP_KERNEL | __GFP_ZERO); > + return (pte_t *)__get_free_page( > + pgtable_get_gfp_flags(mm, PGALLOC_GFP)); > } There's no point doing pte_alloc_one_kernel(), it's explicitly for kernel allocations IIUI. cheers
[v2 PATCH] powernv-cpuidle: Validate DT property array size
From: "Gautham R. Shenoy"The various properties associated with powernv idle states such as names, flags, residency-ns, latencies-ns, psscr, psscr-mask are exposed in the device-tree as property arrays such the pointwise entries in each of these arrays correspond to the properties of the same idle state. This patch validates that the lengths of the property arrays are the same. If there is a mismatch, the patch will ensure that we bail out and not expose the platform idle states via cpuidle. Signed-off-by: Gautham R. Shenoy --- v1: https://lkml.org/lkml/2017/2/23/349 Changes from v1: Print the full property array name in warning message. drivers/cpuidle/cpuidle-powernv.c | 64 +-- 1 file changed, 61 insertions(+), 3 deletions(-) diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c index 3705930..a06df51 100644 --- a/drivers/cpuidle/cpuidle-powernv.c +++ b/drivers/cpuidle/cpuidle-powernv.c @@ -197,11 +197,25 @@ static inline void add_powernv_state(int index, const char *name, stop_psscr_table[index].mask = psscr_mask; } +/* + * Returns 0 if prop1_len == prop2_len. Else returns -1 + */ +static inline int validate_dt_prop_sizes(const char *prop1, int prop1_len, +const char *prop2, int prop2_len) +{ + if (prop1_len == prop2_len) + return 0; + + pr_warn("cpuidle-powernv: array sizes don't match for %s and %s\n", + prop1, prop2); + return -1; +} + static int powernv_add_idle_states(void) { struct device_node *power_mgt; int nr_idle_states = 1; /* Snooze */ - int dt_idle_states; + int dt_idle_states, count; u32 latency_ns[CPUIDLE_STATE_MAX]; u32 residency_ns[CPUIDLE_STATE_MAX]; u32 flags[CPUIDLE_STATE_MAX]; @@ -226,6 +240,21 @@ static int powernv_add_idle_states(void) goto out; } + count = of_property_count_u32_elems(power_mgt, + "ibm,cpu-idle-state-latencies-ns"); + + if (validate_dt_prop_sizes("ibm,cpu-idle-state-flags", dt_idle_states, + "ibm,cpu-idle-state-latencies-ns", + count) != 0) + goto out; + + count = of_property_count_strings(power_mgt, + "ibm,cpu-idle-state-names"); + if (validate_dt_prop_sizes("ibm,cpu-idle-state-flags", dt_idle_states, + "ibm,cpu-idle-state-names", + count) != 0) + goto out; + /* * Since snooze is used as first idle state, max idle states allowed is * CPUIDLE_STATE_MAX -1 @@ -260,6 +289,22 @@ static int powernv_add_idle_states(void) has_stop_states = (flags[0] & (OPAL_PM_STOP_INST_FAST | OPAL_PM_STOP_INST_DEEP)); if (has_stop_states) { + count = of_property_count_u64_elems(power_mgt, + "ibm,cpu-idle-state-psscr"); + if (validate_dt_prop_sizes("ibm,cpu-idle-state-flags", + dt_idle_states, + "ibm,cpu-idle-state-psscr", + count) != 0) + goto out; + + count = of_property_count_u64_elems(power_mgt, + "ibm,cpu-idle-state-psscr-mask"); + if (validate_dt_prop_sizes("ibm,cpu-idle-state-flags", + dt_idle_states, + "ibm,cpu-idle-state-psscr-mask", + count) != 0) + goto out; + if (of_property_read_u64_array(power_mgt, "ibm,cpu-idle-state-psscr", psscr_val, dt_idle_states)) { pr_warn("cpuidle-powernv: missing ibm,cpu-idle-state-psscr in DT\n"); @@ -274,8 +319,21 @@ static int powernv_add_idle_states(void) } } - rc = of_property_read_u32_array(power_mgt, - "ibm,cpu-idle-state-residency-ns", residency_ns, dt_idle_states); + count = of_property_count_u32_elems(power_mgt, + "ibm,cpu-idle-state-residency-ns"); + + if (count < 0) { + rc = count; + } else if (validate_dt_prop_sizes("ibm,cpu-idle-state-flags", + dt_idle_states, + "ibm,cpu-idle-state-residency-ns", + count) != 0) { + goto out; + } else { + rc = of_property_read_u32_array(power_mgt, +
Re: [PATCH v2 16/23] MAINTAINERS: Add file patterns for powerpc device tree bindings
Hi Michael, On Wed, Mar 15, 2017 at 1:19 AM, Michael Ellermanwrote: > Geert Uytterhoeven writes: >> Submitters of device tree binding documentation may forget to CC >> the subsystem maintainer if this is missing. >> >> Signed-off-by: Geert Uytterhoeven >> Cc: Benjamin Herrenschmidt >> Cc: Paul Mackerras >> Cc: Michael Ellerman >> Cc: linuxppc-dev@lists.ozlabs.org >> --- >> Please apply this patch directly if you want to be involved in device >> tree binding documentation for your subsystem. >> >> v2: >> - Rebased on top of commit a42715830d552d7c ("MAINTAINERS: Remove >> powerpc's "opal" pattern match), which just added "powerpc/opal", >> while obviously the whole "powerpc" hierarchy is of interest. >> >> Impact on next-20170310: > > Actual diff ignoring ordering etc: > > +Benjamin Herrenschmidt (supporter:LINUX FOR > POWERPC (32-BIT AND 64-BIT)) > +Paul Mackerras (supporter:LINUX FOR POWERPC (32-BIT AND > 64-BIT)) > +linuxppc-dev@lists.ozlabs.org (open list:LINUX FOR POWERPC (32-BIT AND > 64-BIT)) > -Scott Wood (commit_signer:5/11=45%) > -Zhao Qiang (commit_signer:4/11=36%,authored:4/11=36%) > -Christian Lamparter (commit_signer:1/11=9%) > -yangbo lu (authored:1/11=9%) > -"Otto Kekäläinen" (authored:1/11=9%) > -Chris Packham (authored:1/11=9%) > -York Sun (authored:1/11=9%) > > Which looks bad as all the NXP folks get dropped, but they should be > reading linuxppc-dev. So I think I'll merge this, unless anyone > disagrees. They got dropped because in the absence of a maintainer entry, the last committers are listed. If they need to be listed, I can send patches to add more specific entries for 4xx and fsl DT bindings, like: diff --git a/MAINTAINERS b/MAINTAINERS index 0f1c2f96c99aa936..40b392a4f399adbe 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -5266,6 +5266,7 @@ M:Scott Wood L: linuxppc-dev@lists.ozlabs.org L: linux-arm-ker...@lists.infradead.org S: Maintained +F: Documentation/devicetree/bindings/powerpc/fsl/ F: drivers/soc/fsl/ F: include/linux/fsl/ @@ -7540,6 +7541,7 @@ M:Matt Porter W: http://www.penguinppc.org/ L: linuxppc-dev@lists.ozlabs.org S: Maintained +F: Documentation/devicetree/bindings/powerpc/4xx/ F: arch/powerpc/platforms/40x/ F: arch/powerpc/platforms/44x/ OK? Gr{oetje,eeting}s, Geert -- Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that. -- Linus Torvalds
Re: [PATCH v2 1/6] powerpc/perf: Define big-endian version of perf_mem_data_src
Hi Peter, Peter Zijlstrawrites: > On Tue, Mar 14, 2017 at 02:31:51PM +0530, Madhavan Srinivasan wrote: > >> >Huh? PPC hasn't yet implemented this? Then why are you fixing it? >> >> yes, PPC hasn't implemented this (until now). > > until now where? On powerpc there is currently no kernel support for filling the data_src value with anything meaningful. A user can still request PERF_SAMPLE_DATA_SRC (perf report -d), but they just get the default value from perf_sample_data_init(), which is PERF_MEM_NA. Though even that is currently broken with a big endian perf tool. >> And did not understand "Then why are you fixing it?" > > I see no implementation; so why are you poking at it. Maddy has posted an implementation of the kernel part for powerpc in patch 2 of this series, but maybe you're not on Cc? Regardless of us wanting to do the kernel side on powerpc, the current API is broken on big endian. That's because in the kernel the PERF_MEM_NA value is constructed using shifts: /* TLB access */ #define PERF_MEM_TLB_NA 0x01 /* not available */ ... #define PERF_MEM_TLB_SHIFT26 #define PERF_MEM_S(a, s) \ (((__u64)PERF_MEM_##a##_##s) << PERF_MEM_##a##_SHIFT) #define PERF_MEM_NA (PERF_MEM_S(OP, NA) |\ PERF_MEM_S(LVL, NA) |\ PERF_MEM_S(SNOOP, NA) |\ PERF_MEM_S(LOCK, NA) |\ PERF_MEM_S(TLB, NA)) Which works out as: ((0x01 << 0) | (0x01 << 5) | (0x01 << 19) | (0x01 << 24) | (0x01 << 26)) Which means the PERF_MEM_NA value comes out of the kernel as 0x5080021 in CPU endian. But then in the perf tool, the code uses the bitfields to inspect the value, and currently the bitfields are defined using little endian ordering. So eg. in perf_mem__tlb_scnprintf() we see: data_src->val = 0x5080021 op = 0x0 lvl = 0x0 snoop = 0x0 lock = 0x0 dtlb = 0x0 rsvd = 0x5080021 So this patch does what I think is the minimal fix, of changing the definition of the bitfields to match the values that are already exported by the kernel on big endian. And it makes no change on little endian. cheers