Re: [PATCH 1/6 v5] Kernel DLPAR Infrastructure
On Thu, 2009-10-29 at 14:08 +1100, Benjamin Herrenschmidt wrote: On Wed, 2009-10-28 at 15:53 -0500, Nathan Fontenot wrote: + struct device_node *dn; + struct device_node *first_dn = NULL; + struct device_node *last_dn = NULL; + struct property *property; + struct property *last_property = NULL; + struct cc_workarea *ccwa; + int cc_token; + int rc; + + cc_token = rtas_token(ibm,configure-connector); + if (cc_token == RTAS_UNKNOWN_SERVICE) + return NULL; + + spin_lock(workarea_lock); + + ccwa = (struct cc_workarea *)workarea[0]; + ccwa-drc_index = drc_index; + ccwa-zero = 0; Popping a free page with gfp (or just kmalloc'ing 4K) would avoid the need for the lock too. Not kmalloc -- the alignment of the buffer isn't guaranteed when slub/slab debug is on, and iirc the work area needs to be 4K-aligned. __get_free_page should be fine, I think. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[RFC/PATCH 0/3] checkpoint/restart for powerpc
This series is a first attempt at enabling checkpoint/restart for powerpc. It is based on Oren Laadan's v13 checkpoint/restart series: http://lkml.org/lkml/2009/1/27/227 Nathan Lynch (3): powerpc: bare minimum checkpoint/restart implementation powerpc: wire up checkpoint and restart syscalls allow checkpoint/restart on powerpc arch/powerpc/include/asm/checkpoint_hdr.h | 40 + arch/powerpc/include/asm/systbl.h |2 + arch/powerpc/include/asm/unistd.h |4 +- arch/powerpc/mm/Makefile |1 + arch/powerpc/mm/checkpoint.c | 261 + checkpoint/Kconfig|2 +- 6 files changed, 308 insertions(+), 2 deletions(-) create mode 100644 arch/powerpc/include/asm/checkpoint_hdr.h create mode 100644 arch/powerpc/mm/checkpoint.c ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
[PATCH 2/3] powerpc: wire up checkpoint and restart syscalls
Signed-off-by: Nathan Lynch n...@pobox.com --- arch/powerpc/include/asm/systbl.h |2 ++ arch/powerpc/include/asm/unistd.h |4 +++- 2 files changed, 5 insertions(+), 1 deletions(-) diff --git a/arch/powerpc/include/asm/systbl.h b/arch/powerpc/include/asm/systbl.h index 803def2..cdb60b4 100644 --- a/arch/powerpc/include/asm/systbl.h +++ b/arch/powerpc/include/asm/systbl.h @@ -322,3 +322,5 @@ SYSCALL_SPU(epoll_create1) SYSCALL_SPU(dup3) SYSCALL_SPU(pipe2) SYSCALL(inotify_init1) +SYSCALL(checkpoint) +SYSCALL(restart) diff --git a/arch/powerpc/include/asm/unistd.h b/arch/powerpc/include/asm/unistd.h index e07d0c7..2e333a1 100644 --- a/arch/powerpc/include/asm/unistd.h +++ b/arch/powerpc/include/asm/unistd.h @@ -341,10 +341,12 @@ #define __NR_dup3 316 #define __NR_pipe2 317 #define __NR_inotify_init1 318 +#define __NR_checkpoint319 +#define __NR_restart 320 #ifdef __KERNEL__ -#define __NR_syscalls 319 +#define __NR_syscalls 321 #define __NR__exit __NR_exit #define NR_syscalls__NR_syscalls -- 1.5.5 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
[PATCH 3/3] allow checkpoint/restart on powerpc
Signed-off-by: Nathan Lynch n...@pobox.com --- checkpoint/Kconfig |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/checkpoint/Kconfig b/checkpoint/Kconfig index ffaa635..00036aa 100644 --- a/checkpoint/Kconfig +++ b/checkpoint/Kconfig @@ -1,7 +1,7 @@ config CHECKPOINT_RESTART prompt Enable checkpoint/restart (EXPERIMENTAL) def_bool n - depends on X86_32 EXPERIMENTAL + depends on (X86_32 || PPC) EXPERIMENTAL help Application checkpoint/restart is the ability to save the state of a running application so that it can later resume -- 1.5.5 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
[PATCH 1/3] powerpc: bare minimum checkpoint/restart implementation
The only thing of significance here is that the checkpointed task's pt_regs and fp state are saved and restored (see cr_write_cpu and cr_read_cpu); the rest of the code consists of dummy implementations of the APIs the arch needs to provide to the checkpoint/restart core. What works: * self and external checkpoint of simple (single thread, one open file) 32- and 64-bit processes on a ppc64 kernel What doesn't work: * restarting a 32-bit task from a 64-bit task and vice versa Untested: * ppc32 (but it builds) Signed-off-by: Nathan Lynch n...@pobox.com --- arch/powerpc/include/asm/checkpoint_hdr.h | 40 + arch/powerpc/mm/Makefile |1 + arch/powerpc/mm/checkpoint.c | 261 + 3 files changed, 302 insertions(+), 0 deletions(-) create mode 100644 arch/powerpc/include/asm/checkpoint_hdr.h create mode 100644 arch/powerpc/mm/checkpoint.c diff --git a/arch/powerpc/include/asm/checkpoint_hdr.h b/arch/powerpc/include/asm/checkpoint_hdr.h new file mode 100644 index 000..752c53f --- /dev/null +++ b/arch/powerpc/include/asm/checkpoint_hdr.h @@ -0,0 +1,40 @@ +#ifndef __ASM_PPC_CKPT_HDR_H +#define __ASM_PPC_CKPT_HDR_H +/* + * Checkpoint/restart - architecture specific headers ppc + * + * Copyright (C) 2008 Oren Laadan + * + * This file is subject to the terms and conditions of the GNU General Public + * License. See the file COPYING in the main directory of the Linux + * distribution for more details. + */ + +#include linux/types.h +#include asm/ptrace.h +#include asm/mmu.h +#include asm/processor.h + +struct cr_hdr_head_arch { + __u32 unimplemented; +}; + +struct cr_hdr_thread { + __u32 unimplemented; +}; + +struct cr_hdr_cpu { + struct pt_regs pt_regs; + /* relevant fields from thread_struct */ + double fpr[32][TS_FPRWIDTH]; + unsigned int fpscr; + int fpexc_mode; + /* unsigned int align_ctl; this is never updated? */ + unsigned long dabr; +}; + +struct cr_hdr_mm_context { + __u32 unimplemented; +}; + +#endif /* __ASM_PPC_CKPT_HDR__H */ diff --git a/arch/powerpc/mm/Makefile b/arch/powerpc/mm/Makefile index e7392b4..8a523a0 100644 --- a/arch/powerpc/mm/Makefile +++ b/arch/powerpc/mm/Makefile @@ -24,3 +24,4 @@ obj-$(CONFIG_NEED_MULTIPLE_NODES) += numa.o obj-$(CONFIG_PPC_MM_SLICES)+= slice.o obj-$(CONFIG_HUGETLB_PAGE) += hugetlbpage.o obj-$(CONFIG_PPC_SUBPAGE_PROT) += subpage-prot.o +obj-$(CONFIG_CHECKPOINT_RESTART) += checkpoint.o diff --git a/arch/powerpc/mm/checkpoint.c b/arch/powerpc/mm/checkpoint.c new file mode 100644 index 000..8cdff24 --- /dev/null +++ b/arch/powerpc/mm/checkpoint.c @@ -0,0 +1,261 @@ +/* + * Checkpoint/restart - architecture specific support for powerpc. + * Based on x86 implementation. + * + * Copyright (C) 2008 Oren Laadan + * Copyright 2009 IBM Corp. + * + * This file is subject to the terms and conditions of the GNU General Public + * License. See the file COPYING in the main directory of the Linux + * distribution for more details. + */ + +#define DEBUG 1 /* for pr_debug */ + +#include linux/checkpoint.h +#include linux/checkpoint_hdr.h +#include linux/kernel.h +#include asm/processor.h + +static void cr_hdr_init(struct cr_hdr *hdr, __s16 type, __s16 len, __u32 parent) +{ + hdr-type = type; + hdr-len = len; + hdr-parent = parent; +} + +/* dump the thread_struct of a given task */ +int cr_write_thread(struct cr_ctx *ctx, struct task_struct *t) +{ + struct cr_hdr_thread *thread_hdr; + struct cr_hdr cr_hdr; + u32 parent; + int ret; + + thread_hdr = cr_hbuf_get(ctx, sizeof(*thread_hdr)); + if (!thread_hdr) + return -ENOMEM; + + parent = task_pid_vnr(t); + + cr_hdr_init(cr_hdr, CR_HDR_THREAD, sizeof(*thread_hdr), parent); + + thread_hdr-unimplemented = 0xdeadbeef; + + ret = cr_write_obj(ctx, cr_hdr, thread_hdr); + cr_hbuf_put(ctx, sizeof(*thread_hdr)); + + return ret; +} + +/* dump the cpu state and registers of a given task */ +int cr_write_cpu(struct cr_ctx *ctx, struct task_struct *t) +{ + struct cr_hdr_cpu *cpu_hdr; + struct pt_regs *pt_regs; + struct cr_hdr cr_hdr; + u32 parent; + int ret; + + cpu_hdr = cr_hbuf_get(ctx, sizeof(*cpu_hdr)); + if (!cpu_hdr) + return -ENOMEM; + + parent = task_pid_vnr(t); + + cr_hdr_init(cr_hdr, CR_HDR_CPU, sizeof(*cpu_hdr), parent); + + /* pt_regs: GPRs, MSR, etc */ + pt_regs = task_pt_regs(t); + cpu_hdr-pt_regs = *pt_regs; + + /* FP state */ + memcpy(cpu_hdr-fpr, t-thread.fpr, sizeof(cpu_hdr-fpr)); + cpu_hdr-fpscr = t-thread.fpscr.val; + cpu_hdr-fpexc_mode = t-thread.fpexc_mode; + + /* Handle DABR for now, dbcr[01] later */ + cpu_hdr-dabr = t-thread.dabr; + + /* ToDo: Altivec/VSX/SPE state */ + + ret = cr_write_obj(ctx, cr_hdr, cpu_hdr); + cr_hbuf_put
Re: [PATCH 1/3] powerpc: bare minimum checkpoint/restart implementation
Hey Oren, thanks for taking a look. Oren Laadan wrote: Nathan Lynch wrote: What doesn't work: * restarting a 32-bit task from a 64-bit task and vice versa Is there a test to bail if we attempt to checkpoint such tasks ? No, but I'll add one if it looks too hard to fix for the next round. +struct cr_hdr_cpu { + struct pt_regs pt_regs; It has been suggested (as done in x86/32 code) not to use 'struct pt_regs' because it can (and has) changed on x86 and because it only container the registers that the kernel trashes, not all usermode registers. https://lists.linux-foundation.org/pipermail/containers/2008-August/012355.html Yeah, I considered that discussion, but the situation is different for powerpc (someone on linuxppc-dev smack me if I'm wrong here :) pt_regs is part of the ABI, and it encompasses all user mode registers except for floating point, which are handled separately. + /* relevant fields from thread_struct */ + double fpr[32][TS_FPRWIDTH]; Can TS_FPRWIDTH change between sub-archs or kernel versions ? If so, it needs to be stated explicitly. + unsigned int fpscr; + int fpexc_mode; + /* unsigned int align_ctl; this is never updated? */ + unsigned long dabr; Are these fields always guarantee to compile to the same number of bytes regardless of 32/64 bit choice of compiler (or sub-arch?) ? In the x86(32/64) architecture we use types with explicit size such as __u32 and the like to ensure that it always compiled to the same size. Yeah, I'll have to fix these up. +static void cr_hdr_init(struct cr_hdr *hdr, __s16 type, __s16 len, __u32 parent) +{ + hdr-type = type; + hdr-len = len; + hdr-parent = parent; +} + This function is rather generic and useful to non-arch-dependent and other architectures code. Perhaps put in a separate patch ? Alright. By the way, why are cr_hdr-type and cr_hdr-len signed types? +int cr_write_cpu(struct cr_ctx *ctx, struct task_struct *t) +{ + struct cr_hdr_cpu *cpu_hdr; + struct pt_regs *pt_regs; + struct cr_hdr cr_hdr; + u32 parent; + int ret; + + cpu_hdr = cr_hbuf_get(ctx, sizeof(*cpu_hdr)); + if (!cpu_hdr) + return -ENOMEM; + + parent = task_pid_vnr(t); + + cr_hdr_init(cr_hdr, CR_HDR_CPU, sizeof(*cpu_hdr), parent); + + /* pt_regs: GPRs, MSR, etc */ + pt_regs = task_pt_regs(t); + cpu_hdr-pt_regs = *pt_regs; + + /* FP state */ + memcpy(cpu_hdr-fpr, t-thread.fpr, sizeof(cpu_hdr-fpr)); As note above, is sizeof(cpu_hdr-fpr) the same on all chips ? It can differ depending on kernel configuration. +/* restart APIs */ + The restart APIs belong in a separate file: arch/powerpc/mm/restart.c Explain why, please? This isn't a lot of code, and it seems likely that checkpoint and restart paths will share data structures and tend to be modified together over time. + pr_debug(%s: unexpected thread_hdr contents: 0x%lx\n, +__func__, (unsigned long)thread_hdr-unimplemented); Given the macro for 'pr_fmt' in include/linux/checkpoint.h, the use of __func__ is redunant. It seems to me that defining your own pr_fmt in a public header like that is inappropriate, or at least unconventional. Any file that happens to include linux/checkpoint.h will have any prior definitions of pr_fmt overridden, no? + regs = task_pt_regs(current); + *regs = cpu_hdr-pt_regs; + + regs-msr = sanitize_msr(regs-msr); + + /* FP state */ + memcpy(current-thread.fpr, cpu_hdr-fpr, sizeof(current-thread.fpr)); + current-thread.fpscr.val = cpu_hdr-fpscr; + current-thread.fpexc_mode = cpu_hdr-fpexc_mode; + + /* debug registers */ + current-thread.dabr = cpu_hdr-dabr; I'm unfamiliar with powerpc; is it necessary to sanitize any of the registers here ? For instance, can the user cause harm with specially crafted values of some registers ? I had this in mind with the treatment of MSR, but I'll check on the others, thanks. +int cr_read_mm_context(struct cr_ctx *ctx, struct mm_struct *mm, int rparent) +{ + struct cr_hdr_mm_context *mm_hdr; + int ret; + + mm_hdr = cr_hbuf_get(ctx, sizeof(*mm_hdr)); + if (!mm_hdr) + return -ENOMEM; + + ret = cr_read_obj_type(ctx, mm_hdr, sizeof(*mm_hdr), + CR_HDR_MM_CONTEXT); + if (ret != rparent) + goto out; Seems like 'ret' isn't set to an error value if the 'goto' executes. It returns whatever error value cr_read_obj_type() returns. Hrm. I guess if the image is garbage, cr_read_obj_type can potentially return a non-error value that still isn't the desired value, is that right? ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH 1/1] powerpc: Fix partition migration hang under load
Brian King wrote: While testing partition migration with heavy CPU load using shared processors, it was observed that sometimes the migration would never complete and would appear to hang. Currently, the migration code assumes that if H_SUCCESS is returned from the H_JOIN then the migration is complete and the processor is waking up on the target system. If there was an outstanding PROD to the processor when the H_JOIN is called, however, it will return H_SUCCESS on the source system Hmm, did you determine where that outstanding H_PROD is coming from? AFAICT this is the only code which uses that hcall, and all processors should have consumed their prods from one migration before another migration can commence. Regardless, ACK -- if we were to add another H_PROD call site (or if there's one I missed) this would be necessary anyway. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH 1/3] powerpc: bare minimum checkpoint/restart implementation
Oren Laadan wrote: Nathan Lynch wrote: Oren Laadan wrote: Nathan Lynch wrote: + pr_debug(%s: unexpected thread_hdr contents: 0x%lx\n, + __func__, (unsigned long)thread_hdr-unimplemented); Given the macro for 'pr_fmt' in include/linux/checkpoint.h, the use of __func__ is redunant. It seems to me that defining your own pr_fmt in a public header like that is inappropriate, or at least unconventional. Any file that happens to include linux/checkpoint.h will have any prior definitions of pr_fmt overridden, no? Hmmm.. didn't think of it this way. Using the pr_debug() there was yet another feedback from LKML, and it seemed reasonable to me. Can you think of a case where linux/checkpoint.h will happen to be included in checkpoint-related code ? (Assume you meant included in checkpoint-unrelated code) I could see checkpoint.h being included by files that don't exclusively deal with C/R. If you want a uniform debug statement format for C/R-related code, that's fine, but this isn't the way to do it. See the existing users (almost all in drivers/s390). ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH 1/3] powerpc: bare minimum checkpoint/restart implementation
Nathan Lynch n...@pobox.com wrote: Oren Laadan wrote: Nathan Lynch wrote: What doesn't work: * restarting a 32-bit task from a 64-bit task and vice versa Is there a test to bail if we attempt to checkpoint such tasks ? No, but I'll add one if it looks too hard to fix for the next round. Unfortunately, adding a check for this is hard. The point of no return in the restart path is cr_read_mm, which tears down current's address space. cr_read_mm runs way before cr_read_cpu, which is the only restart method I've implemented for powerpc so far. So, checking for this condition in cr_read_cpu is too late if I want restart(2) to return an error and leave the caller's memory map intact. (And I do want this: restart should be as robust as execve.) Well okay then, cr_read_head_arch seems to be the right place in the restart sequence for the architecture code to handle this. However, cr_write_head_arch (which produces the buffer that cr_read_head_arch consumes) is not provided a reference to the task to be checkpointed, nor can it assume that it's operating on current. I need a reference to a task before I can determine whether it's running in 32- or 64-bit mode, or using the FPU, Altivec, SPE, whatever. In any case, mixing 32- and 64-bit tasks across restart is something I eventually want to support, not reject. But the problem I've outlined applies to FPU state and vector extensions (VMX, SPE), as well as sanity-checking debug register (DABR) contents. We'll need to be able to error out gracefully from restart when a checkpoint image specifies a feature unsupported by the current kernel or hardware. But I don't see how to do it with the current architecture. Am I missing something? ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
[RFC/PATCH] powerpc: provide APIs for validating and updating DABR
A checkpointed task image may specify a value for the DABR (Data Access Breakpoint Register). The restart code needs to validate this value before making any changes to the current task. ptrace_set_debugreg encapsulates the bounds checking and platform dependencies of programming the DABR. Split this into validate (debugreg_valid) and update (debugreg_update) functions, and make them available for use outside of the ptrace code. Also ptrace_set_debugreg has extern linkage, but no users outside of ptrace.c. Make it static. Signed-off-by: Nathan Lynch n...@pobox.com --- arch/powerpc/include/asm/ptrace.h |6 +++ arch/powerpc/kernel/ptrace.c | 68 2 files changed, 44 insertions(+), 30 deletions(-) Is something like this okay to carry in the checkpoint/restart patches? diff --git a/arch/powerpc/include/asm/ptrace.h b/arch/powerpc/include/asm/ptrace.h index c9c678f..1b3a5f0 100644 --- a/arch/powerpc/include/asm/ptrace.h +++ b/arch/powerpc/include/asm/ptrace.h @@ -81,6 +81,8 @@ struct pt_regs { #ifndef __ASSEMBLY__ +#include linux/types.h + #define instruction_pointer(regs) ((regs)-nip) #define user_stack_pointer(regs) ((regs)-gpr[1]) #define regs_return_value(regs) ((regs)-gpr[3]) @@ -138,6 +140,10 @@ do { \ extern void user_enable_single_step(struct task_struct *); extern void user_disable_single_step(struct task_struct *); +/* for reprogramming DABR/DAC during restart of a checkpointed task */ +extern bool debugreg_valid(unsigned long val); +extern void debugreg_update(struct task_struct *task, unsigned long val); + #endif /* __ASSEMBLY__ */ #endif /* __KERNEL__ */ diff --git a/arch/powerpc/kernel/ptrace.c b/arch/powerpc/kernel/ptrace.c index 3635be6..60259c6 100644 --- a/arch/powerpc/kernel/ptrace.c +++ b/arch/powerpc/kernel/ptrace.c @@ -735,22 +735,13 @@ void user_disable_single_step(struct task_struct *task) clear_tsk_thread_flag(task, TIF_SINGLESTEP); } -int ptrace_set_debugreg(struct task_struct *task, unsigned long addr, - unsigned long data) +bool debugreg_valid(unsigned long val) { - /* For ppc64 we support one DABR and no IABR's at the moment (ppc64). -* For embedded processors we support one DAC and no IAC's at the -* moment. -*/ - if (addr 0) - return -EINVAL; - /* The bottom 3 bits in dabr are flags */ - if ((data ~0x7UL) = TASK_SIZE) - return -EIO; + if ((val ~0x7UL) = TASK_SIZE) + return false; #ifndef CONFIG_BOOKE - /* For processors using DABR (i.e. 970), the bottom 3 bits are flags. * It was assumed, on previous implementations, that 3 bits were * passed together with the data address, fitting the design of the @@ -764,47 +755,64 @@ int ptrace_set_debugreg(struct task_struct *task, unsigned long addr, */ /* Ensure breakpoint translation bit is set */ - if (data !(data DABR_TRANSLATION)) - return -EIO; - - /* Move contents to the DABR register */ - task-thread.dabr = data; - -#endif -#if defined(CONFIG_BOOKE) - + if (val !(val DABR_TRANSLATION)) + return false; +#else /* As described above, it was assumed 3 bits were passed with the data * address, but we will assume only the mode bits will be passed * as to not cause alignment restrictions for DAC-based processors. */ + /* Read or Write bits must be set */ + if (!(val 0x3UL)) + return -EINVAL; +#endif + return true; +} + +void debugreg_update(struct task_struct *task, unsigned long val) +{ +#ifndef CONFIG_BOOKE + task-thread.dabr = val; +#else /* DAC's hold the whole address without any mode flags */ - task-thread.dabr = data ~0x3UL; + task-thread.dabr = val ~0x3UL; if (task-thread.dabr == 0) { task-thread.dbcr0 = ~(DBSR_DAC1R | DBSR_DAC1W | DBCR0_IDM); task-thread.regs-msr = ~MSR_DE; - return 0; } - /* Read or Write bits must be set */ - - if (!(data 0x3UL)) - return -EINVAL; - /* Set the Internal Debugging flag (IDM bit 1) for the DBCR0 register */ task-thread.dbcr0 = DBCR0_IDM; /* Check for write and read flags and set DBCR0 accordingly */ - if (data 0x1UL) + if (val 0x1UL) task-thread.dbcr0 |= DBSR_DAC1R; - if (data 0x2UL) + if (val 0x2UL) task-thread.dbcr0 |= DBSR_DAC1W; task-thread.regs-msr |= MSR_DE; #endif +} + +static int ptrace_set_debugreg(struct task_struct *task, unsigned long addr, + unsigned long data) +{ + /* For ppc64 we support one DABR and no IABR's at the moment (ppc64
[PATCH 1/3 v2] powerpc: heckpoint/restart implementation
On Tue, 17 Feb 2009 01:03:55 -0600 Nathan Lynch n...@pobox.com wrote: Nathan Lynch n...@pobox.com wrote: Oren Laadan wrote: Nathan Lynch wrote: What doesn't work: * restarting a 32-bit task from a 64-bit task and vice versa Is there a test to bail if we attempt to checkpoint such tasks ? No, but I'll add one if it looks too hard to fix for the next round. Unfortunately, adding a check for this is hard. The point of no return in the restart path is cr_read_mm, which tears down current's address space. cr_read_mm runs way before cr_read_cpu, which is the only restart method I've implemented for powerpc so far. So, checking for this condition in cr_read_cpu is too late if I want restart(2) to return an error and leave the caller's memory map intact. (And I do want this: restart should be as robust as execve.) Well okay then, cr_read_head_arch seems to be the right place in the restart sequence for the architecture code to handle this. However, cr_write_head_arch (which produces the buffer that cr_read_head_arch consumes) is not provided a reference to the task to be checkpointed, nor can it assume that it's operating on current. I need a reference to a task before I can determine whether it's running in 32- or 64-bit mode, or using the FPU, Altivec, SPE, whatever. In any case, mixing 32- and 64-bit tasks across restart is something I eventually want to support, not reject. But the problem I've outlined applies to FPU state and vector extensions (VMX, SPE), as well as sanity-checking debug register (DABR) contents. We'll need to be able to error out gracefully from restart when a checkpoint image specifies a feature unsupported by the current kernel or hardware. But I don't see how to do it with the current architecture. Am I missing something? Anyway, here's what I have coded up in response to all the feedback (thanks!) But all the error/compatibility checking I added doesn't seem that useful unless the above is addressed... Support for checkpointing and restarting GPRs, FPU state, DABR, and Altivec state. The portion of the checkpoint image manipulated by this code begins with a bitmask of features indicating the various contexts saved. Fields in image that can vary depending on kernel configuration (e.g. FP regs due to VSX) have their sizes explicitly recorded, except for GPRS, so migrating between ppc32 and ppc64 won't work yet. The restart code ensures that the task is not modified until the checkpoint image is validated against the current kernel configuration and hardware features (e.g. can't restart a task using Altivec on non-Altivec systems). What works: * self and external checkpoint of simple (single thread, one open file) 32- and 64-bit processes on a ppc64 kernel What doesn't work: * restarting a 32-bit task from a 64-bit task and vice versa Untested: * ppc32 (but it builds) Signed-off-by: Nathan Lynch n...@pobox.com --- This depends on powerpc: provide APIs for validating and updating DABR which I posted to linuxppc-dev on 17 Feb: http://patchwork.ozlabs.org/patch/23311/ v2 changelog: - use feature bitmask in checkpoint image as suggested by Ben - fail restart if checkpoint image specifies unsupported features - handle Altivec/VMX and SPE register state - validate DABR value from checkpoint image - fail restart on differing FP register set sizes (can happen depending on CONFIG_VSX) - fail restart on 32-/64-bit mismatch between image and restarting task - don't write meaningless data in unimplemented arch callbacks - kill cr_hdr_init helper arch/powerpc/include/asm/checkpoint_hdr.h | 15 + arch/powerpc/mm/Makefile |1 + arch/powerpc/mm/checkpoint.c | 482 + 3 files changed, 498 insertions(+), 0 deletions(-) create mode 100644 arch/powerpc/include/asm/checkpoint_hdr.h create mode 100644 arch/powerpc/mm/checkpoint.c diff --git a/arch/powerpc/include/asm/checkpoint_hdr.h b/arch/powerpc/include/asm/checkpoint_hdr.h new file mode 100644 index 000..9f0d099 --- /dev/null +++ b/arch/powerpc/include/asm/checkpoint_hdr.h @@ -0,0 +1,15 @@ +#ifndef __ASM_PPC_CKPT_HDR_H +#define __ASM_PPC_CKPT_HDR_H +/* + * Checkpoint/restart - architecture specific headers ppc + * + * Copyright (C) 2008 Oren Laadan + * + * This file is subject to the terms and conditions of the GNU General Public + * License. See the file COPYING in the main directory of the Linux + * distribution for more details. + */ + +/* nothing to see here */ + +#endif /* __ASM_PPC_CKPT_HDR__H */ diff --git a/arch/powerpc/mm/Makefile b/arch/powerpc/mm/Makefile index e7392b4..8a523a0 100644 --- a/arch/powerpc/mm/Makefile +++ b/arch/powerpc/mm/Makefile @@ -24,3 +24,4 @@ obj-$(CONFIG_NEED_MULTIPLE_NODES) += numa.o obj-$(CONFIG_PPC_MM_SLICES)+= slice.o obj-$(CONFIG_HUGETLB_PAGE) += hugetlbpage.o obj-$(CONFIG_PPC_SUBPAGE_PROT) += subpage-prot.o +obj
Re: [PATCH 1/3] powerpc: bare minimum checkpoint/restart implementation
On Tue, 24 Feb 2009 13:58:26 -0600 Serge E. Hallyn se...@us.ibm.com wrote: Quoting Nathan Lynch (n...@pobox.com): Nathan Lynch n...@pobox.com wrote: Oren Laadan wrote: Nathan Lynch wrote: What doesn't work: * restarting a 32-bit task from a 64-bit task and vice versa Is there a test to bail if we attempt to checkpoint such tasks ? No, but I'll add one if it looks too hard to fix for the next round. Unfortunately, adding a check for this is hard. The point of no return in the restart path is cr_read_mm, which tears down current's address space. cr_read_mm runs way before cr_read_cpu, which is the only restart method I've implemented for powerpc so far. So, checking for this condition in cr_read_cpu is too late if I want restart(2) to return an error and leave the caller's memory map intact. (And I do want this: restart should be as robust as execve.) Well okay then, cr_read_head_arch seems to be the right place in the restart sequence for the architecture code to handle this. However, cr_write_head_arch (which produces the buffer that cr_read_head_arch consumes) is not provided a reference to the task to be checkpointed, nor can it assume that it's operating on current. I need a reference to a task before I can determine whether it's running in 32- or 64-bit mode, or using the FPU, Altivec, SPE, whatever. In any case, mixing 32- and 64-bit tasks across restart is something I eventually want to support, not reject. But the problem I've outlined applies to FPU state and vector extensions (VMX, SPE), as well as sanity-checking debug register (DABR) contents. We'll need to be able to error out gracefully from restart when a checkpoint image specifies a feature unsupported by the current kernel or hardware. But I don't see how to do it with the current architecture. Am I missing something? I suspect I can guess the response to this suggestion, but how about we accept that if sys_restart() fails due to something like this, the task is lost and can't exit gracefully? In the short term it might be necessary. But the restart code should forcibly kill the task instead of returning an error back up to userspace in this case. Once the memory map of the process has been altered, there is no point in allowing it to continue (and likely dump a useless core). Btw, this failure mode seems to apply when cr_read_files() fails, too... But in the long term, things need to be more robust (e.g. restart(2) returns ENOEXEC without messing with current-mm). I think it's worth looking at how execve operates... if I understand correctly, it sets up a new mm_struct disconnected from the current task and activates it at the last moment. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH 1/3] powerpc: bare minimum checkpoint/restart implementation
Oren Laadan or...@cs.columbia.edu wrote: Nathan Lynch wrote: Nathan Lynch n...@pobox.com wrote: Oren Laadan wrote: Nathan Lynch wrote: What doesn't work: * restarting a 32-bit task from a 64-bit task and vice versa Is there a test to bail if we attempt to checkpoint such tasks ? No, but I'll add one if it looks too hard to fix for the next round. Unfortunately, adding a check for this is hard. The point of no return in the restart path is cr_read_mm, which tears down current's address space. cr_read_mm runs way before cr_read_cpu, which is the only restart method I've implemented for powerpc so far. So, checking for this condition in cr_read_cpu is too late if I want restart(2) to return an error and leave the caller's memory map intact. (And I do want this: restart should be as robust as execve.) In the case of restarting a container, I think it's ok if a restarting tasks dies in an ugly way -- this will be observed and handled by the initiating task outside the container, which will gracefully report to the caller/user. How would task exit be observed? Are all tasks in a restarted container guaranteed to be children (in the sense that wait(2) would work) of the initiating task? Even if you close this hole, then any other failure later on during restart - even a failure to allocate kernel memory due to memory pressure, will give that undesired effect that you are trying to avoid. Kernel memory allocation failure is not the kind of problem I'm trying to address. I am trying to address the case of restarting a checkpoint image that needs features that are not present, where the set of features used by the checkpoint image can be compared against the set of features the platform provides. That said, any difference in the architecture that may cause restart to fail is probably best placed in cr_write_head_arch. I think I explained in my earlier mail why the current implementation's cr_write_head_arch doesn't help in this case: Well okay then, cr_read_head_arch seems to be the right place in the restart sequence for the architecture code to handle this. However, cr_write_head_arch (which produces the buffer that cr_read_head_arch consumes) is not provided a reference to the task to be checkpointed, nor can it assume that it's operating on current. I need a reference to a task before I can determine whether it's running in 32- or 64-bit mode, or using the FPU, Altivec, SPE, whatever. In any case, mixing 32- and 64-bit tasks across restart is something I eventually want to support, not reject. But the problem I've outlined applies to FPU state and vector extensions (VMX, SPE), as well as sanity-checking debug register (DABR) contents. We'll need to be able to error out gracefully from restart when a checkpoint image specifies a feature unsupported by the current kernel or hardware. But I don't see how to do it with the current architecture. Am I missing something? More specifically, I envision restart to work like this: 1) user invokes user-land utility (e.g. cr --restart ... 2) 'cr' will create a new container 3) 'cr' will start a child in that container 4) child will create rest of tree (in kernel or in user space - tbd) 5) each task in that tree will restore itself 6) 'cr' monitors this process 7) if all goes well - 'cr' report ok. 8) if something goes bad, 'cr' notices and notifies caller/user Again, how would 'cr' obtain exit status for these tasks, and how would it distinguish failure from normal operation? ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH] Quieten arch/powerpc in a allmodconfig build.
Michael Ellerman mich...@ellerman.id.au wrote: On Wed, 2009-04-08 at 15:51 +1000, Tony Breeds wrote: On Wed, Apr 08, 2009 at 03:08:55PM +1000, Michael Ellerman wrote: The getter routines in here could really multiplex their return values with a negative error code, which I generally prefer, but this works I guess. I was hoping someone would notice and suggest it. tag you're it! I meant we /could/ change them, but we could also leave them, it's a bit of a coin-flip which is better. Nathan might have an opinion? I think I had some reason for doing it this way, but I'm drawing a blank right now. In the meantime, can someone post the warnings that gcc emits for cacheinfo.c, as well as the gcc version? I can't reproduce them with 4.3.2 on Fedora. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH] Quieten arch/powerpc in a allmodconfig build.
Tony Breeds t...@bakeyournoodle.com wrote: On Wed, Apr 08, 2009 at 01:47:36PM -0500, Nathan Lynch wrote: I think I had some reason for doing it this way, but I'm drawing a blank right now. In the meantime, can someone post the warnings that gcc emits for cacheinfo.c, as well as the gcc version? I can't reproduce them with 4.3.2 on Fedora. --- [t...@thor ~]$ egrep cacheinfo tmp/build.log /scratch/tony/working/arch/powerpc/kernel/cacheinfo.c: In function 'associativity_show': /scratch/tony/working/arch/powerpc/kernel/cacheinfo.c:562: warning: 'associativity' may be used uninitialized in this function /scratch/tony/working/arch/powerpc/kernel/cacheinfo.c: In function 'size_show': /scratch/tony/working/arch/powerpc/kernel/cacheinfo.c:513: warning: 'size_kb' may be used uninitialized in this function Thanks. So I think I've convinced myself that the warnings are incorrect and that uninitialized use is not possible. But I find it odd that gcc gives warnings for these sites but not others in the file that use the same idiom (e.g. line_size_show, nr_sets_show). I'd guess that inlining is implicated somehow. Would I be justified in worrying that this version of gcc is generating incorrect code? If not, then I'm fine with the uninitialized_var() changes, but do please include the warnings and the compiler version in the changelog. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH 5/5] proc: export more page flags in /proc/kpageflags
Wu Fengguang fengguang...@intel.com writes: On Wed, Apr 29, 2009 at 05:32:44AM +0800, Andrew Morton wrote: On Tue, 28 Apr 2009 09:09:12 +0800 Wu Fengguang fengguang...@intel.com wrote: +/* + * Kernel flags are exported faithfully to Linus and his fellow hackers. + * Otherwise some details are masked to avoid confusing the end user: + * - some kernel flags are completely invisible + * - some kernel flags are conditionally invisible on their odd usages + */ +#ifdef CONFIG_DEBUG_KERNEL +static inline int genuine_linus(void) { return 1; } Although he's a fine chap, the use of the _linus tag isn't terribly clear (to me). I think what you're saying here is that this enables kernel-developer-only features, yes? Yes. If so, perhaps we could come up with an identifier which expresses that more clearly. But I'd expect that everyone and all distros enable CONFIG_DEBUG_KERNEL for _some_ reason, so what's the point? At the least, it has not always been so... Good point! I can confirm my debian has CONFIG_DEBUG_KERNEL=Y! I can confirm mine does not. etch-i386:~# uname -a Linux etch-i386 2.6.18-6-686 #1 SMP Fri Dec 12 16:48:28 UTC 2008 i686 GNU/Linux etch-i386:~# grep DEBUG_KERNEL /boot/config-2.6.18-6-686 # CONFIG_DEBUG_KERNEL is not set For what that's worth. It is preferable that we always implement the same interface for all Kconfig settings. If this exposes information which is confusing or not useful to end-users then so be it - we should be able to cover that in supporting documentation. My original patch takes that straightforward manner - and I still like it. I would be very glad to move the filtering code from kernel to user space. The use of more obscure flags could be discouraged by _not_ documenting them. A really curious user is encouraged to refer to the code for the exact meaning (and perhaps become a kernel developer ;-) Also, as mentioned in the other email, it would be good if we were to publish a little userspace app which people can use to access this raw data. We could give that application an `--i-am-a-kernel-developer' option! OK. I'll include page-types.c in the next take. +#else +static inline int genuine_linus(void) { return 0; } +#endif This isn't an appropriate use of CONFIG_DEBUG_KERNEL. DEBUG_KERNEL is a Kconfig-only construct which is use to enable _other_ debugging features. The way you've used it here, if the person who is configuring the kernel wants to enable any other completely-unrelated debug feature, they have to enable DEBUG_KERNEL first. But when they do that, they unexpectedly alter the behaviour of pagemap! There are two other places where CONFIG_DEBUG_KERNEL affects code generation in .c files: arch/parisc/mm/init.c and arch/powerpc/kernel/sysfs.c. These are both wrong, and need slapping ;) (add cc to related maintainers) I assume I was cc'd because I've changed arch/powerpc/kernel/sysfs.c a couple of times in the last year, but I can't claim to maintain that code. I'm pretty sure I haven't touched the code in question in this discussion. I've cc'd linuxppc-dev. CONFIG_DEBUG_KERNEL being enabled in distro kernels effectively means #ifdef CONFIG_DEBUG_KERNEL == #if 1 as the following patch demos. Now it becomes obviously silly. Sure, #if 1 is usually silly. But if the point is that DEBUG_KERNEL is not supposed to directly affect code generation, then I see two options for powerpc: - remove the #ifdef CONFIG_DEBUG_KERNEL guards from arch/powerpc/kernel/sysfs.c, unconditionally enabling the hid/ima sysfs attributes, or - define a new config symbol which governs whether those attributes are enabled, and make it depend on DEBUG_KERNEL --- a/arch/powerpc/kernel/sysfs.c +++ b/arch/powerpc/kernel/sysfs.c @@ -212,19 +212,19 @@ static SYSDEV_ATTR(purr, 0600, show_purr, store_purr); #endif /* CONFIG_PPC64 */ #ifdef HAS_PPC_PMC_PA6T SYSFS_PMCSETUP(pa6t_pmc0, SPRN_PA6T_PMC0); SYSFS_PMCSETUP(pa6t_pmc1, SPRN_PA6T_PMC1); SYSFS_PMCSETUP(pa6t_pmc2, SPRN_PA6T_PMC2); SYSFS_PMCSETUP(pa6t_pmc3, SPRN_PA6T_PMC3); SYSFS_PMCSETUP(pa6t_pmc4, SPRN_PA6T_PMC4); SYSFS_PMCSETUP(pa6t_pmc5, SPRN_PA6T_PMC5); -#ifdef CONFIG_DEBUG_KERNEL +#if 1 SYSFS_PMCSETUP(hid0, SPRN_HID0); SYSFS_PMCSETUP(hid1, SPRN_HID1); SYSFS_PMCSETUP(hid4, SPRN_HID4); SYSFS_PMCSETUP(hid5, SPRN_HID5); SYSFS_PMCSETUP(ima0, SPRN_PA6T_IMA0); SYSFS_PMCSETUP(ima1, SPRN_PA6T_IMA1); SYSFS_PMCSETUP(ima2, SPRN_PA6T_IMA2); SYSFS_PMCSETUP(ima3, SPRN_PA6T_IMA3); SYSFS_PMCSETUP(ima4, SPRN_PA6T_IMA4); @@ -282,19 +282,19 @@ static struct sysdev_attribute classic_pmc_attrs[] = { static struct sysdev_attribute pa6t_attrs[] = { _SYSDEV_ATTR(mmcr0, 0600, show_mmcr0, store_mmcr0), _SYSDEV_ATTR(mmcr1, 0600, show_mmcr1, store_mmcr1), _SYSDEV_ATTR(pmc0, 0600, show_pa6t_pmc0, store_pa6t_pmc0), _SYSDEV_ATTR(pmc1, 0600, show_pa6t_pmc1, store_pa6t_pmc1),
Re: Kernel oops while duming user core.
Rune Torgersen wrote: Kumar Gala wrote: Can you git-bisect to narrow this down further. Not easilly, as the board port to arch/powerpc was done on 2.6.24-rc7 and up. Is there an somewhat esy way in git to apply the differences from master branch to our board branch to a branch created by bisect? And I don't even know where this started to happen. Would trying arch/ppc help any? I have our arch/ppc port in a semiworking state for kernels up to 2.6.23 Well, we know this happens on other 32-bit powerpc machines (pmac at least)... perhaps someone could arrange to bisect on a machine that works with older powerpc kernels (assuming they have a good repro case). ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: Kernel oops while duming user core.
Scott Wood wrote: On Thu, Jan 31, 2008 at 11:40:04AM -0600, Rune Torgersen wrote: Unable to handle kernel paging request for data at address 0x48024000 Faulting instruction address: 0xc000f0a0 Oops: Kernel access of bad area, sig: 11 [#1] PREEMPT Innovative Systems ApMax Does it happen without preempt? Modules linked in: drv_wd(P) drv_scc devcom drv_pcir tipc drv_ss7 drv_auxcpu drv_leds(P) drv_ethsw proc_sysinfo(P) i2c_8266(P) NIP: c000f0a0 LR: c0011fec CTR: 0080 REGS: eebe9b70 TRAP: 0300 Tainted: P (2.6.24-test) Does it happen without the modules? I doubt the modules are the problem; there was a practically identical report from someone with an untainted 2.6.24-rc kernel a few weeks ago (see my first reply to Rune). MSR: 9032 EE,ME,IR,DR CR: 24004442 XER: DAR: 48024000, DSISR: 2000 Hmm, this doesn't look like a valid DSISR, so I'm guessing this was a TLB miss that got redirected to DataAccess (or is there something that causes DSRISR[2] to be set on 8280? I didn't see anything in the manual...). However, SRR1 in that case seems to indicate a store, which dcbst shouldn't generate (except on 8xx, according to the comment in update_mmu_cache). Do you have a simple test case that we could try to reproduce? I tried a simple core dump on an 8280, and it worked. Is the crashing program multithreaded? The first report had firefox triggering the oops. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: Kernel oops while duming user core.
Rune Torgersen wrote: Hi I get the following kernel core while a user program I have is dumping core. Any DIeas at what to look for? (this is runnign 2.6.24, arch/powerpc on a 8280) When runnign the program on 2.6.18 arch/ppc, the program gets a sig 11 and dumps core. On 2.6.24, I ghet the kernel oops, and then the program hangs sround forever and is unkillable. Hmm, this is the second report of 2.6.24 crashing in __flush_dcache_icache during a core dump; see: http://ozlabs.org/pipermail/linuxppc-dev/2007-December/048662.html Is this easily recreatable? Unable to handle kernel paging request for data at address 0x48024000 Faulting instruction address: 0xc000ef88 Oops: Kernel access of bad area, sig: 11 [#1] PREEMPT Innovative Systems ApMax Modules linked in: drv_wd(P) drv_scc devcom drv_pcir tipc drv_ss7 drv_auxcpu drv_leds(P) drv_ethsw proc_sysinfo(P) i2c_8266(P) NIP: c000ef88 LR: c0012180 CTR: 0080 REGS: eebc9b70 TRAP: 0300 Tainted: P (2.6.24) MSR: 9032 EE,ME,IR,DR CR: 24004442 XER: DAR: 48024000, DSISR: 2000 TASK = eebac3c0[3131] 'armd' THREAD: eebc8000 GPR00: ee9b7d00 eebc9c20 eebac3c0 48024000 0080 399a4181 48024000 GPR08: 399a4181 ee9b7d00 c200 44004422 10100f38 ee82fc00 bfff GPR16: ef377060 0030 ee9b7d00 eebc9cdc 0011 eebc9cd8 eeb96480 GPR24: ee9b7d00 399a4181 48024000 eeb9a370 eeb9a370 399a4181 48024000 c2733480 NIP [c000ef88] __flush_dcache_icache+0x14/0x40 LR [c0012180] update_mmu_cache+0x74/0x114 Call Trace: [eebc9c20] [eebc8000] 0xeebc8000 (unreliable) [eebc9c40] [c005d060] handle_mm_fault+0x630/0xbc0 [eebc9c80] [c005d9e4] get_user_pages+0x3f4/0x4fc [eebc9cd0] [c00aa7c4] elf_core_dump+0x9a4/0xc5c [eebc9d60] [c00779e4] do_coredump+0x6e0/0x748 [eebc9e50] [c002a5b0] get_signal_to_deliver+0x40c/0x45c [eebc9e80] [c0008ce8] do_signal+0x50/0x294 [eebc9f40] [c000fb98] do_user_signal+0x74/0xc4 --- Exception: 300 at 0x10044efc LR = 0x10044ec0 Instruction dump: 4d820020 7c8903a6 7c001bac 38630020 4200fff8 7c0004ac 4e800020 6000 54630026 38800080 7c8903a6 7c661b78 7c00186c 38630020 4200fff8 7c0004ac ---[ end trace 97db37eaf213da3c ]--- note: armd[3131] exited with preempt_count 2 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH 0/5] ehea checkpatch fixups
Doug Maxey wrote: A small set of fixups for checkpatch. Based on upstream pull from a few hours ago. Er, ehea doesn't even build right now[1], maybe that could get fixed first? :-) [1] http://lkml.org/lkml/2008/1/28/337 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: asm-offsets.c
Christoph Hellwig wrote: On Wed, Feb 06, 2008 at 11:43:41PM -0500, Sean MacLennan wrote: I just did a git pull of Josh's tree, and arch/powerpc/kernel/asm-offsets.c does not compile. I have only been glossing over the linuxppc-dev emails, so forgive me if this already came up. I have the same problem with current powerpc.git. What are the exact error and config? I haven't been able to reproduce this with several defconfigs and toggling CONFIG_HIGH_RES_TIMERS. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: asm-offsets.c
Sean MacLennan wrote: Christoph Hellwig wrote: On Thu, Feb 07, 2008 at 12:49:34AM -0600, Nathan Lynch wrote: Christoph Hellwig wrote: On Wed, Feb 06, 2008 at 11:43:41PM -0500, Sean MacLennan wrote: I just did a git pull of Josh's tree, and arch/powerpc/kernel/asm-offsets.c does not compile. I have only been glossing over the linuxppc-dev emails, so forgive me if this already came up. I have the same problem with current powerpc.git. What are the exact error and config? I haven't been able to reproduce this with several defconfigs and toggling CONFIG_HIGH_RES_TIMERS. Same error. I did a search for CLOCK_REALTIME_RES and it is not defined in any *.[chS] file. CLOCK_REALTIME_RES is defined in include/asm/asm-offsets.h, which is generated from asm-offsets.c as part of the build. Your config builds fine here, so please post the actual error that you are seeing during the build? ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH 1/2][OF] Add of_device_is_disabled function
Josh Boyer wrote: On Sat, 23 Feb 2008 15:58:23 -0600 Josh Boyer [EMAIL PROTECTED] wrote: IEEE 1275 defined a standard status property to indicate the operational status of a device. The property has four possible values: okay, disabled, fail, fail-xxx. The absence of this property means the operational status of the device is unknown or okay. This adds a function called of_device_is_disabled that checks to see if a node has the status property set to disabled. This can be quite useful for devices that may be present but disabled due to pin sharing, etc. Talking with Ben H a bit, he suggested to reverse this API. Basically, create an of_device_is_available that returns 1 if the status property is completely missing, or if it's set to okay or ok. The latter is to cope with some broken firmwares. I agree with Ben's suggestion. The rtas_pci and eeh code could be converted to use this, which gives a net savings of a few bytes with ppc64_defconfig: diff --git a/arch/powerpc/kernel/rtas_pci.c b/arch/powerpc/kernel/rtas_pci.c index 433a0a0..39a752b 100644 --- a/arch/powerpc/kernel/rtas_pci.c +++ b/arch/powerpc/kernel/rtas_pci.c @@ -56,21 +56,6 @@ static inline int config_access_valid(struct pci_dn *dn, int where) return 0; } -static int of_device_available(struct device_node * dn) -{ -const char *status; - -status = of_get_property(dn, status, NULL); - -if (!status) -return 1; - -if (!strcmp(status, okay)) -return 1; - -return 0; -} - int rtas_read_config(struct pci_dn *pdn, int where, int size, u32 *val) { int returnval = -1; @@ -117,7 +102,7 @@ static int rtas_pci_read_config(struct pci_bus *bus, for (dn = busdn-child; dn; dn = dn-sibling) { struct pci_dn *pdn = PCI_DN(dn); if (pdn pdn-devfn == devfn -of_device_available(dn)) +of_device_is_available(dn)) return rtas_read_config(pdn, where, size, val); } @@ -164,7 +149,7 @@ static int rtas_pci_write_config(struct pci_bus *bus, for (dn = busdn-child; dn; dn = dn-sibling) { struct pci_dn *pdn = PCI_DN(dn); if (pdn pdn-devfn == devfn -of_device_available(dn)) +of_device_is_available(dn)) return rtas_write_config(pdn, where, size, val); } return PCIBIOS_DEVICE_NOT_FOUND; diff --git a/arch/powerpc/platforms/pseries/eeh.c b/arch/powerpc/platforms/pseries/eeh.c index 9eb539e..550b2f7 100644 --- a/arch/powerpc/platforms/pseries/eeh.c +++ b/arch/powerpc/platforms/pseries/eeh.c @@ -945,7 +945,6 @@ static void *early_enable_eeh(struct device_node *dn, void *data) unsigned int rets[3]; struct eeh_early_enable_info *info = data; int ret; - const char *status = of_get_property(dn, status, NULL); const u32 *class_code = of_get_property(dn, class-code, NULL); const u32 *vendor_id = of_get_property(dn, vendor-id, NULL); const u32 *device_id = of_get_property(dn, device-id, NULL); @@ -959,8 +958,8 @@ static void *early_enable_eeh(struct device_node *dn, void *data) pdn-eeh_freeze_count = 0; pdn-eeh_false_positives = 0; - if (status strncmp(status, ok, 2) != 0) - return NULL;/* ignore devices with bad status */ + if (!of_device_is_available(dn)) + return NULL; /* Ignore bad nodes. */ if (!class_code || !vendor_id || !device_id) ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: copy_from_user problem
Maynard Johnson wrote: static long lib_addr; module_param(lib_addr, long, 0); Should be unsigned long? ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: copy_from_user problem
Nathan Lynch wrote: Maynard Johnson wrote: static long lib_addr; module_param(lib_addr, long, 0); Should be unsigned long? ulong, rather, but that doesn't fix it. In any case, lib_addr is a user virtual address; doesn't the kernel need to do get_user_pages or some such to get at arbitrary process memory? ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH 3/3] ppc64-specific memory notifier support
Badari Pulavarty wrote: +static struct notifier_block pseries_smp_nb = { Rename this to pseries_mem_nb? + .notifier_call = pseries_memory_notifier, +}; + +static int __init pseries_memory_hotplug_init(void) +{ + if (firmware_has_feature(FW_FEATURE_LPAR)) + pSeries_reconfig_notifier_register(pseries_smp_nb); + + return 0; +} +arch_initcall(pseries_memory_hotplug_init); arch_initcall doesn't seem appropriate. __initcall should be fine. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH 0/3] hotplug memory remove updates
Badari Pulavarty wrote: Hi Paul, Here are the hotplug memory remove updates for 2.6.25-rc2-mm1. How have these been tested? Have you initiated a memory remove operation from the HMC? That's the only way to catch some bugs... ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH 2/2 v2] [POWERPC] Ignore disabled serial ports
Josh Boyer wrote: On Mon, 3 Mar 2008 13:09:25 -0600 Scott Wood [EMAIL PROTECTED] wrote: On Sun, Mar 02, 2008 at 10:43:17PM -0600, Josh Boyer wrote: On Mon, 3 Mar 2008 04:43:42 +0100 Arnd Bergmann [EMAIL PROTECTED] wrote: I wonder whether we should move the check for used-by-rtas into the of_device_is_available function. I understand that used-by-rtas is another way of expressing the idea that the kernel is not supposed to access the specific device. In this case, the device is physically present, but is not available to the OS. I'd rather not at the moment. My intention was to only look at the status property for now. I'd like to avoid this function growing into a huge switch statement for $random_firmware's way of flagging something as don't touch. Better that than having the huge list of tests in every driver... Perhaps. This isn't set in stone. I'd rather get what's in the patch in-tree now and massage it as we go. Otherwise this bike shed will wind up being rainbow colored yet totally useless because it's a never-ending patch rework. I agree. Josh's patch is immediately useful to other code as-is. used-by-rtas is powerpc-specific and doesn't belong in drivers/of IMO. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH 3/3] ppc64-specific memory notifier support
Michael Ellerman wrote: On Thu, 2008-02-28 at 08:46 -0800, Badari Pulavarty wrote: Hotplug memory notifier for ppc64. This gets invoked by writing the device-node that needs to be removed to /proc/ppc64/ofdt. We need to adjust the sections and remove sysfs entries by calling __remove_pages(). Then call arch specific code to get rid of htab mappings for the section of memory. Signed-off-by: Badari Pulavarty [EMAIL PROTECTED] --- arch/powerpc/platforms/pseries/Makefile |1 arch/powerpc/platforms/pseries/hotplug-memory.c | 98 2 files changed, 99 insertions(+) Index: linux-2.6.25-rc2/arch/powerpc/platforms/pseries/hotplug-memory.c === --- /dev/null 1970-01-01 00:00:00.0 + +++ linux-2.6.25-rc2/arch/powerpc/platforms/pseries/hotplug-memory.c 2008-02-28 08:20:14.0 -0800 + +static struct notifier_block pseries_smp_nb = { + .notifier_call = pseries_memory_notifier, +}; + +static int __init pseries_memory_hotplug_init(void) +{ + if (firmware_has_feature(FW_FEATURE_LPAR)) + pSeries_reconfig_notifier_register(pseries_smp_nb); + + return 0; +} +arch_initcall(pseries_memory_hotplug_init); This is going to fire on non-pseries LPAR platforms, like iSeries and PS3. Which is not what you want I think. Well, the notifier will be registered, yes, but it will never be called because that path is reachable only from a write to /proc/ppc64/ofdt, which is not created on non-pseries. Maybe it should be machine_device_initcall(pseries, pseries_memory_hotplug_init); (and pseries_cpu_hotplug_init in hotplug-cpu.c should be changed to machine_arch_initcall) ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
[PATCH 1/3] maple: use platform name in define_machine()
Prevailing practice for define_machine() in powerpc is to use the platform name when the platform has only one define_machine() statement, but maple uses maple_md. This caused me some head-scratching when writing some new code that uses machine_is(maple). Use maple instead of maple_md. There should not be any behavioral change -- fixup_maple_ide() calls machine_is(maple) but the body of the function is ifdef'd out. Signed-off-by: Nathan Lynch [EMAIL PROTECTED] --- arch/powerpc/platforms/maple/setup.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/arch/powerpc/platforms/maple/setup.c b/arch/powerpc/platforms/maple/setup.c index 3ce2d73..3b32e07 100644 --- a/arch/powerpc/platforms/maple/setup.c +++ b/arch/powerpc/platforms/maple/setup.c @@ -319,7 +319,7 @@ static int __init maple_probe(void) return 1; } -define_machine(maple_md) { +define_machine(maple) { .name = Maple, .probe = maple_probe, .setup_arch = maple_setup_arch, -- 1.5.4.4 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
maple: minor updates
Please consider the following mostly-trivial patches for 2.6.26. arch/powerpc/configs/maple_defconfig | 133 -- arch/powerpc/platforms/maple/pci.c | 47 arch/powerpc/platforms/maple/setup.c |2 +- 3 files changed, 129 insertions(+), 53 deletions(-) ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
[PATCH 2/3] maple: kill fixup_maple_ide
This function has been a no-op for about 18 months; it's there in the history should anyone need to resurrect it. Signed-off-by: Nathan Lynch [EMAIL PROTECTED] --- arch/powerpc/platforms/maple/pci.c | 47 1 files changed, 0 insertions(+), 47 deletions(-) diff --git a/arch/powerpc/platforms/maple/pci.c b/arch/powerpc/platforms/maple/pci.c index 3ffa0ac..3018552 100644 --- a/arch/powerpc/platforms/maple/pci.c +++ b/arch/powerpc/platforms/maple/pci.c @@ -592,50 +592,3 @@ int maple_pci_get_legacy_ide_irq(struct pci_dev *pdev, int channel) } return irq; } - -/* XXX: To remove once all firmwares are ok */ -static void fixup_maple_ide(struct pci_dev* dev) -{ - if (!machine_is(maple)) - return; - -#if 0 /* Enable this to enable IDE port 0 */ - { - u8 v; - - pci_read_config_byte(dev, 0x40, v); - v |= 2; - pci_write_config_byte(dev, 0x40, v); - } -#endif -#if 0 /* fix bus master base */ - pci_write_config_dword(dev, 0x20, 0xcc01); - printk(old ide resource: %lx - %lx \n, - dev-resource[4].start, dev-resource[4].end); - dev-resource[4].start = 0xcc00; - dev-resource[4].end = 0xcc10; -#endif -#if 0 /* Enable this to fixup IDE sense/polarity of irqs in IO-APICs */ - { - struct pci_dev *apicdev; - u32 v; - - apicdev = pci_get_slot (dev-bus, PCI_DEVFN(5,0)); - if (apicdev == NULL) - printk(IDE Fixup IRQ: Can't find IO-APIC !\n); - else { - pci_write_config_byte(apicdev, 0xf2, 0x10 + 2*14); - pci_read_config_dword(apicdev, 0xf4, v); - v = ~0x0022; - pci_write_config_dword(apicdev, 0xf4, v); - pci_write_config_byte(apicdev, 0xf2, 0x10 + 2*15); - pci_read_config_dword(apicdev, 0xf4, v); - v = ~0x0022; - pci_write_config_dword(apicdev, 0xf4, v); - pci_dev_put(apicdev); - } - } -#endif -} -DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_8111_IDE, -fixup_maple_ide); -- 1.5.4.4 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
[PATCH 3/3] maple: enable ipr driver in defconfig
Some machines supported by the maple platform have an Obsidian controller which can't be used without enabling CONFIG_IPR and the options on which it depends. Signed-off-by: Nathan Lynch [EMAIL PROTECTED] --- arch/powerpc/configs/maple_defconfig | 133 -- 1 files changed, 128 insertions(+), 5 deletions(-) diff --git a/arch/powerpc/configs/maple_defconfig b/arch/powerpc/configs/maple_defconfig index 8b810d0..25be62b 100644 --- a/arch/powerpc/configs/maple_defconfig +++ b/arch/powerpc/configs/maple_defconfig @@ -1,7 +1,7 @@ # # Automatically generated make config: don't edit # Linux kernel version: 2.6.24-rc4 -# Thu Dec 6 16:48:26 2007 +# Wed Mar 12 16:21:48 2008 # CONFIG_PPC64=y @@ -333,7 +333,7 @@ CONFIG_DEFAULT_TCP_CONG=cubic CONFIG_UEVENT_HELPER_PATH=/sbin/hotplug CONFIG_STANDALONE=y CONFIG_PREVENT_FIRMWARE_BUILD=y -# CONFIG_FW_LOADER is not set +CONFIG_FW_LOADER=y # CONFIG_DEBUG_DRIVER is not set # CONFIG_DEBUG_DEVRES is not set # CONFIG_SYS_HYPERVISOR is not set @@ -374,6 +374,7 @@ CONFIG_BLK_DEV_IDEDISK=y CONFIG_BLK_DEV_IDECD=y # CONFIG_BLK_DEV_IDETAPE is not set # CONFIG_BLK_DEV_IDEFLOPPY is not set +# CONFIG_BLK_DEV_IDESCSI is not set CONFIG_IDE_TASK_IOCTL=y CONFIG_IDE_PROC_FS=y @@ -427,10 +428,129 @@ CONFIG_IDE_ARCH_OBSOLETE_INIT=y # SCSI device support # # CONFIG_RAID_ATTRS is not set -# CONFIG_SCSI is not set -# CONFIG_SCSI_DMA is not set +CONFIG_SCSI=y +CONFIG_SCSI_DMA=y +# CONFIG_SCSI_TGT is not set # CONFIG_SCSI_NETLINK is not set -# CONFIG_ATA is not set +# CONFIG_SCSI_PROC_FS is not set + +# +# SCSI support type (disk, tape, CD-ROM) +# +CONFIG_BLK_DEV_SD=y +# CONFIG_CHR_DEV_ST is not set +# CONFIG_CHR_DEV_OSST is not set +# CONFIG_BLK_DEV_SR is not set +CONFIG_CHR_DEV_SG=y +# CONFIG_CHR_DEV_SCH is not set + +# +# Some SCSI devices (e.g. CD jukebox) support multiple LUNs +# +# CONFIG_SCSI_MULTI_LUN is not set +# CONFIG_SCSI_CONSTANTS is not set +# CONFIG_SCSI_LOGGING is not set +# CONFIG_SCSI_SCAN_ASYNC is not set +CONFIG_SCSI_WAIT_SCAN=m + +# +# SCSI Transports +# +# CONFIG_SCSI_SPI_ATTRS is not set +# CONFIG_SCSI_FC_ATTRS is not set +# CONFIG_SCSI_ISCSI_ATTRS is not set +# CONFIG_SCSI_SAS_LIBSAS is not set +# CONFIG_SCSI_SRP_ATTRS is not set +CONFIG_SCSI_LOWLEVEL=y +# CONFIG_ISCSI_TCP is not set +# CONFIG_BLK_DEV_3W__RAID is not set +# CONFIG_SCSI_3W_9XXX is not set +# CONFIG_SCSI_ACARD is not set +# CONFIG_SCSI_AACRAID is not set +# CONFIG_SCSI_AIC7XXX is not set +# CONFIG_SCSI_AIC7XXX_OLD is not set +# CONFIG_SCSI_AIC79XX is not set +# CONFIG_SCSI_AIC94XX is not set +# CONFIG_SCSI_ARCMSR is not set +# CONFIG_MEGARAID_NEWGEN is not set +# CONFIG_MEGARAID_LEGACY is not set +# CONFIG_MEGARAID_SAS is not set +# CONFIG_SCSI_HPTIOP is not set +# CONFIG_SCSI_DMX3191D is not set +# CONFIG_SCSI_EATA is not set +# CONFIG_SCSI_FUTURE_DOMAIN is not set +# CONFIG_SCSI_GDTH is not set +# CONFIG_SCSI_IPS is not set +# CONFIG_SCSI_INITIO is not set +# CONFIG_SCSI_INIA100 is not set +# CONFIG_SCSI_STEX is not set +# CONFIG_SCSI_SYM53C8XX_2 is not set +CONFIG_SCSI_IPR=y +CONFIG_SCSI_IPR_TRACE=y +CONFIG_SCSI_IPR_DUMP=y +# CONFIG_SCSI_QLOGIC_1280 is not set +# CONFIG_SCSI_QLA_FC is not set +# CONFIG_SCSI_QLA_ISCSI is not set +# CONFIG_SCSI_LPFC is not set +# CONFIG_SCSI_DC395x is not set +# CONFIG_SCSI_DC390T is not set +# CONFIG_SCSI_DEBUG is not set +# CONFIG_SCSI_SRP is not set +CONFIG_ATA=y +CONFIG_ATA_NONSTANDARD=y +# CONFIG_SATA_AHCI is not set +# CONFIG_SATA_SVW is not set +# CONFIG_ATA_PIIX is not set +# CONFIG_SATA_MV is not set +# CONFIG_SATA_NV is not set +# CONFIG_PDC_ADMA is not set +# CONFIG_SATA_QSTOR is not set +# CONFIG_SATA_PROMISE is not set +# CONFIG_SATA_SX4 is not set +# CONFIG_SATA_SIL is not set +# CONFIG_SATA_SIL24 is not set +# CONFIG_SATA_SIS is not set +# CONFIG_SATA_ULI is not set +# CONFIG_SATA_VIA is not set +# CONFIG_SATA_VITESSE is not set +# CONFIG_SATA_INIC162X is not set +# CONFIG_PATA_ALI is not set +# CONFIG_PATA_AMD is not set +# CONFIG_PATA_ARTOP is not set +# CONFIG_PATA_ATIIXP is not set +# CONFIG_PATA_CMD640_PCI is not set +# CONFIG_PATA_CMD64X is not set +# CONFIG_PATA_CS5520 is not set +# CONFIG_PATA_CS5530 is not set +# CONFIG_PATA_CYPRESS is not set +# CONFIG_PATA_EFAR is not set +# CONFIG_ATA_GENERIC is not set +# CONFIG_PATA_HPT366 is not set +# CONFIG_PATA_HPT37X is not set +# CONFIG_PATA_HPT3X2N is not set +# CONFIG_PATA_HPT3X3 is not set +# CONFIG_PATA_IT821X is not set +# CONFIG_PATA_IT8213 is not set +# CONFIG_PATA_JMICRON is not set +# CONFIG_PATA_TRIFLEX is not set +# CONFIG_PATA_MARVELL is not set +# CONFIG_PATA_MPIIX is not set +# CONFIG_PATA_OLDPIIX is not set +# CONFIG_PATA_NETCELL is not set +# CONFIG_PATA_NS87410 is not set +# CONFIG_PATA_NS87415 is not set +# CONFIG_PATA_OPTI is not set +# CONFIG_PATA_OPTIDMA is not set +# CONFIG_PATA_PDC_OLD is not set +# CONFIG_PATA_RADISYS is not set +# CONFIG_PATA_RZ1000 is not set +# CONFIG_PATA_SC1200 is not set +# CONFIG_PATA_SERVERWORKS
[PATCH] convert pci and eeh code to of_device_is_available
A couple of places are duplicating the function of of_device_is_available; convert them to use it. Signed-off-by: Nathan Lynch [EMAIL PROTECTED] --- This depends on Josh Boyer's patch [OF] Add of_device_is_available function: http://patchwork.ozlabs.org/linuxppc/patch?id=17100 arch/powerpc/kernel/rtas_pci.c | 19 ++- arch/powerpc/platforms/pseries/eeh.c |5 ++--- 2 files changed, 4 insertions(+), 20 deletions(-) diff --git a/arch/powerpc/kernel/rtas_pci.c b/arch/powerpc/kernel/rtas_pci.c index 433a0a0..39a752b 100644 --- a/arch/powerpc/kernel/rtas_pci.c +++ b/arch/powerpc/kernel/rtas_pci.c @@ -56,21 +56,6 @@ static inline int config_access_valid(struct pci_dn *dn, int where) return 0; } -static int of_device_available(struct device_node * dn) -{ -const char *status; - -status = of_get_property(dn, status, NULL); - -if (!status) -return 1; - -if (!strcmp(status, okay)) -return 1; - -return 0; -} - int rtas_read_config(struct pci_dn *pdn, int where, int size, u32 *val) { int returnval = -1; @@ -117,7 +102,7 @@ static int rtas_pci_read_config(struct pci_bus *bus, for (dn = busdn-child; dn; dn = dn-sibling) { struct pci_dn *pdn = PCI_DN(dn); if (pdn pdn-devfn == devfn -of_device_available(dn)) +of_device_is_available(dn)) return rtas_read_config(pdn, where, size, val); } @@ -164,7 +149,7 @@ static int rtas_pci_write_config(struct pci_bus *bus, for (dn = busdn-child; dn; dn = dn-sibling) { struct pci_dn *pdn = PCI_DN(dn); if (pdn pdn-devfn == devfn -of_device_available(dn)) +of_device_is_available(dn)) return rtas_write_config(pdn, where, size, val); } return PCIBIOS_DEVICE_NOT_FOUND; diff --git a/arch/powerpc/platforms/pseries/eeh.c b/arch/powerpc/platforms/pseries/eeh.c index 9eb539e..550b2f7 100644 --- a/arch/powerpc/platforms/pseries/eeh.c +++ b/arch/powerpc/platforms/pseries/eeh.c @@ -945,7 +945,6 @@ static void *early_enable_eeh(struct device_node *dn, void *data) unsigned int rets[3]; struct eeh_early_enable_info *info = data; int ret; - const char *status = of_get_property(dn, status, NULL); const u32 *class_code = of_get_property(dn, class-code, NULL); const u32 *vendor_id = of_get_property(dn, vendor-id, NULL); const u32 *device_id = of_get_property(dn, device-id, NULL); @@ -959,8 +958,8 @@ static void *early_enable_eeh(struct device_node *dn, void *data) pdn-eeh_freeze_count = 0; pdn-eeh_false_positives = 0; - if (status strncmp(status, ok, 2) != 0) - return NULL;/* ignore devices with bad status */ + if (!of_device_is_available(dn)) + return NULL; /* Ignore bad nodes. */ if (!class_code || !vendor_id || !device_id) -- 1.5.4.rc5 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH v2] Force 4K pages for IO addresses.
Benjamin Herrenschmidt wrote: On Mon, 2008-03-17 at 19:34 -0500, Olof Johansson wrote: On Mon, Mar 17, 2008 at 02:54:19PM +1100, Tony Breeds wrote: Currently HEA requires 4K pages for IO resources. Just set the pages size to IO to 4K. Well, that's too bad. Why penalize all platforms for it? I.e.: Nack, we use 64K iopages on pa6t and it works well. No need to waste tlb and erat space. We would have to make that pSeries specific for now I suppose... We don't have a way to know that there can be an EHEA right ? It may not be in the device-tree at boot and dynamically added to the partition... The ibm,drc-names property of the root node should have HEA strings in it on systems where EHEA can potentially be present. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH 2/2] Force 4K IOPages when eHEA is present in the machine.
Tony Breeds wrote: +#if defined(CONFIG_PPC_PSERIES) defined(CONFIG_PPC_64K_PAGES) +static int __init scan_dt_for_ehea(unsigned long node, const char *uname, + int depth, void *data) +{ + if (depth != 0) + return 0; + + if (of_flat_dt_search(node, HEA , ibm,drc-names)) + return 1; + + return 0; +} +#endif Searching ibm,drc-names is necessary, but is it sufficient? That is, can there be a system that has an HEA adapter without that property being present? Thinking in particular about single-partition systems that run without an HMC attached. static void __init htab_init_page_sizes(void) { int rc; +#ifdef CONFIG_PPC_64K_PAGES + int has_ehea = 0; +#endif /* Default to 4K pages only */ memcpy(mmu_psize_defs, mmu_psize_defaults_old, sizeof(mmu_psize_defaults_old)); +#if defined(CONFIG_PPC_PSERIES) defined(CONFIG_PPC_64K_PAGES) + /* Scan to see if this system can have an EHEA, if so we'll + * demote io_psize to 4K */ + has_ehea = of_scan_flat_dt(scan_dt_for_ehea, NULL); +#endif I'm wondering if some of the ifdef stuff could be avoided if you used a firmware feature bit to signify the limitation (or the lack of it). The bit could be set during pSeries_probe. Just an idea; I don't feel that strongly about it. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
[PATCH] scanlog_init cleanup, minor fixes
scanlog_init() could use some love. * properly return -ENODEV if this system doesn't support scan-log-dump * don't printk if scan-log-dump not present; only older systems have it * convert from create_proc_entry() to preferred proc_create() * allocate zeroed data buffer * fix potential memory leak of ent-data on failed create_proc_entry() * simplify control flow Signed-off-by: Nathan Lynch [EMAIL PROTECTED] --- (not urgent, please consider for 2.6.26) arch/powerpc/platforms/pseries/scanlog.c | 37 ++--- 1 files changed, 18 insertions(+), 19 deletions(-) diff --git a/arch/powerpc/platforms/pseries/scanlog.c b/arch/powerpc/platforms/pseries/scanlog.c index 8e1ef16..e5b0ea8 100644 --- a/arch/powerpc/platforms/pseries/scanlog.c +++ b/arch/powerpc/platforms/pseries/scanlog.c @@ -195,31 +195,30 @@ const struct file_operations scanlog_fops = { static int __init scanlog_init(void) { struct proc_dir_entry *ent; + void *data; + int err = -ENOMEM; ibm_scan_log_dump = rtas_token(ibm,scan-log-dump); - if (ibm_scan_log_dump == RTAS_UNKNOWN_SERVICE) { - printk(KERN_ERR scan-log-dump not implemented on this system\n); - return -EIO; - } + if (ibm_scan_log_dump == RTAS_UNKNOWN_SERVICE) + return -ENODEV; -ent = create_proc_entry(ppc64/rtas/scan-log-dump, S_IRUSR, NULL); - if (ent) { - ent-proc_fops = scanlog_fops; - /* Ideally we could allocate a buffer 4G */ - ent-data = kmalloc(RTAS_DATA_BUF_SIZE, GFP_KERNEL); - if (!ent-data) { - printk(KERN_ERR Failed to allocate a buffer\n); - remove_proc_entry(scan-log-dump, ent-parent); - return -ENOMEM; - } - ((unsigned int *)ent-data)[0] = 0; - } else { - printk(KERN_ERR Failed to create ppc64/scan-log-dump proc entry\n); - return -EIO; - } + /* Ideally we could allocate a buffer 4G */ + data = kzalloc(RTAS_DATA_BUF_SIZE, GFP_KERNEL); + if (!data) + goto err; + + ent = proc_create(ppc64/rtas/scan-log-dump, S_IRUSR, NULL, + scanlog_fops); + if (!ent) + goto err; + + ent-data = data; proc_ppc64_scan_log_dump = ent; return 0; +err: + kfree(data); + return err; } static void __exit scanlog_cleanup(void) -- 1.5.4.rc5 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH] RTAS - adapt procfs interface
Jens Osterkamp wrote: Handling of the proc_dir_entry-count has being changed in 2.6.24-rc5. Do you know which commit caused the change? After this change the default value for pde-count is 1 and not 0 as it was in earlier kernels. Therefore, if we want to check wether our procfs file is already opened (already in use), we have to check if pde-count is not greater than 2 but not 1. Signed-off-by: Maxim Shchetynin [EMAIL PROTECTED] Signed-off-by: Jens Osterkamp [EMAIL PROTECTED] Index: linux-2.6/arch/powerpc/kernel/rtas_flash.c === --- linux-2.6.orig/arch/powerpc/kernel/rtas_flash.c +++ linux-2.6/arch/powerpc/kernel/rtas_flash.c @@ -356,7 +356,7 @@ static int rtas_excl_open(struct inode * /* Enforce exclusive open with use count of PDE */ spin_lock(flash_file_open_lock); - if (atomic_read(dp-count) 1) { + if (atomic_read(dp-count) 2) { spin_unlock(flash_file_open_lock); return -EBUSY; } One could argue that the real problem is using the proc_dir_entry's reference count to enforce exclusive open. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH] RTAS - adapt procfs interface
Nathan Lynch wrote: One could argue that the real problem is using the proc_dir_entry's reference count to enforce exclusive open. I think this is better... the way these files are used is lame, but this should preserve the existing behavior. I haven't yet tested this, can you? diff --git a/arch/powerpc/kernel/rtas_flash.c b/arch/powerpc/kernel/rtas_flash.c index f227659..00bc308 100644 --- a/arch/powerpc/kernel/rtas_flash.c +++ b/arch/powerpc/kernel/rtas_flash.c @@ -139,7 +139,7 @@ struct rtas_validate_flash_t unsigned int update_results;/* Update results token */ }; -static DEFINE_SPINLOCK(flash_file_open_lock); +static atomic_t open_count = ATOMIC_INIT(0); static struct proc_dir_entry *firmware_flash_pde; static struct proc_dir_entry *firmware_update_pde; static struct proc_dir_entry *validate_pde; @@ -216,7 +216,7 @@ static int rtas_flash_release(struct inode *inode, struct file *file) uf-flist = NULL; } - atomic_dec(dp-count); + atomic_dec(open_count); return 0; } @@ -352,26 +352,17 @@ static ssize_t rtas_flash_write(struct file *file, const char __user *buffer, static int rtas_excl_open(struct inode *inode, struct file *file) { - struct proc_dir_entry *dp = PDE(inode); - - /* Enforce exclusive open with use count of PDE */ - spin_lock(flash_file_open_lock); - if (atomic_read(dp-count) 1) { - spin_unlock(flash_file_open_lock); + if (atomic_inc_return(open_count) 1) { + atomic_dec(open_count); return -EBUSY; } - atomic_inc(dp-count); - spin_unlock(flash_file_open_lock); - return 0; } static int rtas_excl_release(struct inode *inode, struct file *file) { - struct proc_dir_entry *dp = PDE(inode); - - atomic_dec(dp-count); + atomic_dec(open_count); return 0; } @@ -580,7 +571,7 @@ static int validate_flash_release(struct inode *inode, struct file *file) } /* The matching atomic_inc was in rtas_excl_open() */ - atomic_dec(dp-count); + atomic_dec(open_count); return 0; } ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH] RTAS - adapt procfs interface
Paul Mackerras wrote: Nathan Lynch writes: I think this is better... the way these files are used is lame, but this should preserve the existing behavior. I haven't yet tested this, can you? Looks OK -- can I have a proper patch description and a signed-off-by line for this please? Actually, my patch has the potentially undesirable consequence of allowing only one of the three flash-related proc files to be open at any time, whereas the previous behavior enforced exclusive open on a per-file basis. If you want something for 2.6.25, I think the patch Jens posted is of lower risk. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH 1/1 resend][PPC] replace logical by bitand in pci_process_ISA_OF_ranges(), arch/powerpc/kernel/isa-bridge.c
Roel Kluin wrote: replace logical by bit and for ISA_SPACE_MASK I think Paul has picked this up now: http://ozlabs.org/pipermail/linuxppc-dev/2008-April/054103.html ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
do section mismatch check on full vmlinux breaks powerpc build
Hello- 2.6.23-rc1 breaks the build for 64-bit powerpc for me (using maple_defconfig): LD vmlinux.o powerpc64-unknown-linux-gnu-ld: dynreloc miscount for kernel/built-in.o, section .opd powerpc64-unknown-linux-gnu-ld: can not edit opd Bad value make: *** [vmlinux.o] Error 1 This is on a i386 host with: powerpc64-unknown-linux-gnu-gcc (GCC) 4.1.2 GNU ld version 2.16.1 Reverting the following commit fixes it: commit 741f98fe298a73c9d47ed53703c1279a29718581 Author: Sam Ravnborg [EMAIL PROTECTED] Date: Tue Jul 17 10:54:06 2007 +0200 kbuild: do section mismatch check on full vmlinux However, I see a possibly related binutils patch: http://article.gmane.org/gmane.comp.gnu.binutils/33650 Will there be a kbuild fix for this or should I update my binutils? ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: do section mismatch check on full vmlinux breaks powerpc build
Hi Sam- Sam Ravnborg wrote: On Tue, Jul 24, 2007 at 05:41:05PM -0500, Nathan Lynch wrote: 2.6.23-rc1 breaks the build for 64-bit powerpc for me (using maple_defconfig): LD vmlinux.o powerpc64-unknown-linux-gnu-ld: dynreloc miscount for kernel/built-in.o, section .opd powerpc64-unknown-linux-gnu-ld: can not edit opd Bad value make: *** [vmlinux.o] Error 1 I narrowed it down to the following change to avoid the breakage: diff --git a/include/linux/mm.h b/include/linux/mm.h index c456c3a..2ea222f 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1246,7 +1246,7 @@ void drop_slab(void); extern int randomize_va_space; #endif -__attribute__((weak)) const char *arch_vma_name(struct vm_area_struct *vma); +//__attribute__((weak)) const char *arch_vma_name(struct vm_area_struct *vma); #endif /* __KERNEL__ */ #endif /* _LINUX_MM_H */ So seems that something goes a bit fishy when using weak symbols and this trigges a binutils bug. The above line was introdused in the following commit: commit f269fdd1829acc5e53bf57b145003e5733133f2b Author: David Howells [EMAIL PROTECTED] Date: Wed Sep 27 01:50:23 2006 -0700 [PATCH] NOMMU: move the fallback arch_vma_name() to a sensible place Thanks for looking into this. Removing the __attribute__((weak)) from arch_vma_name's declaration in linux/mm.h unbreaks the build for me. Maybe it shouldn't matter, but it seems unusual to have the weak attribute specified at the function's declaration. I wasn't able to find any examples of that for other weak functions in the kernel (e.g. sched_clock). ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH 6/6] pseries: eliminate global var
Linas Vepstas wrote: --- linux-2.6.22-git2.orig/include/asm-powerpc/nvram.h2007-07-08 18:32:17.0 -0500 +++ linux-2.6.22-git2/include/asm-powerpc/nvram.h 2007-08-08 13:34:27.0 -0500 @@ -62,14 +62,19 @@ struct nvram_partition { unsigned int index; }; - -extern int nvram_write_error_log(char * buff, int length, unsigned int err_type); -extern int nvram_read_error_log(char * buff, int length, unsigned int * err_type); -extern int nvram_clear_error_log(void); extern struct nvram_partition *nvram_find_partition(int sig, const char *name); -extern int pSeries_nvram_init(void); extern int mmio_nvram_init(void); + +#ifdef CONFIG_PPC_PSERIES +extern int pSeries_nvram_init(void); +extern int nvram_write_error_log(char * buff, int length, + unsigned int err_type, unsigned int err_seq); +extern int nvram_read_error_log(char * buff, int length, + unsigned int * err_type, unsigned int *err_seq); +extern int nvram_clear_error_log(void); +#endif /* CONFIG_PPC_PSERIES */ Declarations need not be #ifdef'd. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH 1/6] pseries: avoid excess rtas calls
Linas Vepstas wrote: We don't need to look up the rtas event token once per cpu per second. This avoids some misc string ops and rtas calls and provides some minor performance improvement. It does not avoid any calls to RTAS. (rtas_token merely looks up properties under the /rtas node.) Otherwise, looks okay. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH 6/6] pseries: eliminate global var
Linas Vepstas wrote: On Wed, Aug 08, 2007 at 04:57:14PM -0500, Nathan Lynch wrote: Linas Vepstas wrote: + +#ifdef CONFIG_PPC_PSERIES +extern int pSeries_nvram_init(void); +extern int nvram_write_error_log(char * buff, int length, + unsigned int err_type, unsigned int err_seq); +extern int nvram_read_error_log(char * buff, int length, + unsigned int * err_type, unsigned int *err_seq); +extern int nvram_clear_error_log(void); +#endif /* CONFIG_PPC_PSERIES */ Declarations need not be #ifdef'd. Ah, I thought that would be cleaner ... should I resubmit, the patch, or does it matter that much? FWIW I'd prefer you drop that hunk and resubmit; it's needless churn that can only cause build problems. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [RFC/PATCH] remove gratuitous reads from maple pci config space methods
Benjamin Herrenschmidt wrote: On Wed, 2007-08-08 at 19:50 -0500, Nathan Lynch wrote: Remove the gratuitous reads from u3_agp_write_config, u3_ht_write_config, and u4_pcie_write_config. Signed-off-by: Nathan Lynch [EMAIL PROTECTED] Acked-by: Benjamin Herrenschmidt [EMAIL PROTECTED] --- Thanks ! Care to fix powermac too ? :-) Sure, I'll get it tomorrow... looks like pasemi cribbed the powermac code too :) ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [RFC/PATCH] remove gratuitous reads from maple pci config space methods
David Gibson wrote: On Wed, Aug 08, 2007 at 07:50:44PM -0500, Nathan Lynch wrote: The maple pci configuration space write methods read the written location immediately after the write is performed, presumably in order to flush the write. However, configuration space writes are not allowed to be posted, making these reads gratuitous. It might be worth checking that there isn't a particular reason for these. Just because posting writes are forbidden doesn't mean a particular bridge won't screw it up... Well, I had already checked with Ben, who wrote the code, and my understanding is that the reads are intended to work around some misbehaving Apple bridges, but that a sync after the write (implied by releasing pci_lock in the generic pci code) should suffice for those. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
[PATCH 01/11] rtas_pci_ops: use named structure member initializers
Signed-off-by: Nathan Lynch [EMAIL PROTECTED] --- arch/powerpc/kernel/rtas_pci.c |4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/kernel/rtas_pci.c b/arch/powerpc/kernel/rtas_pci.c index a5de621..21f14e5 100644 --- a/arch/powerpc/kernel/rtas_pci.c +++ b/arch/powerpc/kernel/rtas_pci.c @@ -171,8 +171,8 @@ static int rtas_pci_write_config(struct pci_bus *bus, } struct pci_ops rtas_pci_ops = { - rtas_pci_read_config, - rtas_pci_write_config + .read = rtas_pci_read_config, + .write = rtas_pci_write_config, }; int is_python(struct device_node *dev) -- 1.5.2.4 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
[PATCH 02/11] celleb_fake_pci_ops: use named structure member initializers
Signed-off-by: Nathan Lynch [EMAIL PROTECTED] --- arch/powerpc/platforms/celleb/pci.c |4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/platforms/celleb/pci.c b/arch/powerpc/platforms/celleb/pci.c index e9ac19c..de8038b 100644 --- a/arch/powerpc/platforms/celleb/pci.c +++ b/arch/powerpc/platforms/celleb/pci.c @@ -242,8 +242,8 @@ static int celleb_fake_pci_write_config(struct pci_bus *bus, } static struct pci_ops celleb_fake_pci_ops = { - celleb_fake_pci_read_config, - celleb_fake_pci_write_config + .read = celleb_fake_pci_read_config, + .write = celleb_fake_pci_write_config, }; static inline void celleb_setup_pci_base_addrs(struct pci_controller *hose, -- 1.5.2.4 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
[PATCH 03/11] celleb_epci_ops: use named structure member initializers
Signed-off-by: Nathan Lynch [EMAIL PROTECTED] --- arch/powerpc/platforms/celleb/scc_epci.c |4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/platforms/celleb/scc_epci.c b/arch/powerpc/platforms/celleb/scc_epci.c index c4b0110..7506acc 100644 --- a/arch/powerpc/platforms/celleb/scc_epci.c +++ b/arch/powerpc/platforms/celleb/scc_epci.c @@ -278,8 +278,8 @@ static int celleb_epci_write_config(struct pci_bus *bus, } struct pci_ops celleb_epci_ops = { - celleb_epci_read_config, - celleb_epci_write_config, + .read = celleb_epci_read_config, + .write = celleb_epci_write_config, }; /* to be moved in FW */ -- 1.5.2.4 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
[PATCH 04/11] maple pci_ops: use named structure member initializers
Signed-off-by: Nathan Lynch [EMAIL PROTECTED] --- arch/powerpc/platforms/maple/pci.c | 12 ++-- 1 files changed, 6 insertions(+), 6 deletions(-) diff --git a/arch/powerpc/platforms/maple/pci.c b/arch/powerpc/platforms/maple/pci.c index 2542403..6247f94 100644 --- a/arch/powerpc/platforms/maple/pci.c +++ b/arch/powerpc/platforms/maple/pci.c @@ -185,8 +185,8 @@ static int u3_agp_write_config(struct pci_bus *bus, unsigned int devfn, static struct pci_ops u3_agp_pci_ops = { - u3_agp_read_config, - u3_agp_write_config + .read = u3_agp_read_config, + .write = u3_agp_write_config, }; static unsigned long u3_ht_cfa0(u8 devfn, u8 off) @@ -284,8 +284,8 @@ static int u3_ht_write_config(struct pci_bus *bus, unsigned int devfn, static struct pci_ops u3_ht_pci_ops = { - u3_ht_read_config, - u3_ht_write_config + .read = u3_ht_read_config, + .write = u3_ht_write_config, }; static unsigned int u4_pcie_cfa0(unsigned int devfn, unsigned int off) @@ -392,8 +392,8 @@ static int u4_pcie_write_config(struct pci_bus *bus, unsigned int devfn, static struct pci_ops u4_pcie_pci_ops = { -u4_pcie_read_config, -u4_pcie_write_config + .read = u4_pcie_read_config, + .write = u4_pcie_write_config, }; static void __init setup_u3_agp(struct pci_controller* hose) -- 1.5.2.4 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
[PATCH 05/11] pa_pxp_ops: use named structure member initializers
Signed-off-by: Nathan Lynch [EMAIL PROTECTED] --- arch/powerpc/platforms/pasemi/pci.c |4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/platforms/pasemi/pci.c b/arch/powerpc/platforms/pasemi/pci.c index ab1f5f6..882b571 100644 --- a/arch/powerpc/platforms/pasemi/pci.c +++ b/arch/powerpc/platforms/pasemi/pci.c @@ -122,8 +122,8 @@ static int pa_pxp_write_config(struct pci_bus *bus, unsigned int devfn, } static struct pci_ops pa_pxp_ops = { - pa_pxp_read_config, - pa_pxp_write_config, + .read = pa_pxp_read_config, + .write = pa_pxp_write_config, }; static void __init setup_pa_pxp(struct pci_controller *hose) -- 1.5.2.4 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
[PATCH 06/11] powermac pci_ops: use named structure member initializers
Signed-off-by: Nathan Lynch [EMAIL PROTECTED] --- arch/powerpc/platforms/powermac/pci.c | 16 1 files changed, 8 insertions(+), 8 deletions(-) diff --git a/arch/powerpc/platforms/powermac/pci.c b/arch/powerpc/platforms/powermac/pci.c index 92586db..69d67ff 100644 --- a/arch/powerpc/platforms/powermac/pci.c +++ b/arch/powerpc/platforms/powermac/pci.c @@ -225,8 +225,8 @@ static int macrisc_write_config(struct pci_bus *bus, unsigned int devfn, static struct pci_ops macrisc_pci_ops = { - macrisc_read_config, - macrisc_write_config + .read = macrisc_read_config, + .write = macrisc_write_config, }; #ifdef CONFIG_PPC32 @@ -280,8 +280,8 @@ chaos_write_config(struct pci_bus *bus, unsigned int devfn, int offset, static struct pci_ops chaos_pci_ops = { - chaos_read_config, - chaos_write_config + .read = chaos_read_config, + .write = chaos_write_config, }; static void __init setup_chaos(struct pci_controller *hose, @@ -456,8 +456,8 @@ static int u3_ht_write_config(struct pci_bus *bus, unsigned int devfn, static struct pci_ops u3_ht_pci_ops = { - u3_ht_read_config, - u3_ht_write_config + .read = u3_ht_read_config, + .write = u3_ht_write_config, }; #define U4_PCIE_CFA0(devfn, off) \ @@ -561,8 +561,8 @@ static int u4_pcie_write_config(struct pci_bus *bus, unsigned int devfn, static struct pci_ops u4_pcie_pci_ops = { - u4_pcie_read_config, - u4_pcie_write_config + .read = u4_pcie_read_config, + .write = u4_pcie_write_config, }; #endif /* CONFIG_PPC64 */ -- 1.5.2.4 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
[PATCH 07/11] null_pci_ops: use named structure member initializers
Signed-off-by: Nathan Lynch [EMAIL PROTECTED] --- arch/powerpc/kernel/pci_32.c |4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/kernel/pci_32.c b/arch/powerpc/kernel/pci_32.c index 04a3109..0e2bee4 100644 --- a/arch/powerpc/kernel/pci_32.c +++ b/arch/powerpc/kernel/pci_32.c @@ -1457,8 +1457,8 @@ null_write_config(struct pci_bus *bus, unsigned int devfn, int offset, static struct pci_ops null_pci_ops = { - null_read_config, - null_write_config + .read = null_read_config, + .write = null_write_config, }; /* -- 1.5.2.4 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
[PATCH 08/11] efika rtas_pci_ops: use named structure member initializers
Signed-off-by: Nathan Lynch [EMAIL PROTECTED] --- arch/powerpc/platforms/52xx/efika.c |4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/platforms/52xx/efika.c b/arch/powerpc/platforms/52xx/efika.c index 4be6e7a..4263158 100644 --- a/arch/powerpc/platforms/52xx/efika.c +++ b/arch/powerpc/platforms/52xx/efika.c @@ -78,8 +78,8 @@ static int rtas_write_config(struct pci_bus *bus, unsigned int devfn, } static struct pci_ops rtas_pci_ops = { - rtas_read_config, - rtas_write_config + .read = rtas_read_config, + .write = rtas_write_config, }; -- 1.5.2.4 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
[PATCH 09/11] chrp pci_ops: use named structure member initializers
Signed-off-by: Nathan Lynch [EMAIL PROTECTED] --- arch/powerpc/platforms/chrp/pci.c |8 1 files changed, 4 insertions(+), 4 deletions(-) diff --git a/arch/powerpc/platforms/chrp/pci.c b/arch/powerpc/platforms/chrp/pci.c index 28d1647..0c6dba9 100644 --- a/arch/powerpc/platforms/chrp/pci.c +++ b/arch/powerpc/platforms/chrp/pci.c @@ -86,8 +86,8 @@ int gg2_write_config(struct pci_bus *bus, unsigned int devfn, int off, static struct pci_ops gg2_pci_ops = { - gg2_read_config, - gg2_write_config + .read = gg2_read_config, + .write = gg2_write_config, }; /* @@ -124,8 +124,8 @@ int rtas_write_config(struct pci_bus *bus, unsigned int devfn, int offset, static struct pci_ops rtas_pci_ops = { - rtas_read_config, - rtas_write_config + .read = rtas_read_config, + .write = rtas_write_config, }; volatile struct Hydra __iomem *Hydra = NULL; -- 1.5.2.4 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
[PATCH 11/11] tsi108_direct_pci_ops: use named structure member initializers
Signed-off-by: Nathan Lynch [EMAIL PROTECTED] --- arch/powerpc/sysdev/tsi108_pci.c |4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/sysdev/tsi108_pci.c b/arch/powerpc/sysdev/tsi108_pci.c index 90db8a7..cf0bfbd 100644 --- a/arch/powerpc/sysdev/tsi108_pci.c +++ b/arch/powerpc/sysdev/tsi108_pci.c @@ -193,8 +193,8 @@ void tsi108_clear_pci_cfg_error(void) } static struct pci_ops tsi108_direct_pci_ops = { - tsi108_direct_read_config, - tsi108_direct_write_config + .read = tsi108_direct_read_config, + .write = tsi108_direct_write_config, }; int __init tsi108_setup_pci(struct device_node *dev, u32 cfg_phys, int primary) -- 1.5.2.4 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [RFC/PATCH] remove gratuitous reads from maple pci config space methods
Segher Boessenkool wrote: It might be worth checking that there isn't a particular reason for these. Just because posting writes are forbidden doesn't mean a particular bridge won't screw it up... Well, I had already checked with Ben, who wrote the code, and my understanding is that the reads are intended to work around some misbehaving Apple bridges, None of the PCI interfaces on the U3 or U4 bridges have that problem as far as I know. I think the workaround was copied from code for older Apple bridges? Okay, then the change should be fine for maple. but that a sync after the write (implied by releasing pci_lock in the generic pci code) should suffice for those. I don't see how a sync could help here at all, not more than an eieio anyway? Alright, well, maybe take it up with Ben when I post the patch for powermac, since that's where it could actually matter. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
[PATCH] remove gratuitous reads from powermac pci config space methods
The powermac pci configuration space write methods read the written location immediately after the write is performed, presumably in order to flush the write. However, configuration space writes are not allowed to be posted, making these reads gratuitous. Furthermore, this behavior potentially causes us to violate the PCI PM spec when changing between e.g. D0 and D3 states, because a delay of up to 10ms may be required before the OS accesses configuration space after the write which initiates the transition. Remove the unnecessary reads from macrisc_write_config, u3_ht_write_config, and u4_pcie_write_config. Signed-off-by: Nathan Lynch [EMAIL PROTECTED] --- arch/powerpc/platforms/powermac/pci.c |9 - 1 files changed, 0 insertions(+), 9 deletions(-) diff --git a/arch/powerpc/platforms/powermac/pci.c b/arch/powerpc/platforms/powermac/pci.c index 92586db..bf1f5d1 100644 --- a/arch/powerpc/platforms/powermac/pci.c +++ b/arch/powerpc/platforms/powermac/pci.c @@ -209,15 +209,12 @@ static int macrisc_write_config(struct pci_bus *bus, unsigned int devfn, switch (len) { case 1: out_8(addr, val); - (void) in_8(addr); break; case 2: out_le16(addr, val); - (void) in_le16(addr); break; default: out_le32(addr, val); - (void) in_le32(addr); break; } return PCIBIOS_SUCCESSFUL; @@ -440,15 +437,12 @@ static int u3_ht_write_config(struct pci_bus *bus, unsigned int devfn, switch (len) { case 1: out_8(addr, val); - (void) in_8(addr); break; case 2: out_le16(addr, val); - (void) in_le16(addr); break; default: out_le32((u32 __iomem *)addr, val); - (void) in_le32(addr); break; } return PCIBIOS_SUCCESSFUL; @@ -545,15 +539,12 @@ static int u4_pcie_write_config(struct pci_bus *bus, unsigned int devfn, switch (len) { case 1: out_8(addr, val); - (void) in_8(addr); break; case 2: out_le16(addr, val); - (void) in_le16(addr); break; default: out_le32(addr, val); - (void) in_le32(addr); break; } return PCIBIOS_SUCCESSFUL; -- 1.5.2.4 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH 2.6.23] ibmebus: Prevent bus_id collisions
Hi Joachim- Joachim Fenkes wrote: Nathan Lynch [EMAIL PROTECTED] wrote on 29.08.2007 20:12:32: Will anything break? Nope. Userspace programs should not depend on ibmebus' way of naming the devices; especially since some overly long loc_codes tended to be truncated and thus rendered useless. I have tested IBM's DLPAR tools against the changed kernel, and they didn't break. Okay. Also, I dislike this approach of duplicating the firmware device tree path in sysfs. Why? Any specific reasons for your dislike? struct device's bus_id field is but 20 bytes in size. Too close for comfort? Are GX/ibmebus devices guaranteed to be children of the same node in the OF device tree? If so, their unit addresses will be unique, and therefore suitable values for bus_id. I believe this is what the powerpc vio bus code does. While there's no such guarantee (as in officially signed document), yes, I expect future GX devices to also appear beneath the OFDT root node. For the existing devices, the unit addresses are already part of the device name, so I save the need to use sprintf() again. Plus, I rather like using the full_name since it also contains a descriptive name as opposed to being just nondescript numbers, helping the layman (ie user) to make sense out of a dev_id. Okay, but your layman isn't supposed to be relying on any user-friendly properties of the name :) Hope he doesn't work on a distro installer. Anyway, if you're still confident in this approach, I relent. :) ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH 08/12] IB/ehca: Replace get_paca()-paca_index by the more portable smp_processor_id()
Hi, Joachim Fenkes wrote: Signed-off-by: Joachim Fenkes [EMAIL PROTECTED] --- drivers/infiniband/hw/ehca/ehca_tools.h | 14 +++--- 1 files changed, 7 insertions(+), 7 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_tools.h b/drivers/infiniband/hw/ehca/ehca_tools.h index f9b264b..863f972 100644 --- a/drivers/infiniband/hw/ehca/ehca_tools.h +++ b/drivers/infiniband/hw/ehca/ehca_tools.h @@ -73,37 +73,37 @@ extern int ehca_debug_level; if (unlikely(ehca_debug_level)) \ dev_printk(KERN_DEBUG, (ib_dev)-dma_device, \ PU%04x EHCA_DBG:%s format \n, \ -get_paca()-paca_index, __FUNCTION__, \ +smp_processor_id(), __FUNCTION__, \ ## arg); \ } while (0) #define ehca_info(ib_dev, format, arg...) \ dev_info((ib_dev)-dma_device, PU%04x EHCA_INFO:%s format \n, \ - get_paca()-paca_index, __FUNCTION__, ## arg) + smp_processor_id(), __FUNCTION__, ## arg) #define ehca_warn(ib_dev, format, arg...) \ dev_warn((ib_dev)-dma_device, PU%04x EHCA_WARN:%s format \n, \ - get_paca()-paca_index, __FUNCTION__, ## arg) + smp_processor_id(), __FUNCTION__, ## arg) #define ehca_err(ib_dev, format, arg...) \ dev_err((ib_dev)-dma_device, PU%04x EHCA_ERR:%s format \n, \ - get_paca()-paca_index, __FUNCTION__, ## arg) + smp_processor_id(), __FUNCTION__, ## arg) I think I see these macros used in preemptible code (e.g. ehca_probe), where smp_processor_id() will print a warning when CONFIG_DEBUG_PREEMPT=y. Probably better to use raw_smp_processor_id. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: Hard hang in hypervisor!?
Linas Vepstas wrote: I was futzing with linux-2.6.23-rc8-mm1 in a power6 lpar when, for whatever reason, a spinlock locked up. The bizarre thing was that the rest of system locked up as well: an ssh terminal, and also an hvc console. Breaking into the debugger showed 4 cpus, 1 of which was deadlocked in the spinlock, and the other 3 in .pseries_dedicated_idle_sleep This was, ahhh, unexpected. What's up with that? Can anyone provide any insight? Sounds consistent with a task trying to double-acquire the lock, or an interrupt handler attempting to acquire a lock that the current task holds. Or maybe even an uninitialized spinlock. Do you know which lock it was? ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: Module with 36K relocation entries
Medve Emilian-EMMEDVE1 wrote: I have module with 36K relocation entries (I know, I know, how could that be...) and the count_relocs() function takes ~17 seconds to crunch through the relocation table from the .rela.text section. I don't think I can reduce the number of entries in the relocation table (can I?) so I'm thinking at improving the performance of count_relocs() (currently O(n^2)). Does anybody have a simpler idea? Does anyone have any constructive suggestion on how to improve the situation? This seems to come up every few months. There was a patch submitted here: http://ozlabs.org/pipermail/linuxppc-dev/2007-June/037641.html ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: 2.6.24-rc1 freezes on powerbook at first boot stage
(cc'ing linuxppc-dev, see http://www.mail-archive.com/[EMAIL PROTECTED]/msg221770.html for original post and .config) Elimar Riesebieter wrote: On Wed, 24 Oct 2007 the mental interface of Elimar Riesebieter told: [...] The kernel is loaded from firmware but freezes at the moment to load the radeon framebuffer. I can't get any debug info (don't know how?). Screen dump till freeze: Using PowerMac machine description Total memory = 1024MB; using 2048kB for hash table (at cfe0) Linux version 2.6.24-rc1-aragorn ([EMAIL PROTECTED]) (gcc version 4.2.3 20071014 (prelelease) (Debian 4.2.2-3)) #1 Wed Oct 24 12:48:27 CEST 2007 Found UniNorth memory controller host bridge @ 0xf800 revision: 0xd2 Mapped at 0xfdfc Found a Intrepid mac-io controller, rev: 0, mapped at 0xfdf4 PowerMac motherboard: PowerBook G4 15 console [udbg0] enabled setup_arch: bootmem Found UniNorth PCI host bridge at 0xf000. Firmware bus number: 0-1 Found UniNorth PCI host bridge at 0xf200. Firmware bus number: 0-1 Found UniNorth PCI host bridge at 0xf400. Firmware bus number: 0-1 via-pmu: Server Mode is disabled PMU driver v2 initialized for Core99, firmware: 0c nvram: Checking bank 0... nvram: gen0=741, gen1=740 nvram: Active bank is: 0 nvram: OF partition at 0x410 nvram: XP partition at 0x1020 nvram: NR partition at 0x1120 arch: exit Zone PFN ranges: DMA 0 - 196608 Normal 196608 - 196608 HighMem196608 - 262144 Movable zone start PFN for each node early_node_map[1] active PFN ranges 0:0 - 262144 Built 1 zonelists in Zone order. Total pages: 260096 Kernel command line: root=/[EMAIL PROTECTED]/[EMAIL PROTECTED]/[EMAIL PROTECTED]:5 root=/dev/hda5 mpic: Setting up MPIC MPIC 1version 1.2 at 8004, max 4 CPUs mpic: ISU size: 64, shift: 6, mask: 3f mpic: Initializing for 64 sources PID hash table entries: 4096 (order: 12, 16384 bytes) GMT Delta read from XPRAM: 0 minutes, DST: off clocksource: timebase mult[d9038e4] shift [22] registered clockevent: decremeter mult[4b7] shift[16] cpu[0] Console: colour dummy device 80x25 console handover: boot [udbg0] - real [tty0] Does 2.6.23 (or any earlier kernel) work? ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH] Balance alloc/free and ioremap/iounmap in gpio_mdio_probe (powerpc/platforms/pasemi/gpio_mdio.c)
Hi, Roel Kluin wrote: I think this is how it should be done, but please review: it was not tested. -- Balance alloc/free and ioremap/iounmap It would be more helpful if your changelog identified the objects which could be leaked. More comments below. -static int __devinit gpio_mdio_probe(struct of_device *ofdev, - const struct of_device_id *match) +static int __devinit __gpio_mdio_register_bus(struct of_device *ofdev, + const struct of_device_id *match) { struct device *dev = ofdev-dev; struct device_node *np = ofdev-node; - struct device_node *gpio_np; struct mii_bus *new_bus; struct resource res; struct gpio_priv *priv; const unsigned int *prop; - int err = 0; int i; - gpio_np = of_find_compatible_node(NULL, gpio, 1682m-gpio); - - if (!gpio_np) - return -ENODEV; - - err = of_address_to_resource(gpio_np, 0, res); - of_node_put(gpio_np); - - if (err) - return -EINVAL; - - if (!gpio_regs) - gpio_regs = ioremap(res.start, 0x100); - - if (!gpio_regs) - return -EPERM; - priv = kzalloc(sizeof(struct gpio_priv), GFP_KERNEL); - if (priv == NULL) + if (unlikely(priv == NULL)) return -ENOMEM; Please don't add likely/unlikely to non-hot paths such as this. new_bus = kzalloc(sizeof(struct mii_bus), GFP_KERNEL); - if (new_bus == NULL) - return -ENOMEM; + if (unlikely(new_bus == NULL)) + goto free_priv; again new_bus-name = pasemi gpio mdio bus, new_bus-read = gpio_mdio_read, new_bus-write = gpio_mdio_write, new_bus-reset = gpio_mdio_reset, prop = of_get_property(np, reg, NULL); new_bus-id = *prop; new_bus-priv = priv; new_bus-phy_mask = 0; new_bus-irq = kmalloc(sizeof(int)*PHY_MAX_ADDR, GFP_KERNEL); + if (unlikely(new_bus-irq == NULL)) + goto free_new_bus; + again for(i = 0; i PHY_MAX_ADDR; ++i) new_bus-irq[i] = irq_create_mapping(NULL, 10); prop = of_get_property(np, mdc-pin, NULL); priv-mdc_pin = *prop; prop = of_get_property(np, mdio-pin, NULL); priv-mdio_pin = *prop; new_bus-dev = dev; dev_set_drvdata(dev, new_bus); err = mdiobus_register(new_bus); - if (0 != err) { + if (unlikely(0 != err)) { again printk(KERN_ERR %s: Cannot register as MDIO bus, err %d\n, new_bus-name, err); goto bus_register_fail; } return 0; bus_register_fail: + kfree(new_bus-irq); +free_new_bus: kfree(new_bus); +free_priv: + kfree(priv); + + return -ENOMEM; +} + + +static int __devinit gpio_mdio_probe(struct of_device *ofdev, + const struct of_device_id *match) +{ + struct device_node *gpio_np; + int err; + + gpio_np = of_find_compatible_node(NULL, gpio, 1682m-gpio); + + if (!gpio_np) + return -ENODEV; + + err = of_address_to_resource(gpio_np, 0, res); Hmm, where is res defined? + of_node_put(gpio_np); + + if (err) + return -EINVAL; + + if (!gpio_regs) { + Unneeded newline + gpio_regs = ioremap(res.start, 0x100); + if (unlikely(!gpio_regs)) + return -EPERM; + + err = __gpio_mdio_register_bus(ofdev, match); + if (err 0) + iounmap(gpio_regs); + } else + err = __gpio_mdio_register_bus(ofdev, match); return err; + again } static int gpio_mdio_remove(struct of_device *dev) { struct mii_bus *bus = dev_get_drvdata(dev-dev); mdiobus_unregister(bus); ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
[RFC/PATCH] reduce load time for modules with lots of relocs
Nathan Lynch wrote: Medve Emilian-EMMEDVE1 wrote: I have module with 36K relocation entries (I know, I know, how could that be...) and the count_relocs() function takes ~17 seconds to crunch through the relocation table from the .rela.text section. I don't think I can reduce the number of entries in the relocation table (can I?) so I'm thinking at improving the performance of count_relocs() (currently O(n^2)). Does anybody have a simpler idea? Does anyone have any constructive suggestion on how to improve the situation? This seems to come up every few months. There was a patch submitted here: http://ozlabs.org/pipermail/linuxppc-dev/2007-June/037641.html I think this comes up often enough for us to fix it (IIRC unionfs people complained about it long ago too), and it's kind of lame to spend 17 seconds in the kernel without scheduling. So I dug up some old patches that should reduce the complexity to O(n logn), using the sort() function, which IMO is preferable to doing our own hash thing as that patch from June does. Only the 64-bit code is tested. Does this help your situation? arch/powerpc/kernel/module_32.c | 85 +++- arch/powerpc/kernel/module_64.c | 80 ++--- 2 files changed, 132 insertions(+), 33 deletions(-) diff --git a/arch/powerpc/kernel/module_32.c b/arch/powerpc/kernel/module_32.c index 07a89a3..001b941 100644 --- a/arch/powerpc/kernel/module_32.c +++ b/arch/powerpc/kernel/module_32.c @@ -24,6 +24,7 @@ #include linux/kernel.h #include linux/cache.h #include linux/bug.h +#include linux/sort.h #include setup.h @@ -50,26 +51,64 @@ void module_free(struct module *mod, void *module_region) table entries. */ } +static int reloc_cmp(const void *_a, const void *_b) +{ + const Elf32_Rela *a = *(Elf32_Rela **)_a, *b = *(Elf32_Rela **)_b; + + if (a-r_info b-r_info) + return -1; + else if (a-r_info b-r_info) + return 1; + else if (a-r_addend b-r_addend) + return -1; + else if (a-r_addend b-r_addend) + return 1; + + return 0; +} + + /* Count how many different relocations (different symbol, different addend) */ static unsigned int count_relocs(const Elf32_Rela *rela, unsigned int num) { - unsigned int i, j, ret = 0; + unsigned int i, sorted_count = 0; + Elf32_Word last_info; + Elf32_Sword last_addend; + Elf32_Rela **sortbuf = NULL; + + if (num == 0) + return 0; + + sortbuf = vmalloc(num * sizeof(*sortbuf)); + + if (!sortbuf) + return -ENOMEM; + + for (i = 0; i num; i++) + sortbuf[i] = (Elf32_Rela *)rela[i]; + + sort(sortbuf, i, sizeof(sortbuf[0]), reloc_cmp, NULL); + + last_info = sortbuf[0]-r_info; + last_addend = sortbuf[0]-r_addend; + sorted_count = 1; - /* Sure, this is order(n^2), but it's usually short, and not - time critical */ for (i = 0; i num; i++) { - for (j = 0; j i; j++) { - /* If this addend appeared before, it's - already been counted */ - if (ELF32_R_SYM(rela[i].r_info) - == ELF32_R_SYM(rela[j].r_info) -rela[i].r_addend == rela[j].r_addend) - break; - } - if (j == i) ret++; + /* If this r_info,r_addend tuple matches the previous +* entry, don't count it again +*/ + if (sortbuf[i]-r_info == last_info + sortbuf[i]-r_addend == last_addend) + continue; + + last_info = sortbuf[i]-r_info; + last_addend = sortbuf[i]-r_addend; + sorted_count++; } - return ret; + + vfree(sortbuf); + return sorted_count; } /* Get the potential trampolines size required of the init and @@ -96,15 +135,19 @@ static unsigned long get_plt_size(const Elf32_Ehdr *hdr, continue; if (sechdrs[i].sh_type == SHT_RELA) { + int count; DEBUGP(Found relocations in section %u\n, i); DEBUGP(Ptr: %p. Number: %u\n, (void *)hdr + sechdrs[i].sh_offset, sechdrs[i].sh_size / sizeof(Elf32_Rela)); - ret += count_relocs((void *)hdr + count = count_relocs((void *)hdr + sechdrs[i].sh_offset, sechdrs[i].sh_size / sizeof(Elf32_Rela)) * sizeof(struct ppc_plt_entry); + if (count 0
[RFC/PATCH] Fix rtas_ibm_suspend_me bugs
(very rfc for now, no sign-off, needs more testing) There are a couple of bugs in the rtas_ibm_suspend_me() and rtas_percpu_suspend_me() functions: 1. rtas_ibm_suspend_me() uses on_each_cpu() to invoke rtas_percpu_suspend_me() via IPI: if (on_each_cpu(rtas_percpu_suspend_me, data, 1, 0)) ... 'data' is on the stack, and rtas_ibm_suspend_me() takes no measures to ensure that all instances of rtas_percpu_suspend_me() are finished accessing 'data' before returning. This can result in the IPI'd cpus accessing random stack data and getting stuck in H_JOIN. Fix this by moving rtas_suspend_me_data off the stack, and protect it with a mutex. Use a completion to ensure that all cpus are done accessing the data before unlocking. 2. rtas_percpu_suspend_me passes the Linux logical cpu id to the H_PROD hypervisor method, when it should be using the platform interrupt server value for that cpu (hard_smp_processor_id). In practice, this probably causes problems only on partitions where processors have been removed and added in a particular order. --- arch/powerpc/kernel/rtas.c | 64 --- 1 files changed, 47 insertions(+), 17 deletions(-) diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c index 2147807..24faaea 100644 --- a/arch/powerpc/kernel/rtas.c +++ b/arch/powerpc/kernel/rtas.c @@ -19,6 +19,9 @@ #include linux/init.h #include linux/capability.h #include linux/delay.h +#include linux/mutex.h +#include linux/completion.h +#include linux/smp.h #include asm/prom.h #include asm/rtas.h @@ -34,6 +37,7 @@ #include asm/lmb.h #include asm/udbg.h #include asm/syscalls.h +#include asm/atomic.h struct rtas_t rtas = { .lock = SPIN_LOCK_UNLOCKED @@ -41,10 +45,21 @@ struct rtas_t rtas = { EXPORT_SYMBOL(rtas); struct rtas_suspend_me_data { + atomic_t working; /* number of cpus accessing rtas_suspend_me_data */ long waiting; struct rtas_args *args; + struct completion done; /* wait on this until working == 0 */ }; +static void rtas_suspend_me_data_init(struct rtas_suspend_me_data *rsmd, + struct rtas_args *args) +{ + atomic_set(rsmd-working, 0); + rsmd-waiting = 1; + rsmd-args = args; + init_completion(rsmd-done); +} + DEFINE_SPINLOCK(rtas_data_buf_lock); EXPORT_SYMBOL(rtas_data_buf_lock); @@ -671,36 +686,49 @@ static void rtas_percpu_suspend_me(void *info) * we set it to 0. */ local_irq_save(flags); + atomic_inc(data-working); do { rc = plpar_hcall_norets(H_JOIN); smp_rmb(); } while (rc == H_SUCCESS data-waiting 0); if (rc == H_SUCCESS) + /* join is complete (or there was an error) and this +* cpu was prodded +*/ goto out; if (rc == H_CONTINUE) { + /* this cpu does the join */ data-waiting = 0; data-args-args[data-args-nargs] = rtas_call(ibm_suspend_me_token, 0, 1, NULL); - for_each_possible_cpu(i) - plpar_hcall_norets(H_PROD,i); } else { data-waiting = -EBUSY; printk(KERN_ERR Error on H_JOIN hypervisor call\n); } + /* This cpu did the join or got an error, so we need to prod +* everyone else. Extra prods are harmless. +*/ + for_each_online_cpu(i) + plpar_hcall_norets(H_PROD, get_hard_smp_processor_id(i)); + out: + if (atomic_dec_return(data-working) == 0) + complete(data-done); local_irq_restore(flags); return; } +static DEFINE_MUTEX(rsm_lock); /* protects rsm_data */ +static struct rtas_suspend_me_data rsm_data; + static int rtas_ibm_suspend_me(struct rtas_args *args) { - int i; + int err; long state; long rc; unsigned long retbuf[PLPAR_HCALL_BUFSIZE]; - struct rtas_suspend_me_data data; /* Make sure the state is valid */ rc = plpar_hcall(H_VASI_STATE, retbuf, @@ -721,25 +749,27 @@ static int rtas_ibm_suspend_me(struct rtas_args *args) return 0; } - data.waiting = 1; - data.args = args; + mutex_lock(rsm_lock); + + rtas_suspend_me_data_init(rsm_data, args); - /* Call function on all CPUs. One of us will make the -* rtas call + /* Call function on all CPUs. One of us (but not necessarily +* this one) will make the ibm,suspend-me call. */ - if (on_each_cpu(rtas_percpu_suspend_me, data, 1, 0)) - data.waiting = -EINVAL; + if (on_each_cpu(rtas_percpu_suspend_me, rsm_data, 1, 0)) + rsm_data.waiting = -EINVAL; + + /* Must wait for all IPIs to complete before unlocking */ + wait_for_completion(rsm_data.done); - if (data.waiting != 0) + if
Re: [RFC/PATCH] reduce load time for modules with lots of relocs
Medve Emilian wrote: Would it be possible to do the sort in place (without the extra buffer)? I mean would that upset any other part of the kernel? I suspect so. Sounds like you've got the good testcase, maybe you should try it and see. :) ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [RFC/PATCH] Fix rtas_ibm_suspend_me bugs
Nathan Lynch wrote: (very rfc for now, no sign-off, needs more testing) There are a couple of bugs in the rtas_ibm_suspend_me() and rtas_percpu_suspend_me() functions: 1. rtas_ibm_suspend_me() uses on_each_cpu() to invoke rtas_percpu_suspend_me() via IPI: if (on_each_cpu(rtas_percpu_suspend_me, data, 1, 0)) ... 'data' is on the stack, and rtas_ibm_suspend_me() takes no measures to ensure that all instances of rtas_percpu_suspend_me() are finished accessing 'data' before returning. This can result in the IPI'd cpus accessing random stack data and getting stuck in H_JOIN. Another possible issue is that H_JOIN requires MSR.EE to be off, but lazy interrupt disabling could conceivably allow that constraint to be violated if we end up doing H_JOIN on the cpu which calls on_each_cpu(). At least I think so... ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
[PATCH v2] fix multiple bugs in rtas_ibm_suspend_me code
There are several issues with the rtas_ibm_suspend_me code, which enables platform-assisted suspension of an LPAR for migration or hibernation as covered in PAPR 2.2. 1.) rtas_ibm_suspend_me uses on_each_cpu() to invoke rtas_percpu_suspend_me on all cpus via IPI: if (on_each_cpu(rtas_percpu_suspend_me, data, 1, 0)) ... 'data' is on the calling task's stack, but rtas_ibm_suspend_me takes no measures to ensure that all instances of rtas_percpu_suspend_me are finished accessing 'data' before returning. This can result in the IPI'd cpus accessing random stack data and getting stuck in H_JOIN. This is addressed by using an atomic count of workers and a completion on the stack. 2.) rtas_percpu_suspend_me is needlessly calling H_JOIN in a loop. The only event that can cause a cpu to return from H_JOIN is an H_PROD from another cpu or a NMI/system reset. Each cpu need call H_JOIN only once per suspend operation. Remove the loop and the now unnecessary 'waiting' state variable. 3.) H_JOIN must be called with MSR[EE] off, but lazy interrupt disabling may cause the caller of rtas_ibm_suspend_me to call H_JOIN with it on; the local_irq_disable() in on_each_cpu() is not sufficient. Fix this by explicitly saving the MSR and clearing the EE bit before calling H_JOIN. 4.) H_PROD is being called with the Linux logical cpu number as the parameter, not the platform interrupt server value. (It's also being called for all possible cpus, which is harmless, but unnecessary.) This is fixed by calling H_PROD for each online cpu using get_hard_smp_processor_id(cpu) for the argument. Signed-off-by: Nathan Lynch [EMAIL PROTECTED] --- arch/powerpc/kernel/rtas.c | 94 +--- 1 files changed, 53 insertions(+), 41 deletions(-) Changes since v1: - the completion is adequate for ensuring that all IPIs have completed, so there's no need for a mutex and rtas_suspend_me_data can be on the stack as before. - add fix for hard-disabling interrupts before H_JOIN - remove unnecessary H_JOIN loop and state machinery diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c index 2147807..3afe771 100644 --- a/arch/powerpc/kernel/rtas.c +++ b/arch/powerpc/kernel/rtas.c @@ -41,8 +41,10 @@ struct rtas_t rtas = { EXPORT_SYMBOL(rtas); struct rtas_suspend_me_data { - long waiting; - struct rtas_args *args; + atomic_t working; /* number of cpus accessing this struct */ + int token; /* ibm,suspend-me */ + int error; + struct completion *complete; /* wait on this until working == 0 */ }; DEFINE_SPINLOCK(rtas_data_buf_lock); @@ -657,50 +659,62 @@ static int ibm_suspend_me_token = RTAS_UNKNOWN_SERVICE; #ifdef CONFIG_PPC_PSERIES static void rtas_percpu_suspend_me(void *info) { - int i; long rc; - long flags; + unsigned long msr_save; + int cpu; struct rtas_suspend_me_data *data = (struct rtas_suspend_me_data *)info; - /* -* We use waiting to indicate our state. As long -* as it is 0, we are still trying to all join up. -* If it goes to 0, we have successfully joined up and -* one thread got H_CONTINUE. If any error happens, -* we set it to 0. -*/ - local_irq_save(flags); - do { - rc = plpar_hcall_norets(H_JOIN); - smp_rmb(); - } while (rc == H_SUCCESS data-waiting 0); - if (rc == H_SUCCESS) - goto out; + atomic_inc(data-working); + + /* really need to ensure MSR.EE is off for H_JOIN */ + msr_save = mfmsr(); + mtmsr(msr_save ~(MSR_EE)); + + rc = plpar_hcall_norets(H_JOIN); + + mtmsr(msr_save); - if (rc == H_CONTINUE) { - data-waiting = 0; - data-args-args[data-args-nargs] = - rtas_call(ibm_suspend_me_token, 0, 1, NULL); - for_each_possible_cpu(i) - plpar_hcall_norets(H_PROD,i); + if (rc == H_SUCCESS) { + /* This cpu was prodded and the suspend is complete. */ + goto out; + } else if (rc == H_CONTINUE) { + /* All other cpus are in H_JOIN, this cpu does +* the suspend. +*/ + printk(KERN_DEBUG calling ibm,suspend-me on cpu %i\n, + smp_processor_id()); + data-error = rtas_call(data-token, 0, 1, NULL); + + if (data-error) + printk(KERN_DEBUG ibm,suspend-me returned %d\n, + data-error); } else { - data-waiting = -EBUSY; - printk(KERN_ERR Error on H_JOIN hypervisor call\n); + printk(KERN_ERR H_JOIN on cpu %i failed with rc = %ld\n, + smp_processor_id(), rc); + data-error = rc; } - + /* This cpu did the suspend or got an error; in either case
[PATCH] remove prod_processor()
prod_processor() is unused, and that's a good thing, since it does not supply the required proc id parameter to H_PROD. Signed-off-by: Nathan Lynch [EMAIL PROTECTED] --- arch/powerpc/platforms/pseries/plpar_wrappers.h |5 - 1 files changed, 0 insertions(+), 5 deletions(-) diff --git a/arch/powerpc/platforms/pseries/plpar_wrappers.h b/arch/powerpc/platforms/pseries/plpar_wrappers.h index d003c80..d8680b5 100644 --- a/arch/powerpc/platforms/pseries/plpar_wrappers.h +++ b/arch/powerpc/platforms/pseries/plpar_wrappers.h @@ -8,11 +8,6 @@ static inline long poll_pending(void) return plpar_hcall_norets(H_POLL_PENDING); } -static inline long prod_processor(void) -{ - return plpar_hcall_norets(H_PROD); -} - static inline long cede_processor(void) { return plpar_hcall_norets(H_CEDE); -- 1.5.3.4.206.g58ba4 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
[PATCH v3] fix multiple bugs in rtas_ibm_suspend_me code
There are several issues with the rtas_ibm_suspend_me code, which enables platform-assisted suspension of an LPAR for migration or hibernation as covered in PAPR 2.2. 1.) rtas_ibm_suspend_me uses on_each_cpu() to invoke rtas_percpu_suspend_me on all cpus via IPI: if (on_each_cpu(rtas_percpu_suspend_me, data, 1, 0)) ... 'data' is on the calling task's stack, but rtas_ibm_suspend_me takes no measures to ensure that all instances of rtas_percpu_suspend_me are finished accessing 'data' before returning. This can result in the IPI'd cpus accessing random stack data and getting stuck in H_JOIN. This is addressed by using an atomic count of workers and a completion on the stack. 2.) rtas_percpu_suspend_me is needlessly calling H_JOIN in a loop. The only event that can cause a cpu to return from H_JOIN is an H_PROD from another cpu or a NMI/system reset. Each cpu need call H_JOIN only once per suspend operation. Remove the loop and the now unnecessary 'waiting' state variable. 3.) H_JOIN must be called with MSR[EE] off, but lazy interrupt disabling may cause the caller of rtas_ibm_suspend_me to call H_JOIN with it on; the local_irq_disable() in on_each_cpu() is not sufficient. Fix this by explicitly saving the MSR and clearing the EE bit before calling H_JOIN. 4.) H_PROD is being called with the Linux logical cpu number as the parameter, not the platform interrupt server value. (It's also being called for all possible cpus, which is harmless, but unnecessary.) This is fixed by calling H_PROD for each online cpu using get_hard_smp_processor_id(cpu) for the argument. Signed-off-by: Nathan Lynch [EMAIL PROTECTED] --- Paul Mackerras wrote: On a uniprocessor configuration, with this patch I get: CC arch/powerpc/kernel/rtas.o /home/paulus/kernel/powerpc/arch/powerpc/kernel/rtas.c: In function ‘rtas_percpu_suspend_me’: /home/paulus/kernel/powerpc/arch/powerpc/kernel/rtas.c:702: error: implicit declaration of function ‘get_hard_smp_processor_id make[2]: *** [arch/powerpc/kernel/rtas.o] Error 1 I think you need to #include asm/smp.h in rtas.c. Rather sloppy of me, sorry. Changes since v2: - Add appropriate #includes for APIs used, fixing SMP=n build Changes since v1: - the completion is adequate for ensuring that all IPIs have completed, so there's no need for a mutex and rtas_suspend_me_data can be on the stack as before. - add fix for hard-disabling interrupts before H_JOIN - remove unnecessary H_JOIN loop and state machinery arch/powerpc/kernel/rtas.c | 99 ++-- 1 files changed, 58 insertions(+), 41 deletions(-) diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c index 2147807..52e95c2 100644 --- a/arch/powerpc/kernel/rtas.c +++ b/arch/powerpc/kernel/rtas.c @@ -19,6 +19,9 @@ #include linux/init.h #include linux/capability.h #include linux/delay.h +#include linux/smp.h +#include linux/completion.h +#include linux/cpumask.h #include asm/prom.h #include asm/rtas.h @@ -34,6 +37,8 @@ #include asm/lmb.h #include asm/udbg.h #include asm/syscalls.h +#include asm/smp.h +#include asm/atomic.h struct rtas_t rtas = { .lock = SPIN_LOCK_UNLOCKED @@ -41,8 +46,10 @@ struct rtas_t rtas = { EXPORT_SYMBOL(rtas); struct rtas_suspend_me_data { - long waiting; - struct rtas_args *args; + atomic_t working; /* number of cpus accessing this struct */ + int token; /* ibm,suspend-me */ + int error; + struct completion *complete; /* wait on this until working == 0 */ }; DEFINE_SPINLOCK(rtas_data_buf_lock); @@ -657,50 +664,62 @@ static int ibm_suspend_me_token = RTAS_UNKNOWN_SERVICE; #ifdef CONFIG_PPC_PSERIES static void rtas_percpu_suspend_me(void *info) { - int i; long rc; - long flags; + unsigned long msr_save; + int cpu; struct rtas_suspend_me_data *data = (struct rtas_suspend_me_data *)info; - /* -* We use waiting to indicate our state. As long -* as it is 0, we are still trying to all join up. -* If it goes to 0, we have successfully joined up and -* one thread got H_CONTINUE. If any error happens, -* we set it to 0. -*/ - local_irq_save(flags); - do { - rc = plpar_hcall_norets(H_JOIN); - smp_rmb(); - } while (rc == H_SUCCESS data-waiting 0); - if (rc == H_SUCCESS) - goto out; + atomic_inc(data-working); + + /* really need to ensure MSR.EE is off for H_JOIN */ + msr_save = mfmsr(); + mtmsr(msr_save ~(MSR_EE)); + + rc = plpar_hcall_norets(H_JOIN); + + mtmsr(msr_save); - if (rc == H_CONTINUE) { - data-waiting = 0; - data-args-args[data-args-nargs] = - rtas_call(ibm_suspend_me_token, 0, 1, NULL); - for_each_possible_cpu(i) - plpar_hcall_norets(H_PROD,i); + if (rc
Re: [PATCH v3] fix multiple bugs in rtas_ibm_suspend_me code
Nathan Lynch wrote: 3.) H_JOIN must be called with MSR[EE] off, but lazy interrupt disabling may cause the caller of rtas_ibm_suspend_me to call H_JOIN with it on; the local_irq_disable() in on_each_cpu() is not sufficient. Fix this by explicitly saving the MSR and clearing the EE bit before calling H_JOIN. ... + atomic_inc(data-working); + + /* really need to ensure MSR.EE is off for H_JOIN */ + msr_save = mfmsr(); + mtmsr(msr_save ~(MSR_EE)); + + rc = plpar_hcall_norets(H_JOIN); + + mtmsr(msr_save); BTW, I'm wondering if this is the right way to do this. I think there's the possibility that we could enter this routine hard-enabled and take take an interrupt between the mfmsr and the first mtmsr, but I haven't worked out all the implications. Would hard_irq_disable be better? ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
[PATCH] use correct ccr bit for syscall error status
The powerpc implementations of syscall_get_error and syscall_set_return_value should use CCR0:S0 (0x1000) for testing and setting syscall error status. Fortunately these APIs don't seem to be used at the moment. Signed-off-by: Nathan Lynch n...@pobox.com --- arch/powerpc/include/asm/syscall.h |6 +++--- 1 files changed, 3 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/include/asm/syscall.h b/arch/powerpc/include/asm/syscall.h index efa7f0b..23913e9 100644 --- a/arch/powerpc/include/asm/syscall.h +++ b/arch/powerpc/include/asm/syscall.h @@ -30,7 +30,7 @@ static inline void syscall_rollback(struct task_struct *task, static inline long syscall_get_error(struct task_struct *task, struct pt_regs *regs) { - return (regs-ccr 0x1000) ? -regs-gpr[3] : 0; + return (regs-ccr 0x1000) ? -regs-gpr[3] : 0; } static inline long syscall_get_return_value(struct task_struct *task, @@ -44,10 +44,10 @@ static inline void syscall_set_return_value(struct task_struct *task, int error, long val) { if (error) { - regs-ccr |= 0x1000L; + regs-ccr |= 0x1000L; regs-gpr[3] = -error; } else { - regs-ccr = ~0x1000L; + regs-ccr = ~0x1000L; regs-gpr[3] = val; } } -- 1.6.0.6 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH] [POWERPC] 85xx: Add next-level-cache property
Segher Boessenkool wrote: The PowerPC binding defines an l2-cache property for this (it points from CPU node to L2 cache node, from L2 cache node to L3 cache node, from L3 cache node to L4 cache node, etc.) So looking at the PPC binding its not terrible clear about l3-cache being a valid property. It isn't. The property is called l2-cache at every level. I believe the discussion w/ePAPR was to create something a bit more generic and clarify/update the PPC binding. Nasty. Sure, l2-cache isn't the nicest name to point to deeper cache levels, but introducing a new property with (substantially) the same semantics is worse. The semantics appear to be identical, even. There really shouldn't be a new property name until new functionality is introduced. For example, it could allow to describe more than one cache at each level (the current binding already allows more than one parent for each cache, but only one child; and cache hierarchies like that actually exist). I'm going to stick with the new binding as we don't use this linking currently. Dunno what's the best thing to do here. If you don't need the functionality yet, it might be best to postpone putting either property in there. Sigh, what a mess. Does existing practice count for anything? IBM pseries firmware uses the l2-cache property as described in the PowerPC binding. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH 2/3] [2.6.26] ehea: Add dependency to Kconfig
Hannes Hering wrote: On Wednesday 28 May 2008 18:44:05 Nathan Lynch wrote: Hannes Hering wrote: The new ehea memory hot plug implementation depends on MEMORY_HOTPLUG. Signed-off-by: Hannes Hering [EMAIL PROTECTED] --- diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig index f90a86b..181cd86 100644 --- a/drivers/net/Kconfig +++ b/drivers/net/Kconfig @@ -2440,7 +2440,7 @@ config CHELSIO_T3 config EHEA tristate eHEA Ethernet support - depends on IBMEBUS INET SPARSEMEM + depends on IBMEBUS INET SPARSEMEM MEMORY_HOTPLUG select INET_LRO ---help--- This driver supports the IBM pSeries eHEA ethernet adapter. I disagree with this change. It makes it impossible to build the ehea driver without memory hotplug enabled. Presumably, this commit was intended to work around a build break of this sort (with EHEA=m and MEMORY_HOTPLUG=n): drivers/net/ehea/ehea_qmr.c: In function 'ehea_create_busmap': drivers/net/ehea/ehea_qmr.c:635: error: implicit declaration of function 'walk_memory_resource' (some indication of this should have been in the commit message, btw) I think this was the wrong way to fix the issue. EHEA=m and MEMORY_HOTPLUG=n is a valid configuration for machines I test. Any thoughts on the following, which makes walk_memory_resource() available regardless of MEMORY_HOTPLUG's setting? I've tested it on a JS22 (Power6 blade). I agree that the ehea cannot be built without MEMORY_HOTPLUG. The problem is the fact that the ppc walk_memory_resource declaration is in the scope of MEMORY_HOTPLUG. At the moment I don't have complete overview if the move of the code as you propose in your patch has any side effects. We probably need to talk to Badari who provided the walk_memory_resource code. We can also just throw it onto one of our boxes to see what happens. ;) I would certainly appreciate any additional testing. You wrote the ehea code that uses walk_memory_resource, so I was hoping you could speak to whether ehea needs that interface when CONFIG_MEMORY_HOTPLUG=n. Or maybe there should be a no-op version of walk_memory_resource for that case? And what about the arch-independent version in kernel/resource.c? Badari? It would be nice to get this resolved for 2.6.26 -- this new dependency causes working 2.6.25 configs to effectively fail (by deselecting CONFIG_EHEA during make oldconfig). ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
[PATCH 1/2] make walk_memory_resource available with MEMORY_HOTPLUG=n
The ehea driver was recently changed[1] to use walk_memory_resource() to detect the system's memory layout. However, walk_memory_resource() is available only when memory hotplug is enabled. So CONFIG_EHEA was made to depend on MEMORY_HOTPLUG [2], but it is inappropriate for a network driver to have such a dependency. Make the declaration of walk_memory_resource() and its powerpc implementation (ehea is powerpc-specific) unconditionally available. [1] 48cfb14f8b89d4d5b3df6c16f08b258686fb12ad ehea: Add DLPAR memory remove support [2] fb7b6ca2b6b7c23b52be143bdd5f55a23b9780c8 ehea: Add dependency to Kconfig Signed-off-by: Nathan Lynch [EMAIL PROTECTED] --- arch/powerpc/mm/mem.c |3 +-- include/linux/memory_hotplug.h | 16 2 files changed, 9 insertions(+), 10 deletions(-) diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c index f67e118..51f82d8 100644 --- a/arch/powerpc/mm/mem.c +++ b/arch/powerpc/mm/mem.c @@ -151,6 +151,7 @@ out: return ret; } #endif /* CONFIG_MEMORY_HOTREMOVE */ +#endif /* CONFIG_MEMORY_HOTPLUG */ /* * walk_memory_resource() needs to make sure there is no holes in a given @@ -184,8 +185,6 @@ walk_memory_resource(unsigned long start_pfn, unsigned long nr_pages, void *arg, } EXPORT_SYMBOL_GPL(walk_memory_resource); -#endif /* CONFIG_MEMORY_HOTPLUG */ - void show_mem(void) { unsigned long total = 0, reserved = 0; diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h index 73e3586..ea9f5ad 100644 --- a/include/linux/memory_hotplug.h +++ b/include/linux/memory_hotplug.h @@ -77,14 +77,6 @@ extern int __add_pages(struct zone *zone, unsigned long start_pfn, extern int __remove_pages(struct zone *zone, unsigned long start_pfn, unsigned long nr_pages); -/* - * Walk through all memory which is registered as resource. - * arg is (start_pfn, nr_pages, private_arg_pointer) - */ -extern int walk_memory_resource(unsigned long start_pfn, - unsigned long nr_pages, void *arg, - int (*func)(unsigned long, unsigned long, void *)); - #ifdef CONFIG_NUMA extern int memory_add_physaddr_to_nid(u64 start); #else @@ -199,6 +191,14 @@ static inline void register_page_bootmem_info_node(struct pglist_data *pgdat) #endif /* ! CONFIG_MEMORY_HOTPLUG */ +/* + * Walk through all memory which is registered as resource. + * arg is (start_pfn, nr_pages, private_arg_pointer) + */ +extern int walk_memory_resource(unsigned long start_pfn, + unsigned long nr_pages, void *arg, + int (*func)(unsigned long, unsigned long, void *)); + extern int add_memory(int nid, u64 start, u64 size); extern int arch_add_memory(int nid, u64 start, u64 size); extern int remove_memory(u64 start, u64 size); -- 1.5.5 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
[PATCH 2/2] ehea: remove dependency on MEMORY_HOTPLUG
Now that walk_memory_resource() is available regardless of MEMORY_HOTPLUG's setting, this dependency is not needed. Signed-off-by: Nathan Lynch [EMAIL PROTECTED] --- drivers/net/Kconfig |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig index dd0ec9e..f4182cf 100644 --- a/drivers/net/Kconfig +++ b/drivers/net/Kconfig @@ -2426,7 +2426,7 @@ config CHELSIO_T3 config EHEA tristate eHEA Ethernet support - depends on IBMEBUS INET SPARSEMEM MEMORY_HOTPLUG + depends on IBMEBUS INET SPARSEMEM select INET_LRO ---help--- This driver supports the IBM pSeries eHEA ethernet adapter. -- 1.5.5 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH 1/2] make walk_memory_resource available with MEMORY_HOTPLUG=n
Nathan Lynch wrote: The ehea driver was recently changed[1] to use walk_memory_resource() to detect the system's memory layout. However, walk_memory_resource() is available only when memory hotplug is enabled. So CONFIG_EHEA was made to depend on MEMORY_HOTPLUG [2], but it is inappropriate for a network driver to have such a dependency. Make the declaration of walk_memory_resource() and its powerpc implementation (ehea is powerpc-specific) unconditionally available. Paul, can you take these two patches, or should I send them to akpm? ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH] Power5,Power6 BSR driver
Hi, mainly a couple of coding style things, but one minor bug (I think). [EMAIL PROTECTED] wrote: From: Sonny Rao [EMAIL PROTECTED] +static int bsr_mmap(struct file *filp, struct vm_area_struct *vma) +{ + unsigned long size = vma-vm_end - vma-vm_start; + struct bsr_dev *dev = filp-private_data; + + if (size dev-bsr_len || (size (PAGE_SIZE-1))) + return -EINVAL; + + vma-vm_flags |= (VM_IO | VM_DONTEXPAND); + vma-vm_page_prot = pgprot_noncached(vma-vm_page_prot); + + if (io_remap_pfn_range(vma, vma-vm_start, dev-bsr_addr PAGE_SHIFT, +size, vma-vm_page_prot)) + return -EAGAIN; Indentation is wrong. +static void bsr_cleanup_devs(void) +{ + int i; + for (i=0 ; i num_bsr_devs; i++) { i = 0 + struct bsr_dev *cur = bsr_devs + i; + if (cur-bsr_device) { + cdev_del(cur-bsr_cdev); + device_del(cur-bsr_device); + } + } + + kfree(bsr_devs); +} + +static int bsr_create_devs(struct device_node *bn) +{ + int reg_len, bsr_stride_len, bsr_bytes_len; + const u64 *reg; + const u32 *bsr_stride; + const u32 *bsr_bytes; + unsigned i; + + reg= of_get_property(bn, reg, reg_len); + bsr_stride = of_get_property(bn, ibm,lock-stride, bsr_stride_len); + bsr_bytes = of_get_property(bn, ibm,#lock-bytes, bsr_bytes_len); + + if (!reg || !bsr_stride || !bsr_bytes || + (bsr_stride_len != bsr_bytes_len) || + (bsr_stride_len/4 != reg_len/16)) { bsr_stride_len / 4 != reg_len / 16 + printk(KERN_ERR bsr of-node has missing/incorrect property\n); + return -ENODEV; + } ... +static int __init bsr_init(void) +{ + struct device_node *np; + dev_t bsr_dev = MKDEV(bsr_major, 0); + int ret = -ENODEV; + int result; + + np = of_find_compatible_node(NULL, ibm,bsr, ibm,bsr); + if (!np) + goto out_err; + + bsr_class = class_create(THIS_MODULE, bsr); + if (IS_ERR(bsr_class)) { + printk(KERN_ERR class_create() failed for bsr_class\n); + goto out_err; At this point I think you can leak a reference to np. + } + bsr_class-dev_attrs = bsr_dev_attrs; + + result = alloc_chrdev_region(bsr_dev, 0, BSR_MAX_DEVS, bsr); + bsr_major = MAJOR(bsr_dev); + if (result 0) { + printk(KERN_ERR alloc_chrdev_region() failed for bsr\n); + goto out_err_1; + } + + if ((ret = bsr_create_devs(np)) 0) + goto out_err_2; + + of_node_put(np); + + return 0; + + out_err_2: + unregister_chrdev_region(bsr_dev, BSR_MAX_DEVS); + + out_err_1: + class_destroy(bsr_class); + of_node_put(np); + + out_err: + + return ret; +} ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
[RFC/PATCH 0/3] sched: allow arch override of cpu power
There is an interesting quality of POWER6 cores, which each have 2 hardware threads: assuming one thread on the core is idle, the primary thread is a little faster than the secondary thread. To illustrate: for cpumask in 0x1 0x2 ; do taskset $cpumask /usr/bin/time -f %e elapsed, %U user, %S sys \ /bin/sh -c i=100 ; while (( i-- )) ; do : ; done done 17.05 elapsed, 16.83 user, 0.22 sys 17.54 elapsed, 17.32 user, 0.22 sys (The first result is for a primary thread; the second result for a secondary thread.) So it would be nice to have the scheduler slightly prefer primary threads on POWER6 machines. These patches, which allow the architecture to override the scheduler's CPU power calculation, are one possible approach, but I'm open to others. Please note: these seemed to have the desired effect on 2.6.25-rc kernels (2-3% improvement in a kernbench-like make -j nr_cores), but I'm not seeing this improvement with 2.6.26-rc kernels for some reason I am still trying to track down. Nathan Lynch (3): sched: support arch override of sched_group cpu power add cpu_power to machdep_calls, override SD_SIBLING_INIT adjust cpu power for secondary threads on POWER6 arch/powerpc/kernel/setup-common.c |7 ++ arch/powerpc/platforms/pseries/setup.c | 37 include/asm-powerpc/cputable.h |3 +- include/asm-powerpc/machdep.h |2 + include/asm-powerpc/topology.h | 31 ++ include/linux/sched.h |1 + kernel/sched.c | 14 7 files changed, 94 insertions(+), 1 deletions(-) ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
[RFC/PATCH 2/3] add cpu_power to machdep_calls, override SD_SIBLING_INIT
Add a cpu_power() call to machdep_calls, which will allow platforms to override the scheduler's default cpu power calculation. If the platform does not provide a cpu_power() method, the scheduler's default value is used. Copy the default SD_SIBLING_INIT to powerpc's topology.h and add the SD_ASYM_CPU_POWER flag, which will cause ppc_md.cpu_power() to be invoked (via arch_cpu_power()) during sched domain initialization. Signed-off-by: Nathan Lynch [EMAIL PROTECTED] --- arch/powerpc/kernel/setup-common.c |7 +++ include/asm-powerpc/machdep.h |2 ++ include/asm-powerpc/topology.h | 31 +++ 3 files changed, 40 insertions(+), 0 deletions(-) diff --git a/arch/powerpc/kernel/setup-common.c b/arch/powerpc/kernel/setup-common.c index db540ea..609d09e 100644 --- a/arch/powerpc/kernel/setup-common.c +++ b/arch/powerpc/kernel/setup-common.c @@ -671,3 +671,10 @@ static int powerpc_debugfs_init(void) } arch_initcall(powerpc_debugfs_init); #endif + +unsigned int arch_cpu_power(int cpu, unsigned int default_power) +{ + if (ppc_md.cpu_power) + return ppc_md.cpu_power(cpu, default_power); + return default_power; +} diff --git a/include/asm-powerpc/machdep.h b/include/asm-powerpc/machdep.h index 54ed64d..de6ff6b 100644 --- a/include/asm-powerpc/machdep.h +++ b/include/asm-powerpc/machdep.h @@ -260,6 +260,8 @@ struct machdep_calls { void (*suspend_disable_irqs)(void); void (*suspend_enable_irqs)(void); #endif + /* Override scheduler's cpu power calculation */ + unsigned int (*cpu_power)(int cpu, unsigned int default_power); }; extern void power4_idle(void); diff --git a/include/asm-powerpc/topology.h b/include/asm-powerpc/topology.h index 100c6fb..4335c15 100644 --- a/include/asm-powerpc/topology.h +++ b/include/asm-powerpc/topology.h @@ -72,6 +72,37 @@ static inline int pcibus_to_node(struct pci_bus *bus) .nr_balance_failed = 0,\ } +#define SD_SIBLING_INIT (struct sched_domain) {\ + .span = CPU_MASK_NONE,\ + .parent = NULL, \ + .child = NULL, \ + .groups = NULL, \ + .min_interval = 1,\ + .max_interval = 2,\ + .busy_factor= 64, \ + .imbalance_pct = 110, \ + .cache_nice_tries = 0,\ + .busy_idx = 0,\ + .idle_idx = 0,\ + .newidle_idx= 0,\ + .wake_idx = 0,\ + .forkexec_idx = 0,\ + .flags = SD_LOAD_BALANCE \ + | SD_BALANCE_NEWIDLE\ + | SD_BALANCE_FORK \ + | SD_BALANCE_EXEC \ + | SD_WAKE_AFFINE\ + | SD_WAKE_IDLE \ + | SD_SHARE_CPUPOWER \ + | SD_ASYM_CPU_POWER,\ + .last_balance = jiffies, \ + .balance_interval = 1,\ + .nr_balance_failed = 0,\ +} + +#define arch_cpu_power(cpu, power) arch_cpu_power(cpu, power) +extern unsigned int arch_cpu_power(int cpu, unsigned int default_power); + extern void __init dump_numa_cpu_topology(void); extern int sysfs_add_device_to_node(struct sys_device *dev, int nid); -- 1.5.5 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
[RFC/PATCH 1/3] sched: support arch override of sched_group cpu power
Add a new sched domain flag, SD_ASYM_CPU_POWER, which signifies that the architecture may override the cpu power for a cpu via a hook in init_sched_groups_power(). Add a dummy definition of arch_cpu_power() which conforms with the existing behavior. Signed-off-by: Nathan Lynch [EMAIL PROTECTED] --- include/linux/sched.h |1 + kernel/sched.c| 14 ++ 2 files changed, 15 insertions(+), 0 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index c5d3f84..cfbefca 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -713,6 +713,7 @@ enum cpu_idle_type { #define SD_SHARE_PKG_RESOURCES 512 /* Domain members share cpu pkg resources */ #define SD_SERIALIZE 1024/* Only a single load balancing instance */ #define SD_WAKE_IDLE_FAR 2048/* Gain latency sacrificing cache hit */ +#define SD_ASYM_CPU_POWER 4096/* Domain members of unequal power */ #define BALANCE_FOR_MC_POWER \ (sched_smt_power_savings ? SD_POWERSAVINGS_BALANCE : 0) diff --git a/kernel/sched.c b/kernel/sched.c index eaf6751..3fba083 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -6761,6 +6761,13 @@ static void free_sched_groups(const cpumask_t *cpu_map, cpumask_t *nodemask) } #endif +#ifndef arch_cpu_power +static inline unsigned int arch_cpu_power(int cpu, unsigned int default_power) +{ + return default_power; +} +#endif + /* * Initialize sched groups cpu_power. * @@ -6789,6 +6796,13 @@ static void init_sched_groups_power(int cpu, struct sched_domain *sd) sd-groups-__cpu_power = 0; + if (!child (sd-flags SD_ASYM_CPU_POWER)) { + unsigned int power = arch_cpu_power(cpu, SCHED_LOAD_SCALE); + + sg_inc_cpu_power(sd-groups, power); + return; + } + /* * For perf policy, if the groups in child domain share resources * (for example cores sharing some portions of the cache hierarchy -- 1.5.5 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
[RFC/PATCH 3/3] adjust cpu power for secondary threads on POWER6
On POWER6 processors, cpu-bound programs generally perform better on the primary thread when the secondary thread is idle than they do on the secondary thread while the primary thread is idle. This difference can be observed by timing a simple shell loop: for cpumask in 0x1 0x2 ; do taskset $cpumask /usr/bin/time -f %e elapsed, %U user, %S sys \ /bin/sh -c i=100 ; while (( i-- )) ; do : ; done done 17.05 elapsed, 16.83 user, 0.22 sys 17.54 elapsed, 17.32 user, 0.22 sys (The first result is for a primary thread; the second result for a secondary thread.) So we want the CPU scheduler to slightly favor primary threads on POWER6. Add a new cpu feature bit which indicates the need to override the scheduler's cpu power calculation. Implement ppc_md.cpu_power for the pseries platform, and scale secondary threads' cpu power to 97% of the (default) primary threads' cpu power. Allow this percentage to be overriden on the kernel command line via sec_thread_power_scale=. Signed-off-by: Nathan Lynch [EMAIL PROTECTED] --- arch/powerpc/platforms/pseries/setup.c | 37 include/asm-powerpc/cputable.h |3 +- 2 files changed, 39 insertions(+), 1 deletions(-) diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c index f5d29f5..a1141c0 100644 --- a/arch/powerpc/platforms/pseries/setup.c +++ b/arch/powerpc/platforms/pseries/setup.c @@ -501,6 +501,42 @@ void pSeries_power_off(void) for (;;); } +/* Percentage by which the cpu power of secondary threads is adjusted */ +static unsigned int sec_thread_power_scale = 97; +static int __init setup_sec_thread_power_scale(char *str) +{ + int power; + + if (get_option(str, power) power 0 power = 100) + sec_thread_power_scale = power; + + return 1; +} +__setup(sec_thread_power_scale=, setup_sec_thread_power_scale); + +static unsigned int pseries_cpu_power(int cpu, unsigned int default_power) +{ + struct device_node *np; + unsigned int thread, power; + + if (!cpu_has_feature(CPU_FTR_ASYM_POWER)) + return default_power; + + power = default_power; + + np = of_get_cpu_node(cpu, thread); + WARN_ON(!np); + if (!np) + goto out; + + /* If this isn't a primary thread, scale the power */ + if (thread != 0) + power = default_power * sec_thread_power_scale / 100; +out: + of_node_put(np); + return power; +} + #ifndef CONFIG_PCI void pSeries_final_fixup(void) { } #endif @@ -525,4 +561,5 @@ define_machine(pseries) { .progress = rtas_progress, .system_reset_exception = pSeries_system_reset_exception, .machine_check_exception = pSeries_machine_check_exception, + .cpu_power = pseries_cpu_power, }; diff --git a/include/asm-powerpc/cputable.h b/include/asm-powerpc/cputable.h index 1e79673..eb886a9 100644 --- a/include/asm-powerpc/cputable.h +++ b/include/asm-powerpc/cputable.h @@ -152,6 +152,7 @@ extern void do_feature_fixups(unsigned long value, void *fixup_start, #define CPU_FTR_UNIFIED_ID_CACHE ASM_CONST(0x0100) #define CPU_FTR_SPEASM_CONST(0x0200) #define CPU_FTR_NEED_PAIRED_STWCX ASM_CONST(0x0400) +#define CPU_FTR_ASYM_POWER ASM_CONST(0x0800) /* * Add the 64-bit processor unique features in the top half of the word; @@ -375,7 +376,7 @@ extern void do_feature_fixups(unsigned long value, void *fixup_start, CPU_FTR_MMCRA | CPU_FTR_SMT | \ CPU_FTR_COHERENT_ICACHE | CPU_FTR_LOCKLESS_TLBIE | \ CPU_FTR_PURR | CPU_FTR_SPURR | CPU_FTR_REAL_LE | \ - CPU_FTR_DSCR) + CPU_FTR_DSCR | CPU_FTR_ASYM_POWER) #define CPU_FTRS_CELL (CPU_FTR_USE_TB | \ CPU_FTR_HPTE_TABLE | CPU_FTR_PPCAS_ARCH_V2 | CPU_FTR_CTRL | \ CPU_FTR_ALTIVEC_COMP | CPU_FTR_MMCRA | CPU_FTR_SMT | \ -- 1.5.5 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [RFC/PATCH 0/3] sched: allow arch override of cpu power
Ingo Molnar wrote: * Nathan Lynch [EMAIL PROTECTED] wrote: So it would be nice to have the scheduler slightly prefer primary threads on POWER6 machines. These patches, which allow the architecture to override the scheduler's CPU power calculation, are one possible approach, but I'm open to others. Please note: these seemed to have the desired effect on 2.6.25-rc kernels (2-3% improvement in a kernbench-like make -j nr_cores), but I'm not seeing this improvement with 2.6.26-rc kernels for some reason I am still trying to track down. ok, i guess that discrepancy has to be tracked down before we can think about these patches - but the principle is OK. Great. I'll keep trying to figure out what's going on there. One problem is that the whole cpu-power balancing code in sched.c is a bit ... unclear and under-documented. So any change to this area should begin at documenting the basics: what do the units mean exactly, how are they used in balancing and what is the desired effect. I'd not be surprised if there were a few buglets in this area, SMT is not at the forefront of testing at the moment. There's nothing spectacularly broken in it (i have a HT machine myself), but the concepts have bitrotten a bit. Patches - even if they just add comments - are welcome :-) Okay, I'll have a look. Thanks Ingo. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [RFC/PATCH 0/3] sched: allow arch override of cpu power
Breno Leitao wrote: Nathan Lynch wrote: There is an interesting quality of POWER6 cores, which each have 2 hardware threads: assuming one thread on the core is idle, the primary thread is a little faster than the secondary thread. To illustrate: I found this feature interesting and decided to do some tests. After some tests I found that the example you post really runs fast in the first CPU, but a more elaborated application runs slower on the first CPU. Here is a small example: # taskset 0x1 time -f %e, %U, %S ./a.out ; taskset 0x2 time -f %e, %U, %S ./a.out 10.77, 10.72, 0.01 10.53, 10.48, 0.01 # taskset 0x2 time -f %e, %U, %S ./a.out ; taskset 0x1 time -f %e, %U, %S ./a.out 10.55, 10.50, 0.01 10.77, 10.72, 0.01 I've been able to duplicate your results, thanks for the testcase. Guess I'll need to understand what's going on before continuing with this... ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
[PATCH] Add PPC_FEATURE_PMU_COMPAT
Beginning with Power6, there is a set of 32 PMU events which is compatible across POWER processor lines. PPC_FEATURE_PMU_COMPAT indicates support for this subset. Signed-off-by: Nathan Lynch [EMAIL PROTECTED] --- arch/powerpc/kernel/cputable.c |4 ++-- include/asm-powerpc/cputable.h |1 + 2 files changed, 3 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/kernel/cputable.c b/arch/powerpc/kernel/cputable.c index 817cea1..a9c90be 100644 --- a/arch/powerpc/kernel/cputable.c +++ b/arch/powerpc/kernel/cputable.c @@ -70,10 +70,10 @@ extern void __restore_cpu_power7(void); PPC_FEATURE_SMT | PPC_FEATURE_ICACHE_SNOOP) #define COMMON_USER_POWER6 (COMMON_USER_PPC64 | PPC_FEATURE_ARCH_2_05 |\ PPC_FEATURE_SMT | PPC_FEATURE_ICACHE_SNOOP | \ -PPC_FEATURE_TRUE_LE) +PPC_FEATURE_TRUE_LE | PPC_FEATURE_PMU_COMPAT) #define COMMON_USER_POWER7 (COMMON_USER_PPC64 | PPC_FEATURE_ARCH_2_06 |\ PPC_FEATURE_SMT | PPC_FEATURE_ICACHE_SNOOP | \ -PPC_FEATURE_TRUE_LE) +PPC_FEATURE_TRUE_LE | PPC_FEATURE_PMU_COMPAT) #define COMMON_USER_PA6T (COMMON_USER_PPC64 | PPC_FEATURE_PA6T |\ PPC_FEATURE_TRUE_LE | \ PPC_FEATURE_HAS_ALTIVEC_COMP) diff --git a/include/asm-powerpc/cputable.h b/include/asm-powerpc/cputable.h index 4e4491c..a9c1de3 100644 --- a/include/asm-powerpc/cputable.h +++ b/include/asm-powerpc/cputable.h @@ -26,6 +26,7 @@ #define PPC_FEATURE_POWER6_EXT 0x0200 #define PPC_FEATURE_ARCH_2_06 0x0100 #define PPC_FEATURE_HAS_VSX0x0080 +#define PPC_FEATURE_PMU_COMPAT 0x0040 #define PPC_FEATURE_TRUE_LE0x0002 #define PPC_FEATURE_PPC_LE 0x0001 -- 1.5.5.1 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
[PATCH 1/2] elf loader support for auxvec base platform string
Some IBM POWER-based platforms have the ability to run in a mode which mostly appears to the OS as a different processor from the actual hardware. For example, a Power6 system may appear to be a Power5+, which makes the AT_PLATFORM value power5+. However, some applications (virtual machines, optimized libraries) can benefit from knowledge of the underlying CPU model. A new aux vector entry, AT_BASE_PLATFORM, will denote the actual hardware. For example, on a Power6 system in Power5+ compatibility mode, AT_PLATFORM will be power5+ and AT_BASE_PLATFORM will be power6. If the architecture has defined ELF_BASE_PLATFORM, copy that value to the user stack in the same manner as ELF_PLATFORM. Signed-off-by: Nathan Lynch [EMAIL PROTECTED] --- Next patch implements ELF/AT_BASE_PLATFORM for powerpc. fs/binfmt_elf.c | 23 +++ 1 files changed, 23 insertions(+), 0 deletions(-) diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c index d48ff5f..834c2c4 100644 --- a/fs/binfmt_elf.c +++ b/fs/binfmt_elf.c @@ -131,6 +131,10 @@ static int padzero(unsigned long elf_bss) #define STACK_ALLOC(sp, len) ({ sp -= len ; sp; }) #endif +#ifndef ELF_BASE_PLATFORM +#define ELF_BASE_PLATFORM NULL +#endif + static int create_elf_tables(struct linux_binprm *bprm, struct elfhdr *exec, unsigned long load_addr, unsigned long interp_load_addr) @@ -142,7 +146,9 @@ create_elf_tables(struct linux_binprm *bprm, struct elfhdr *exec, elf_addr_t __user *envp; elf_addr_t __user *sp; elf_addr_t __user *u_platform; + elf_addr_t __user *u_base_platform; const char *k_platform = ELF_PLATFORM; + const char *k_base_platform = ELF_BASE_PLATFORM; int items; elf_addr_t *elf_info; int ei_index = 0; @@ -172,6 +178,19 @@ create_elf_tables(struct linux_binprm *bprm, struct elfhdr *exec, return -EFAULT; } + /* +* If this architecture has a base platform capability +* string, copy it to userspace. +*/ + u_base_platform = NULL; + if (k_base_platform) { + size_t len = strlen(k_base_platform) + 1; + + u_base_platform = (elf_addr_t __user *)STACK_ALLOC(p, len); + if (__copy_to_user(u_base_platform, k_base_platform, len)) + return -EFAULT; + } + /* Create the ELF interpreter info */ elf_info = (elf_addr_t *)current-mm-saved_auxv; /* update AT_VECTOR_SIZE_BASE if the number of NEW_AUX_ENT() changes */ @@ -208,6 +227,10 @@ create_elf_tables(struct linux_binprm *bprm, struct elfhdr *exec, NEW_AUX_ENT(AT_PLATFORM, (elf_addr_t)(unsigned long)u_platform); } + if (k_base_platform) { + NEW_AUX_ENT(AT_BASE_PLATFORM, + (elf_addr_t)(unsigned long)u_base_platform); + } if (bprm-interp_flags BINPRM_FLAGS_EXECFD) { NEW_AUX_ENT(AT_EXECFD, bprm-interp_data); } -- 1.5.5.1 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
[PATCH 2/2] enable AT_BASE_PLATFORM aux vector for powerpc
Stash the first platform string matched by identify_cpu() in powerpc_base_platform, and supply that to the ELF loader for the value of AT_BASE_PLATFORM. Signed-off-by: Nathan Lynch [EMAIL PROTECTED] --- Note: I wasn't sure how to pick the index value for AT_BASE_PLATFORM; I simply searched include/ for all AT_ values and added 1 to the highest value found. arch/powerpc/kernel/cputable.c | 11 +++ include/asm-powerpc/auxvec.h |3 +++ include/asm-powerpc/cputable.h |2 ++ include/asm-powerpc/elf.h |8 4 files changed, 24 insertions(+), 0 deletions(-) diff --git a/arch/powerpc/kernel/cputable.c b/arch/powerpc/kernel/cputable.c index a9c90be..f1eb632 100644 --- a/arch/powerpc/kernel/cputable.c +++ b/arch/powerpc/kernel/cputable.c @@ -23,6 +23,9 @@ struct cpu_spec* cur_cpu_spec = NULL; EXPORT_SYMBOL(cur_cpu_spec); +/* The platform string corresponding to the real PVR */ +const char *powerpc_base_platform; + /* NOTE: * Unlike ppc32, ppc64 will only call this once for the boot CPU, it's * the responsibility of the appropriate CPU save/restore functions to @@ -1620,6 +1623,14 @@ struct cpu_spec * __init identify_cpu(unsigned long offset, unsigned int pvr) } else *t = *s; *PTRRELOC(cur_cpu_spec) = the_cpu_spec; + + /* +* Set the base platform string once; assumes +* we're called with real pvr first. +*/ + if (powerpc_base_platform == NULL) + powerpc_base_platform = t-platform; + #if defined(CONFIG_PPC64) || defined(CONFIG_BOOKE) /* ppc64 and booke expect identify_cpu to also call * setup_cpu for that processor. I will consolidate diff --git a/include/asm-powerpc/auxvec.h b/include/asm-powerpc/auxvec.h index 19a099b..507f081 100644 --- a/include/asm-powerpc/auxvec.h +++ b/include/asm-powerpc/auxvec.h @@ -16,4 +16,7 @@ */ #define AT_SYSINFO_EHDR33 +/* The actual hardware platform; may differ from AT_PLATFORM */ +#define AT_BASE_PLATFORM 38 + #endif diff --git a/include/asm-powerpc/cputable.h b/include/asm-powerpc/cputable.h index a9c1de3..527152f 100644 --- a/include/asm-powerpc/cputable.h +++ b/include/asm-powerpc/cputable.h @@ -125,6 +125,8 @@ extern struct cpu_spec *identify_cpu(unsigned long offset, unsigned int pvr); extern void do_feature_fixups(unsigned long value, void *fixup_start, void *fixup_end); +extern const char *powerpc_base_platform; + #endif /* __ASSEMBLY__ */ /* CPU kernel features */ diff --git a/include/asm-powerpc/elf.h b/include/asm-powerpc/elf.h index 38a5172..ff0c947 100644 --- a/include/asm-powerpc/elf.h +++ b/include/asm-powerpc/elf.h @@ -237,6 +237,14 @@ extern int dump_task_altivec(struct task_struct *, elf_vrregset_t *vrregs); #define ELF_PLATFORM (cur_cpu_spec-platform) +/* While ELF_PLATFORM indicates the ISA supported by the platform, it + * may not accurately reflect the underlying behavior of the hardware + * (as in the case of running in Power5+ compatibility mode on a + * Power6 machine). ELF_BASE_PLATFORM allows ld.so to load libraries + * that are tuned for the real hardware. + */ +#define ELF_BASE_PLATFORM (powerpc_base_platform) + #ifdef __powerpc64__ # define ELF_PLAT_INIT(_r, load_addr) do {\ _r-gpr[2] = load_addr; \ -- 1.5.5.1 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH 1/2] elf loader support for auxvec base platform string
Mikael Pettersson wrote: Nathan Lynch writes: Some IBM POWER-based platforms have the ability to run in a mode which mostly appears to the OS as a different processor from the actual hardware. For example, a Power6 system may appear to be a Power5+, which makes the AT_PLATFORM value power5+. However, some applications (virtual machines, optimized libraries) can benefit from knowledge of the underlying CPU model. A new aux vector entry, AT_BASE_PLATFORM, will denote the actual hardware. For example, on a Power6 system in Power5+ compatibility mode, AT_PLATFORM will be power5+ and AT_BASE_PLATFORM will be power6. Why on earth would you ever want AT_PLATFORM to differ from AT_BASE_PLATFORM? In cases that matter you admit that AT_BASE_PLATFORM takes precedence, so why involve a fake lame not-quite-the-platform in the first place? Workaround for buggy software? My apologies, I did not explain the motivation well. The idea is that while AT_PLATFORM indicates the instruction set supported, AT_BASE_PLATFORM indicates the underlying microarchitecture. It's not a matter of buggy software, or of one value taking precedence over the other. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH 1/2] elf loader support for auxvec base platform string
Roland McGrath wrote: Well, we use strings to represent the platforms already (ie, the actual CPU microarchitecture). Fitting those into bits would be annoying, it Then use dsocaps. makes sense to have AT_BASE_PLATFORM to be the base variant of AT_PLATFORM. I understand why you think so. But let's not be too abstract. The purpose of the addition is to drive ld.so's selection of libraries, yes? That is one purpose. But there are others (JVMs, performance tools). dsocaps seems to be an ld.so-specific thing... or am I missing how a third-party program would use it? ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH] Add PPC_FEATURE_PMU_COMPAT
Kumar Gala wrote: On Jul 3, 2008, at 6:20 PM, Nathan Lynch wrote: Beginning with Power6, there is a set of 32 PMU events which is compatible across POWER processor lines. PPC_FEATURE_PMU_COMPAT indicates support for this subset. Signed-off-by: Nathan Lynch [EMAIL PROTECTED] --- arch/powerpc/kernel/cputable.c |4 ++-- include/asm-powerpc/cputable.h |1 + 2 files changed, 3 insertions(+), 2 deletions(-) Can you explain what these PMU events are a bit further? Maynard, can you help out here...? :) ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH 1/2] elf loader support for auxvec base platform string
Benjamin Herrenschmidt wrote: On Thu, 2008-07-03 at 19:19 -0700, Roland McGrath wrote: Why not just use ELF_HWCAP for this? It looks like powerpc only has 3 bits left there (keeping it to 32), but 3 is not 0. If not that, why not use dsocaps? That is, some magic in the vDSO, which glibc already supports on all machines where it uses the vDSO. (For how it works, see the use in arch/x86/vdso/vdso32/note.S for CONFIG_XEN.) Well, we use strings to represent the platforms already (ie, the actual CPU microarchitecture). Fitting those into bits would be annoying, it makes sense to have AT_BASE_PLATFORM to be the base variant of AT_PLATFORM. _However_ there is a bug in that this patch adds an entry without bumping the number of entries in the cached array (ie. AT_VECTOR_SIZE_BASE needs to be updated). Ugh, yes. I was hoping to work this in such a way that AT_VECTOR_SIZE (and thus the size of mm_struct) increases only for architectures that implement AT_BASE_PLATFORM... would it be wrong to account for it in AT_VECTOR_SIZE_ARCH? ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
[PATCH v2] Add PPC_FEATURE_PSERIES_PMU_COMPAT
Background from Maynard Johnson: As of POWER6, a set of 32 common events is defined that must be supported on all future POWER processors. The main impetus for this compat set is the need to support partition migration, especially from processor P(n) to processor P(n+1), where performance software that's running in the new partition may not be knowledgeable about processor P(n+1). If a performance tool determines it does not support the physical processor, but is told (via the PPC_FEATURE_PSERIES_PMU_COMPAT bit) that the processor supports the notion of the PMU compat set, then the performance tool can surface just those events to the user of the tool. PPC_FEATURE_PSERIES_PMU_COMPAT indicates that the PMU supports at least this basic subset of events which is compatible across POWER processor lines. Signed-off-by: Nathan Lynch [EMAIL PROTECTED] --- Changes since v1: - make name of feature bit less generic - provide more complete changelog arch/powerpc/kernel/cputable.c |6 -- include/asm-powerpc/cputable.h |1 + 2 files changed, 5 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/kernel/cputable.c b/arch/powerpc/kernel/cputable.c index 817cea1..c4eb377 100644 --- a/arch/powerpc/kernel/cputable.c +++ b/arch/powerpc/kernel/cputable.c @@ -70,10 +70,12 @@ extern void __restore_cpu_power7(void); PPC_FEATURE_SMT | PPC_FEATURE_ICACHE_SNOOP) #define COMMON_USER_POWER6 (COMMON_USER_PPC64 | PPC_FEATURE_ARCH_2_05 |\ PPC_FEATURE_SMT | PPC_FEATURE_ICACHE_SNOOP | \ -PPC_FEATURE_TRUE_LE) +PPC_FEATURE_TRUE_LE | \ +PPC_FEATURE_PSERIES_PMU_COMPAT) #define COMMON_USER_POWER7 (COMMON_USER_PPC64 | PPC_FEATURE_ARCH_2_06 |\ PPC_FEATURE_SMT | PPC_FEATURE_ICACHE_SNOOP | \ -PPC_FEATURE_TRUE_LE) +PPC_FEATURE_TRUE_LE | \ +PPC_FEATURE_PSERIES_PMU_COMPAT) #define COMMON_USER_PA6T (COMMON_USER_PPC64 | PPC_FEATURE_PA6T |\ PPC_FEATURE_TRUE_LE | \ PPC_FEATURE_HAS_ALTIVEC_COMP) diff --git a/include/asm-powerpc/cputable.h b/include/asm-powerpc/cputable.h index 3171ac9..d1492a2 100644 --- a/include/asm-powerpc/cputable.h +++ b/include/asm-powerpc/cputable.h @@ -26,6 +26,7 @@ #define PPC_FEATURE_POWER6_EXT 0x0200 #define PPC_FEATURE_ARCH_2_06 0x0100 #define PPC_FEATURE_HAS_VSX0x0080 +#define PPC_FEATURE_PSERIES_PMU_COMPAT 0x0040 #define PPC_FEATURE_TRUE_LE0x0002 #define PPC_FEATURE_PPC_LE 0x0001 -- 1.5.6.2 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
[PATCH] kill useless SMT code in prom_hold_cpus
I think this code that counts SMT threads and compares against NR_CPUS is an artifact of pre-powerpc-merge ppc64. We care about starting only primary threads in the OF client code. Signed-off-by: Nathan Lynch [EMAIL PROTECTED] --- arch/powerpc/kernel/prom_init.c | 39 +++ 1 files changed, 3 insertions(+), 36 deletions(-) diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c index 1ea8c8d..b1dd86c 100644 --- a/arch/powerpc/kernel/prom_init.c +++ b/arch/powerpc/kernel/prom_init.c @@ -205,8 +205,6 @@ static int __initdata mem_reserve_cnt; static cell_t __initdata regbuf[1024]; -#define MAX_CPU_THREADS 2 - /* * Error results ... some OF calls will return -1 on error, some * will return 0, some will return either. To simplify, here are @@ -1332,10 +1330,6 @@ static void __init prom_hold_cpus(void) unsigned int reg; phandle node; char type[64]; - int cpuid = 0; - unsigned int interrupt_server[MAX_CPU_THREADS]; - unsigned int cpu_threads, hw_cpu_num; - int propsize; struct prom_t *_prom = RELOC(prom); unsigned long *spinloop = (void *) LOW_ADDR(__secondary_hold_spinloop); @@ -1379,7 +1373,6 @@ static void __init prom_hold_cpus(void) reg = -1; prom_getprop(node, reg, reg, sizeof(reg)); - prom_debug(\ncpuid= 0x%x\n, cpuid); prom_debug(cpu hw idx = 0x%x\n, reg); /* Init the acknowledge var which will be reset by @@ -1388,28 +1381,9 @@ static void __init prom_hold_cpus(void) */ *acknowledge = (unsigned long)-1; - propsize = prom_getprop(node, ibm,ppc-interrupt-server#s, - interrupt_server, - sizeof(interrupt_server)); - if (propsize 0) { - /* no property. old hardware has no SMT */ - cpu_threads = 1; - interrupt_server[0] = reg; /* fake it with phys id */ - } else { - /* We have a threaded processor */ - cpu_threads = propsize / sizeof(u32); - if (cpu_threads MAX_CPU_THREADS) { - prom_printf(SMT: too many threads!\n - SMT: found %x, max is %x\n, - cpu_threads, MAX_CPU_THREADS); - cpu_threads = 1; /* ToDo: panic? */ - } - } - - hw_cpu_num = interrupt_server[0]; - if (hw_cpu_num != _prom-cpu) { + if (reg != _prom-cpu) { /* Primary Thread of non-boot cpu */ - prom_printf(%x : starting cpu hw idx %x... , cpuid, reg); + prom_printf(starting cpu hw idx %x... , reg); call_prom(start-cpu, 3, 0, node, secondary_hold, reg); @@ -1424,17 +1398,10 @@ static void __init prom_hold_cpus(void) } #ifdef CONFIG_SMP else - prom_printf(%x : boot cpu %x\n, cpuid, reg); + prom_printf(boot cpu hw idx %x\n, reg); #endif /* CONFIG_SMP */ - - /* Reserve cpu #s for secondary threads. They start later. */ - cpuid += cpu_threads; } - if (cpuid NR_CPUS) - prom_printf(WARNING: maximum CPUs ( __stringify(NR_CPUS) - ) exceeded: ignoring extras\n); - prom_debug(prom_hold_cpus: end...\n); } -- 1.5.6.2 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
[PATCH v3] Add PPC_FEATURE_PSERIES_PERFMON_COMPAT
Background from Maynard Johnson: As of POWER6, a set of 32 common events is defined that must be supported on all future POWER processors. The main impetus for this compat set is the need to support partition migration, especially from processor P(n) to processor P(n+1), where performance software that's running in the new partition may not be knowledgeable about processor P(n+1). If a performance tool determines it does not support the physical processor, but is told (via the PPC_FEATURE_PSERIES_PERFMON_COMPAT bit) that the processor supports the notion of the PMU compat set, then the performance tool can surface just those events to the user of the tool. PPC_FEATURE_PSERIES_PERFMON_COMPAT indicates that the PMU supports at least this basic subset of events which is compatible across POWER processor lines. Signed-off-by: Nathan Lynch [EMAIL PROTECTED] --- Changes since v2: - Further disambiguated name of feature bit (PMU - PERFMON) Changes since v1: - make name of feature bit less generic - provide more complete changelog arch/powerpc/kernel/cputable.c |6 -- include/asm-powerpc/cputable.h |3 +++ 2 files changed, 7 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/kernel/cputable.c b/arch/powerpc/kernel/cputable.c index 817cea1..02088b0 100644 --- a/arch/powerpc/kernel/cputable.c +++ b/arch/powerpc/kernel/cputable.c @@ -70,10 +70,12 @@ extern void __restore_cpu_power7(void); PPC_FEATURE_SMT | PPC_FEATURE_ICACHE_SNOOP) #define COMMON_USER_POWER6 (COMMON_USER_PPC64 | PPC_FEATURE_ARCH_2_05 |\ PPC_FEATURE_SMT | PPC_FEATURE_ICACHE_SNOOP | \ -PPC_FEATURE_TRUE_LE) +PPC_FEATURE_TRUE_LE | \ +PPC_FEATURE_PSERIES_PERFMON_COMPAT) #define COMMON_USER_POWER7 (COMMON_USER_PPC64 | PPC_FEATURE_ARCH_2_06 |\ PPC_FEATURE_SMT | PPC_FEATURE_ICACHE_SNOOP | \ -PPC_FEATURE_TRUE_LE) +PPC_FEATURE_TRUE_LE | \ +PPC_FEATURE_PSERIES_PERFMON_COMPAT) #define COMMON_USER_PA6T (COMMON_USER_PPC64 | PPC_FEATURE_PA6T |\ PPC_FEATURE_TRUE_LE | \ PPC_FEATURE_HAS_ALTIVEC_COMP) diff --git a/include/asm-powerpc/cputable.h b/include/asm-powerpc/cputable.h index 3171ac9..58d4281 100644 --- a/include/asm-powerpc/cputable.h +++ b/include/asm-powerpc/cputable.h @@ -27,6 +27,9 @@ #define PPC_FEATURE_ARCH_2_06 0x0100 #define PPC_FEATURE_HAS_VSX0x0080 +#define PPC_FEATURE_PSERIES_PERFMON_COMPAT \ + 0x0040 + #define PPC_FEATURE_TRUE_LE0x0002 #define PPC_FEATURE_PPC_LE 0x0001 -- 1.5.6.2 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
[PATCH] elf loader support for auxvec base platform string
Some IBM POWER-based platforms have the ability to run in a mode which mostly appears to the OS as a different processor from the actual hardware. For example, a Power6 system may appear to be a Power5+, which makes the AT_PLATFORM value power5+. This means that programs are restricted to the ISA supported by Power5+; Power6-specific instructions are treated as illegal. However, some applications (virtual machines, optimized libraries) can benefit from knowledge of the underlying CPU model. A new aux vector entry, AT_BASE_PLATFORM, will denote the actual hardware. For example, on a Power6 system in Power5+ compatibility mode, AT_PLATFORM will be power5+ and AT_BASE_PLATFORM will be power6. The idea is that AT_PLATFORM indicates the instruction set supported, while AT_BASE_PLATFORM indicates the underlying microarchitecture. If the architecture has defined ELF_BASE_PLATFORM, copy that value to the user stack in the same manner as ELF_PLATFORM. Signed-off-by: Nathan Lynch [EMAIL PROTECTED] --- fs/binfmt_elf.c| 23 +++ include/linux/auxvec.h |5 - 2 files changed, 27 insertions(+), 1 deletions(-) diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c index d48ff5f..834c2c4 100644 --- a/fs/binfmt_elf.c +++ b/fs/binfmt_elf.c @@ -131,6 +131,10 @@ static int padzero(unsigned long elf_bss) #define STACK_ALLOC(sp, len) ({ sp -= len ; sp; }) #endif +#ifndef ELF_BASE_PLATFORM +#define ELF_BASE_PLATFORM NULL +#endif + static int create_elf_tables(struct linux_binprm *bprm, struct elfhdr *exec, unsigned long load_addr, unsigned long interp_load_addr) @@ -142,7 +146,9 @@ create_elf_tables(struct linux_binprm *bprm, struct elfhdr *exec, elf_addr_t __user *envp; elf_addr_t __user *sp; elf_addr_t __user *u_platform; + elf_addr_t __user *u_base_platform; const char *k_platform = ELF_PLATFORM; + const char *k_base_platform = ELF_BASE_PLATFORM; int items; elf_addr_t *elf_info; int ei_index = 0; @@ -172,6 +178,19 @@ create_elf_tables(struct linux_binprm *bprm, struct elfhdr *exec, return -EFAULT; } + /* +* If this architecture has a base platform capability +* string, copy it to userspace. +*/ + u_base_platform = NULL; + if (k_base_platform) { + size_t len = strlen(k_base_platform) + 1; + + u_base_platform = (elf_addr_t __user *)STACK_ALLOC(p, len); + if (__copy_to_user(u_base_platform, k_base_platform, len)) + return -EFAULT; + } + /* Create the ELF interpreter info */ elf_info = (elf_addr_t *)current-mm-saved_auxv; /* update AT_VECTOR_SIZE_BASE if the number of NEW_AUX_ENT() changes */ @@ -208,6 +227,10 @@ create_elf_tables(struct linux_binprm *bprm, struct elfhdr *exec, NEW_AUX_ENT(AT_PLATFORM, (elf_addr_t)(unsigned long)u_platform); } + if (k_base_platform) { + NEW_AUX_ENT(AT_BASE_PLATFORM, + (elf_addr_t)(unsigned long)u_base_platform); + } if (bprm-interp_flags BINPRM_FLAGS_EXECFD) { NEW_AUX_ENT(AT_EXECFD, bprm-interp_data); } diff --git a/include/linux/auxvec.h b/include/linux/auxvec.h index ad89545..1adc61d 100644 --- a/include/linux/auxvec.h +++ b/include/linux/auxvec.h @@ -26,8 +26,11 @@ #define AT_SECURE 23 /* secure mode boolean */ +#define AT_BASE_PLATFORM 38/* string identifying real platform, may +* differ from AT_PLATFORM. */ + #ifdef __KERNEL__ -#define AT_VECTOR_SIZE_BASE (14 + 2) /* NEW_AUX_ENT entries in auxiliary table */ +#define AT_VECTOR_SIZE_BASE (14 + 3) /* NEW_AUX_ENT entries in auxiliary table */ #endif #endif /* _LINUX_AUXVEC_H */ -- 1.5.6.2 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
[PATCH] enable AT_BASE_PLATFORM aux vector for powerpc
Stash the first platform string matched by identify_cpu() in powerpc_base_platform, and supply that to the ELF loader for the value of AT_BASE_PLATFORM. Signed-off-by: Nathan Lynch [EMAIL PROTECTED] --- arch/powerpc/kernel/cputable.c | 11 +++ include/asm-powerpc/cputable.h |2 ++ include/asm-powerpc/elf.h |8 3 files changed, 21 insertions(+), 0 deletions(-) diff --git a/arch/powerpc/kernel/cputable.c b/arch/powerpc/kernel/cputable.c index f7f3c21..89d8731 100644 --- a/arch/powerpc/kernel/cputable.c +++ b/arch/powerpc/kernel/cputable.c @@ -23,6 +23,9 @@ struct cpu_spec* cur_cpu_spec = NULL; EXPORT_SYMBOL(cur_cpu_spec); +/* The platform string corresponding to the real PVR */ +const char *powerpc_base_platform; + /* NOTE: * Unlike ppc32, ppc64 will only call this once for the boot CPU, it's * the responsibility of the appropriate CPU save/restore functions to @@ -1632,6 +1635,14 @@ struct cpu_spec * __init identify_cpu(unsigned long offset, unsigned int pvr) } else *t = *s; *PTRRELOC(cur_cpu_spec) = the_cpu_spec; + + /* +* Set the base platform string once; assumes +* we're called with real pvr first. +*/ + if (powerpc_base_platform == NULL) + powerpc_base_platform = t-platform; + #if defined(CONFIG_PPC64) || defined(CONFIG_BOOKE) /* ppc64 and booke expect identify_cpu to also call * setup_cpu for that processor. I will consolidate diff --git a/include/asm-powerpc/cputable.h b/include/asm-powerpc/cputable.h index 2a3e907..ef8a248 100644 --- a/include/asm-powerpc/cputable.h +++ b/include/asm-powerpc/cputable.h @@ -127,6 +127,8 @@ extern struct cpu_spec *identify_cpu(unsigned long offset, unsigned int pvr); extern void do_feature_fixups(unsigned long value, void *fixup_start, void *fixup_end); +extern const char *powerpc_base_platform; + #endif /* __ASSEMBLY__ */ /* CPU kernel features */ diff --git a/include/asm-powerpc/elf.h b/include/asm-powerpc/elf.h index 8966467..80d1f39 100644 --- a/include/asm-powerpc/elf.h +++ b/include/asm-powerpc/elf.h @@ -217,6 +217,14 @@ typedef elf_vrregset_t elf_fpxregset_t; #define ELF_PLATFORM (cur_cpu_spec-platform) +/* While ELF_PLATFORM indicates the ISA supported by the platform, it + * may not accurately reflect the underlying behavior of the hardware + * (as in the case of running in Power5+ compatibility mode on a + * Power6 machine). ELF_BASE_PLATFORM allows ld.so to load libraries + * that are tuned for the real hardware. + */ +#define ELF_BASE_PLATFORM (powerpc_base_platform) + #ifdef __powerpc64__ # define ELF_PLAT_INIT(_r, load_addr) do {\ _r-gpr[2] = load_addr; \ -- 1.5.6.2 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
AT_BASE_PLATFORM (v2)
Background: Some IBM POWER-based systems have the ability to run in a compatibility mode which mostly appears to the OS as a different processor from the actual hardware. This feature of the platform is useful for live partition migration and for backwards compatibility with old kernels on new hardware. For example, a Power6 system may appear to be a Power5+, which makes the AT_PLATFORM value power5+. Problem: Booting a system in a compatibility mode means that ld.so may load libraries that are inappropriately tuned for the real microarchitecture, and apps that use JIT techniques do not have the right information for generating tuned code. While the AT_PLATFORM auxiliary vector entry correctly indicates the ISA supported, it does not accurately reflect the underlying microarchitecture in this case, and there is no good way for userspace to get this information. Proposed solution: Add an AT_BASE_PLATFORM auxiliary vector entry which indicates the microarchitecture. This entry uses the same string format as AT_PLATFORM, and is readily usable by ld.so and other applications. Other solutions that have been suggested but found wanting: - Use a bit in AT_HWCAP to indicate compat mode -- this is not expressive enough. It's not possible to derive the microarchitecture from the combination of AT_PLATFORM's value and a single bit. - Use dsocaps -- this seems to be a ld.so-specific interface and not easily usable by other programs. ld.so/glibc is not the only program that can use knowledge of the microarchitecture. The following two patches: - add the base support to binfmt_elf.c for AT_BASE_PLATFORM - implement AT_BASE_PLATFORM for powerpc Changes since v1: - increment AT_VECTOR_SIZE_BASE - define AT_BASE_PLATFORM in generic code instead of powerpc ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev