Re: [PATCH] scripts/faddr2line: Fix regression in name resolution on ppc64le
On Tue, Sep 27, 2022 at 01:22:11PM +0530, Srikar Dronamraju wrote: > Commit 1d1a0e7c5100 ("scripts/faddr2line: Fix overlapping text section > failures") > can cause scripts/faddr2line to fail on ppc64le machines on few > distributions, while working on other distributions. The failure can be > attributed to difference in readelf output on various distributions. > > $ ./scripts/faddr2line vmlinux find_busiest_group+0x00 > no match for find_busiest_group+0x00 > > Expected output was: > $ ./scripts/faddr2line vmlinux find_busiest_group+0x00 > find_busiest_group+0x00/0x3d0: > find_busiest_group at kernel/sched/fair.c:9595 > > On ppc64le, readelf adds localentry tag before the symbol name on few > distributions and adds the localentry tag after the symbol name on few > other distributions. This problem has been discussed in the reference > URL given below. This problem can be overcome by filtering out > localentry tags in readelf output. Similar fixes are already present in > kernel by way of commits: > > 1fd6cee127e2 ("libbpf: Fix VERSIONED_SYM_COUNT number parsing") > aa915931ac3e ("libbpf: Fix readelf output parsing for Fedora") > > Fixes: 1d1a0e7c5100 ("scripts/faddr2line: Fix overlapping text section > failures") > Reference: https://lore.kernel.org/bpf/20191211160133.GB4580@calabresa/ > Cc: "Naveen N. Rao" > Cc: Jiri Olsa > Cc: Thadeu Lima de Souza Cascardo Reviewed-by: Thadeu Lima de Souza Cascardo The other instances of readelf --wide on faddr2line use --section-headers and should not required the same mangling. Cascardo. > Cc: Josh Poimboeuf > Cc: Peter Zijlstra (Intel) > Cc: linuxppc-dev > Cc: LKML > Signed-off-by: Srikar Dronamraju > --- > scripts/faddr2line | 7 --- > 1 file changed, 4 insertions(+), 3 deletions(-) > > diff --git a/scripts/faddr2line b/scripts/faddr2line > index 5514c23f45c2..0e73aca4f908 100755 > --- a/scripts/faddr2line > +++ b/scripts/faddr2line > @@ -74,7 +74,8 @@ command -v ${ADDR2LINE} >/dev/null 2>&1 || die > "${ADDR2LINE} isn't installed" > find_dir_prefix() { > local objfile=$1 > > - local start_kernel_addr=$(${READELF} --symbols --wide $objfile | ${AWK} > '$8 == "start_kernel" {printf "0x%s", $2}') > + local start_kernel_addr=$(${READELF} --symbols --wide $objfile | sed > 's/\[.*\]//' | > + ${AWK} '$8 == "start_kernel" {printf "0x%s", $2}') > [[ -z $start_kernel_addr ]] && return > > local file_line=$(${ADDR2LINE} -e $objfile $start_kernel_addr) > @@ -178,7 +179,7 @@ __faddr2line() { > found=2 > break > fi > - done < <(${READELF} --symbols --wide $objfile | ${AWK} -v > sec=$sym_sec '$7 == sec' | sort --key=2) > + done < <(${READELF} --symbols --wide $objfile | sed > 's/\[.*\]//' | ${AWK} -v sec=$sym_sec '$7 == sec' | sort --key=2) > > if [[ $found = 0 ]]; then > warn "can't find symbol: sym_name: $sym_name sym_sec: > $sym_sec sym_addr: $sym_addr sym_elf_size: $sym_elf_size" > @@ -259,7 +260,7 @@ __faddr2line() { > > DONE=1 > > - done < <(${READELF} --symbols --wide $objfile | ${AWK} -v fn=$sym_name > '$4 == "FUNC" && $8 == fn') > + done < <(${READELF} --symbols --wide $objfile | sed 's/\[.*\]//' | > ${AWK} -v fn=$sym_name '$4 == "FUNC" && $8 == fn') > } > > [[ $# -lt 2 ]] && usage > > base-commit: bf682942cd26ce9cd5e87f73ae099b383041e782 > -- > 2.31.1 >
[PATCH] selftests/powerpc/spectre_v2: Return skip code when miss_percent is high
A mis-match between reported and actual mitigation is not restricted to the Vulnerable case. The guest might also report the mitigation as "Software count cache flush" and the host will still mitigate with branch cache disabled. So, instead of skipping depending on the detected mitigation, simply skip whenever the detected miss_percent is the expected one for a fully mitigated system, that is, above 95%. Signed-off-by: Thadeu Lima de Souza Cascardo --- tools/testing/selftests/powerpc/security/spectre_v2.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tools/testing/selftests/powerpc/security/spectre_v2.c b/tools/testing/selftests/powerpc/security/spectre_v2.c index adc2b7294e5f..83647b8277e7 100644 --- a/tools/testing/selftests/powerpc/security/spectre_v2.c +++ b/tools/testing/selftests/powerpc/security/spectre_v2.c @@ -193,7 +193,7 @@ int spectre_v2_test(void) * We are not vulnerable and reporting otherwise, so * missing such a mismatch is safe. */ - if (state == VULNERABLE) + if (miss_percent > 95) return 4; return 1; -- 2.32.0
coherency issue observed after hotplug on POWER8
Hi, there. We have been investigating an issue we have observed on POWER8 POWERNV systems. When running the kernel selftests reuseport_bpf_cpu after a CPU hotplug, we see crashes, in different forms. [1] I managed to get xmon on that trap, and did some debugging. [2] I tried to dump the BPF JIT code, and it looks different when dumped from CPU#0 and CPU#0x9f (the one that was hotplugged, offlined, then onlined). Here is my partial analysis [3]. Basically, the BPF JIT fills a page with invalid instructions (traps, in ppc64 case), and puts the BPF program in a random offset of the page. In the case of the hotplugged CPU, which was the one that compiled the program, the page had the expected contents (BPF program started at the offset used to run the program). On the other CPU (in many cases, CPU #0), the same memory address/page had different contents, with the program starting at a different offset. Is this a case of a bug in the micro-architecture or the firmware when doing the hotplug? Can someone chime in? Notice that we can't reproduce the same issue on a POWER9 system. Thanks. Cascardo. [1] https://bugs.launchpad.net/ubuntu-kernel-tests/+bug/1927076 [2] https://bugs.launchpad.net/ubuntu-kernel-tests/+bug/1927076/comments/29 [3] https://bugs.launchpad.net/ubuntu-kernel-tests/+bug/1927076/comments/30
[PATCH] powerpc/perf: prevent mixed EBB and non-EBB events
EBB events must be under exclusive groups, so there is no mix of EBB and non-EBB events on the same PMU. This requirement worked fine as perf core would not allow other pinned events to be scheduled together with exclusive events. This assumption was broken by commit 1908dc911792 ("perf: Tweak perf_event_attr::exclusive semantics"). After that, the test cpu_event_pinned_vs_ebb_test started succeeding after read_events, but worse, the task would not have given access to PMC1, so when it tried to write to it, it was killed with "illegal instruction". Preventing mixed EBB and non-EBB events from being add to the same PMU will just revert to the previous behavior and the test will succeed. Fixes: 1908dc911792 (perf: Tweak perf_event_attr::exclusive semantics) Signed-off-by: Thadeu Lima de Souza Cascardo --- arch/powerpc/perf/core-book3s.c | 20 1 file changed, 16 insertions(+), 4 deletions(-) diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c index 43599e671d38..d767f7944f85 100644 --- a/arch/powerpc/perf/core-book3s.c +++ b/arch/powerpc/perf/core-book3s.c @@ -1010,9 +1010,25 @@ static int check_excludes(struct perf_event **ctrs, unsigned int cflags[], int n_prev, int n_new) { int eu = 0, ek = 0, eh = 0; + bool ebb = false; int i, n, first; struct perf_event *event; + n = n_prev + n_new; + if (n <= 1) + return 0; + + first = 1; + for (i = 0; i < n; ++i) { + event = ctrs[i]; + if (first) { + ebb = is_ebb_event(event); + first = 0; + } else if (is_ebb_event(event) != ebb) { + return -EAGAIN; + } + } + /* * If the PMU we're on supports per event exclude settings then we * don't need to do any of this logic. NB. This assumes no PMU has both @@ -1021,10 +1037,6 @@ static int check_excludes(struct perf_event **ctrs, unsigned int cflags[], if (ppmu->flags & PPMU_ARCH_207S) return 0; - n = n_prev + n_new; - if (n <= 1) - return 0; - first = 1; for (i = 0; i < n; ++i) { if (cflags[i] & PPMU_LIMITED_PMC_OK) { -- 2.27.0
Re: [PATCH] selftests/seccomp: fix ptrace tests on powerpc
On Thu, Sep 17, 2020 at 03:37:16PM -0700, Kees Cook wrote: > On Sun, Sep 13, 2020 at 10:34:23PM +1000, Michael Ellerman wrote: > > Thadeu Lima de Souza Cascardo writes: > > > On Tue, Sep 08, 2020 at 04:18:17PM -0700, Kees Cook wrote: > > >> On Tue, Jun 30, 2020 at 01:47:39PM -0300, Thadeu Lima de Souza Cascardo > > >> wrote: > > ... > > >> > @@ -1809,10 +1818,15 @@ void tracer_ptrace(struct __test_metadata > > >> > *_metadata, pid_t tracee, > > >> >EXPECT_EQ(entry ? PTRACE_EVENTMSG_SYSCALL_ENTRY > > >> >: PTRACE_EVENTMSG_SYSCALL_EXIT, msg); > > >> > > > >> > - if (!entry) > > >> > + if (!entry && !syscall_nr) > > >> >return; > > >> > > > >> > - nr = get_syscall(_metadata, tracee); > > >> > + if (entry) > > >> > + nr = get_syscall(_metadata, tracee); > > >> > + else > > >> > + nr = *syscall_nr; > > >> > > >> This is weird? Shouldn't get_syscall() be modified to do the right thing > > >> here instead of depending on the extra arg? > > >> > > > > > > R0 might be clobered. Same documentation mentions it as volatile. So, > > > during > > > syscall exit, we can't tell for sure that R0 will have the original > > > syscall > > > number. So, we need to grab it during syscall enter, save it somewhere and > > > reuse it. I used the test context/args for that. > > > > The user r0 (in regs->gpr[0]) shouldn't be clobbered. > > > > But it is modified if the tracer skips the syscall, by setting the > > syscall number to -1. Or if the tracer changes the syscall number. > > > > So if you need the original syscall number in the exit path then I think > > you do need to save it at entry. > > ... the selftest code wants to test the updated syscall (-1 or > whatever), so this sounds correct. Was this test actually failing on > powerpc? (I'd really rather not split entry/exit if I don't have to.) > > -- > Kees Cook Yes, it started failing when the return code started being changed as well. Though ptrace can change the syscall at entry (IIRC), it can't change the return code. And that is part of the ABI. If ppc is changed so it allows changing the return code during ptrace entry, some strace uses will break. So that is not an option. Cascardo.
[PATCH v2] selftests/seccomp: fix ptrace tests on powerpc
As pointed out by Michael Ellerman, the ptrace ABI on powerpc does not allow or require the return code to be set on syscall entry when skipping the syscall. It will always return ENOSYS and the return code must be set on syscall exit. This code does that, behaving more similarly to strace. It still sets the return code on entry, which is overridden on powerpc, and it will always repeat the same on exit. Also, on powerpc, the errno is not inverted, and depends on ccr.so being set. This has been tested on powerpc and amd64. Cc: Michael Ellerman Cc: Kees Cook Signed-off-by: Thadeu Lima de Souza Cascardo --- tools/testing/selftests/seccomp/seccomp_bpf.c | 81 --- 1 file changed, 53 insertions(+), 28 deletions(-) diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c index 7a6d40286a42..0ddc0846e9c0 100644 --- a/tools/testing/selftests/seccomp/seccomp_bpf.c +++ b/tools/testing/selftests/seccomp/seccomp_bpf.c @@ -1837,15 +1837,24 @@ void change_syscall(struct __test_metadata *_metadata, #endif /* If syscall is skipped, change return value. */ - if (syscall == -1) + if (syscall == -1) { #ifdef SYSCALL_NUM_RET_SHARE_REG TH_LOG("Can't modify syscall return on this architecture"); - #elif defined(__xtensa__) regs.SYSCALL_RET(regs) = result; +#elif defined(__powerpc__) + /* Error is signaled by CR0 SO bit and error code is positive. */ + if (result < 0) { + regs.SYSCALL_RET = -result; + regs.ccr |= 0x1000; + } else { + regs.SYSCALL_RET = result; + regs.ccr &= ~0x1000; + } #else regs.SYSCALL_RET = result; #endif + } #ifdef HAVE_GETREGS ret = ptrace(PTRACE_SETREGS, tracee, 0, ); @@ -1897,12 +1906,44 @@ void tracer_seccomp(struct __test_metadata *_metadata, pid_t tracee, } +FIXTURE(TRACE_syscall) { + struct sock_fprog prog; + pid_t tracer, mytid, mypid, parent; +}; + +FIXTURE_VARIANT(TRACE_syscall) { + /* +* All of the SECCOMP_RET_TRACE behaviors can be tested with either +* SECCOMP_RET_TRACE+PTRACE_CONT or plain ptrace()+PTRACE_SYSCALL. +* This indicates if we should use SECCOMP_RET_TRACE (false), or +* ptrace (true). +*/ + bool use_ptrace; + + /* +* Some archs (like ppc) only support changing the return code during +* syscall exit when ptrace is used. As the syscall number might not +* be available anymore during syscall exit, it needs to be saved +* during syscall enter. +*/ + int syscall_nr; +}; + +FIXTURE_VARIANT_ADD(TRACE_syscall, ptrace) { + .use_ptrace = true, +}; + +FIXTURE_VARIANT_ADD(TRACE_syscall, seccomp) { + .use_ptrace = false, +}; + void tracer_ptrace(struct __test_metadata *_metadata, pid_t tracee, int status, void *args) { int ret, nr; unsigned long msg; static bool entry; + FIXTURE_VARIANT(TRACE_syscall) * variant = args; /* * The traditional way to tell PTRACE_SYSCALL entry/exit @@ -1916,10 +1957,15 @@ void tracer_ptrace(struct __test_metadata *_metadata, pid_t tracee, EXPECT_EQ(entry ? PTRACE_EVENTMSG_SYSCALL_ENTRY : PTRACE_EVENTMSG_SYSCALL_EXIT, msg); - if (!entry) + if (!entry && !variant) return; - nr = get_syscall(_metadata, tracee); + if (entry) + nr = get_syscall(_metadata, tracee); + else if (variant) + nr = variant->syscall_nr; + if (variant) + variant->syscall_nr = nr; if (nr == __NR_getpid) change_syscall(_metadata, tracee, __NR_getppid, 0); @@ -1929,29 +1975,6 @@ void tracer_ptrace(struct __test_metadata *_metadata, pid_t tracee, change_syscall(_metadata, tracee, -1, -ESRCH); } -FIXTURE(TRACE_syscall) { - struct sock_fprog prog; - pid_t tracer, mytid, mypid, parent; -}; - -FIXTURE_VARIANT(TRACE_syscall) { - /* -* All of the SECCOMP_RET_TRACE behaviors can be tested with either -* SECCOMP_RET_TRACE+PTRACE_CONT or plain ptrace()+PTRACE_SYSCALL. -* This indicates if we should use SECCOMP_RET_TRACE (false), or -* ptrace (true). -*/ - bool use_ptrace; -}; - -FIXTURE_VARIANT_ADD(TRACE_syscall, ptrace) { - .use_ptrace = true, -}; - -FIXTURE_VARIANT_ADD(TRACE_syscall, seccomp) { - .use_ptrace = false, -}; - FIXTURE_SETUP(TRACE_syscall) { struct sock_filter filter[] = { @@ -1992,7 +2015,9 @@ FIXTURE_SETUP(TRACE_syscall) self->tracer = setup_trace_fixture(_metadata, v
Re: [PATCH] selftests/seccomp: fix ptrace tests on powerpc
On Tue, Sep 08, 2020 at 04:18:17PM -0700, Kees Cook wrote: > On Tue, Jun 30, 2020 at 01:47:39PM -0300, Thadeu Lima de Souza Cascardo wrote: > > As pointed out by Michael Ellerman, the ptrace ABI on powerpc does not > > allow or require the return code to be set on syscall entry when > > skipping the syscall. It will always return ENOSYS and the return code > > must be set on syscall exit. > > > > This code does that, behaving more similarly to strace. It still sets > > the return code on entry, which is overridden on powerpc, and it will > > always repeat the same on exit. Also, on powerpc, the errno is not > > inverted, and depends on ccr.so being set. > > > > This has been tested on powerpc and amd64. > > > > Cc: Michael Ellerman > > Cc: Kees Cook > > Signed-off-by: Thadeu Lima de Souza Cascardo > > Yikes, I missed this from a while ago. I apologize for responding so > late! > > This appears still unfixed; is that correct? > Yes. I will send a v2 on top of recent changes to the test. > > --- > > tools/testing/selftests/seccomp/seccomp_bpf.c | 24 +++ > > 1 file changed, 20 insertions(+), 4 deletions(-) > > > > diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c > > b/tools/testing/selftests/seccomp/seccomp_bpf.c > > index 252140a52553..b90a9190ba88 100644 > > --- a/tools/testing/selftests/seccomp/seccomp_bpf.c > > +++ b/tools/testing/selftests/seccomp/seccomp_bpf.c > > @@ -1738,6 +1738,14 @@ void change_syscall(struct __test_metadata > > *_metadata, > > TH_LOG("Can't modify syscall return on this architecture"); > > #else > > regs.SYSCALL_RET = result; > > +# if defined(__powerpc__) > > + if (result < 0) { > > + regs.SYSCALL_RET = -result; > > + regs.ccr |= 0x1000; > > + } else { > > + regs.ccr &= ~0x1000; > > + } > > +# endif > > #endif > > Just so I understand correctly: for ppc to "see" this result, it needs > to be both negative AND have this specific register set? > Yes. According to Documentation/powerpc/syscall64-abi.rst: " - For the sc instruction, both a value and an error condition are returned. cr0.SO is the error condition, and r3 is the return value. When cr0.SO is clear, the syscall succeeded and r3 is the return value. When cr0.SO is set, the syscall failed and r3 is the error value (that normally corresponds to errno). " So, while some other arches will return -EINVAL, ppc returns EINVAL. And that is distinguished from, say, read(2) returning 22 bytes read, by using CR.SO. > > > > #ifdef HAVE_GETREGS > > @@ -1796,6 +1804,7 @@ void tracer_ptrace(struct __test_metadata *_metadata, > > pid_t tracee, > > int ret, nr; > > unsigned long msg; > > static bool entry; > > + int *syscall_nr = args; > > > > /* > > * The traditional way to tell PTRACE_SYSCALL entry/exit > > @@ -1809,10 +1818,15 @@ void tracer_ptrace(struct __test_metadata > > *_metadata, pid_t tracee, > > EXPECT_EQ(entry ? PTRACE_EVENTMSG_SYSCALL_ENTRY > > : PTRACE_EVENTMSG_SYSCALL_EXIT, msg); > > > > - if (!entry) > > + if (!entry && !syscall_nr) > > return; > > > > - nr = get_syscall(_metadata, tracee); > > + if (entry) > > + nr = get_syscall(_metadata, tracee); > > + else > > + nr = *syscall_nr; > > This is weird? Shouldn't get_syscall() be modified to do the right thing > here instead of depending on the extra arg? > R0 might be clobered. Same documentation mentions it as volatile. So, during syscall exit, we can't tell for sure that R0 will have the original syscall number. So, we need to grab it during syscall enter, save it somewhere and reuse it. I used the test context/args for that. That's the main change I had to deal with after recent changes to the test. I used the variant struct now. I only saw the need to do this under tracer_ptrace, as that was the only one changing syscall return values using ptrace. And that can only be done during syscall exit on ppc (ptrace ABI we can't break). So, changing get_syscall did not seem necessary. Thanks. Cascardo. > > + if (syscall_nr) > > + *syscall_nr = nr; > > > > if (nr == __NR_getpid) > > change_syscall(_metadata, tracee, __NR_getppid, 0); > > @@ -1889,9 +1903,10 @@ TEST_F(TRACE_syscall, ptrace_syscall_redirected) > > > > TEST_F(TRACE_syscall, ptrace_syscall_errno
Re: [PATCH] KVM: PPC: Don't return -ENOTSUPP to userspace in ioctls
On Fri, Sep 11, 2020 at 12:53:45PM +0200, Greg Kurz wrote: > ENOTSUPP is a linux only thingy, the value of which is unknown to > userspace, not to be confused with ENOTSUP which linux maps to > EOPNOTSUPP, as permitted by POSIX [1]: > > [EOPNOTSUPP] > Operation not supported on socket. The type of socket (address family > or protocol) does not support the requested operation. A conforming > implementation may assign the same values for [EOPNOTSUPP] and [ENOTSUP]. > > Return -EOPNOTSUPP instead of -ENOTSUPP for the following ioctls: > - KVM_GET_FPU for Book3s and BookE > - KVM_SET_FPU for Book3s and BookE > - KVM_GET_DIRTY_LOG for BookE > > This doesn't affect QEMU which doesn't call the KVM_GET_FPU and > KVM_SET_FPU ioctls on POWER anyway since they are not supported, > and _buggily_ ignores anything but -EPERM for KVM_GET_DIRTY_LOG. > > [1] https://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html > > Signed-off-by: Greg Kurz Agreed. ENOTSUPP should never be returned to userspace. Acked-by: Thadeu Lima de Souza Cascardo > --- > arch/powerpc/kvm/book3s.c |4 ++-- > arch/powerpc/kvm/booke.c |6 +++--- > 2 files changed, 5 insertions(+), 5 deletions(-) > > diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c > index 1fce9777af1c..44bf567b6589 100644 > --- a/arch/powerpc/kvm/book3s.c > +++ b/arch/powerpc/kvm/book3s.c > @@ -558,12 +558,12 @@ int kvm_arch_vcpu_ioctl_set_regs(struct kvm_vcpu *vcpu, > struct kvm_regs *regs) > > int kvm_arch_vcpu_ioctl_get_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu) > { > - return -ENOTSUPP; > + return -EOPNOTSUPP; > } > > int kvm_arch_vcpu_ioctl_set_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu) > { > - return -ENOTSUPP; > + return -EOPNOTSUPP; > } > > int kvmppc_get_one_reg(struct kvm_vcpu *vcpu, u64 id, > diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c > index 3e1c9f08e302..b1abcb816439 100644 > --- a/arch/powerpc/kvm/booke.c > +++ b/arch/powerpc/kvm/booke.c > @@ -1747,12 +1747,12 @@ int kvmppc_set_one_reg(struct kvm_vcpu *vcpu, u64 id, > > int kvm_arch_vcpu_ioctl_get_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu) > { > - return -ENOTSUPP; > + return -EOPNOTSUPP; > } > > int kvm_arch_vcpu_ioctl_set_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu) > { > - return -ENOTSUPP; > + return -EOPNOTSUPP; > } > > int kvm_arch_vcpu_ioctl_translate(struct kvm_vcpu *vcpu, > @@ -1773,7 +1773,7 @@ void kvm_arch_sync_dirty_log(struct kvm *kvm, struct > kvm_memory_slot *memslot) > > int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log) > { > - return -ENOTSUPP; > + return -EOPNOTSUPP; > } > > void kvmppc_core_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot) > >
Please apply commit 0828137e8f16 ("powerpc/64s: Don't init FSCR_DSCR in __init_FSCR()") to v4.14.y, v4.19.y, v5.4.y, v5.7.y
After commit 912c0a7f2b5daa3cbb2bc10f303981e493de73bd ("powerpc/64s: Save FSCR to init_task.thread.fscr after feature init"), which has been applied to the referred branches, when userspace sets the user DSCR MSR, it won't be inherited or restored during context switch, because the facility unavailable interrupt won't trigger. Applying 0828137e8f16721842468e33df0460044a0c588b ("powerpc/64s: Don't init FSCR_DSCR in __init_FSCR()") will fix it. Cascardo.
Re: [PATCH v2] powerpc/vio: drop bus_type from parent device
On Thu, Jul 30, 2020 at 07:37:16AM +0200, Greg KH wrote: > On Thu, Jul 30, 2020 at 11:28:38AM +1000, Michael Ellerman wrote: > > [ Added Peter & Greg to Cc ] > > > > Thadeu Lima de Souza Cascardo writes: > > > Commit df44b479654f62b478c18ee4d8bc4e9f897a9844 ("kobject: return error > > > code if writing /sys/.../uevent fails") started returning failure when > > > writing to /sys/devices/vio/uevent. > > > > > > This causes an early udevadm trigger to fail. On some installer versions > > > of > > > Ubuntu, this will cause init to exit, thus panicing the system very early > > > during boot. > > > > > > Removing the bus_type from the parent device will remove some of the extra > > > empty files from /sys/devices/vio/, but will keep the rest of the layout > > > for vio devices, keeping them under /sys/devices/vio/. > > > > What exactly does it change? > > > > I'm finding it hard to evaluate if this change is going to cause a > > regression somehow. > > > > I'm also not clear on why removing the bus type is correct, apart from > > whether it fixes the bug you're seeing. > > > > > It has been tested that uevents for vio devices don't change after this > > > fix, they still contain MODALIAS. > > > > > > Signed-off-by: Thadeu Lima de Souza Cascardo > > > Fixes: df44b479654f ("kobject: return error code if writing > > > /sys/.../uevent fails") > > > > AFAICS there haven't been any other fixes for that commit. Do we know > > why it is only vio that was affected? (possibly because it's a fake bus > > to begin with?) > > So there was an error previously, the core was ignoring it, and now it > isn't and to fix that you want to remove describing what bus a device is > on? > > Huh??? > > > > > cheers > > > > > diff --git a/arch/powerpc/platforms/pseries/vio.c > > > b/arch/powerpc/platforms/pseries/vio.c > > > index 37f1f25ba804..a94dab3972a0 100644 > > > --- a/arch/powerpc/platforms/pseries/vio.c > > > +++ b/arch/powerpc/platforms/pseries/vio.c > > > @@ -36,7 +36,6 @@ static struct vio_dev vio_bus_device = { /* fake > > > "parent" device */ > > > .name = "vio", > > > .type = "", > > > .dev.init_name = "vio", > > > - .dev.bus = _bus_type, > > > }; > > Wait, a static 'struct device'? You all are playing with fire there. > That's a reference counted object, and should never be declared like > that at all. > > I see you register it, but never unregister it, why? Why is it even > needed? > > And if you remove the bus type of it, it will show up in a different > part of sysfs, so I think this patch will show a user-visable change, > right? > > thanks, > > greg k-h As the comment says, it's a "fake parent device". There is a user-visible change, which is removing some attributes from the object, but it's still showing up on the same path. Returning an error code like df44b479654f does is also a user visible change and it breaks installer images that panic early on boot. I could investigate an alternative here, which would be not fail when writing to uevent for this specific fake device. Cascardo.
[PATCH] selftests/powerpc: return skip code for spectre_v2
When running under older versions of qemu of under newer versions with old machine types, some security features will not be reported to the guest. This will lead the guest OS to consider itself Vulnerable to spectre_v2. So, spectre_v2 test fails in such cases when the host is mitigated and miss predictions cannot be detected as expected by the test. Make it return the skip code instead, for this particular case. We don't want to miss the case when the test fails and the system reports as mitigated or not affected. But it is not a problem to miss failures when the system reports as Vulnerable. Signed-off-by: Thadeu Lima de Souza Cascardo --- tools/testing/selftests/powerpc/security/spectre_v2.c | 8 1 file changed, 8 insertions(+) diff --git a/tools/testing/selftests/powerpc/security/spectre_v2.c b/tools/testing/selftests/powerpc/security/spectre_v2.c index 8c6b982af2a8..d5445bfd63ed 100644 --- a/tools/testing/selftests/powerpc/security/spectre_v2.c +++ b/tools/testing/selftests/powerpc/security/spectre_v2.c @@ -183,6 +183,14 @@ int spectre_v2_test(void) if (miss_percent > 15) { printf("Branch misses > 15%% unexpected in this configuration!\n"); printf("Possible mis-match between reported & actual mitigation\n"); + /* Such a mismatch may be caused by a guest system +* reporting as vulnerable when the host is mitigated. +* Return skip code to avoid detecting this as an +* error. We are not vulnerable and reporting otherwise, +* so missing such a mismatch is safe. +*/ + if (state == VULNERABLE) + return 4; return 1; } break; -- 2.25.1
[PATCH] selftests/seccomp: fix ptrace tests on powerpc
As pointed out by Michael Ellerman, the ptrace ABI on powerpc does not allow or require the return code to be set on syscall entry when skipping the syscall. It will always return ENOSYS and the return code must be set on syscall exit. This code does that, behaving more similarly to strace. It still sets the return code on entry, which is overridden on powerpc, and it will always repeat the same on exit. Also, on powerpc, the errno is not inverted, and depends on ccr.so being set. This has been tested on powerpc and amd64. Cc: Michael Ellerman Cc: Kees Cook Signed-off-by: Thadeu Lima de Souza Cascardo --- tools/testing/selftests/seccomp/seccomp_bpf.c | 24 +++ 1 file changed, 20 insertions(+), 4 deletions(-) diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c index 252140a52553..b90a9190ba88 100644 --- a/tools/testing/selftests/seccomp/seccomp_bpf.c +++ b/tools/testing/selftests/seccomp/seccomp_bpf.c @@ -1738,6 +1738,14 @@ void change_syscall(struct __test_metadata *_metadata, TH_LOG("Can't modify syscall return on this architecture"); #else regs.SYSCALL_RET = result; +# if defined(__powerpc__) + if (result < 0) { + regs.SYSCALL_RET = -result; + regs.ccr |= 0x1000; + } else { + regs.ccr &= ~0x1000; + } +# endif #endif #ifdef HAVE_GETREGS @@ -1796,6 +1804,7 @@ void tracer_ptrace(struct __test_metadata *_metadata, pid_t tracee, int ret, nr; unsigned long msg; static bool entry; + int *syscall_nr = args; /* * The traditional way to tell PTRACE_SYSCALL entry/exit @@ -1809,10 +1818,15 @@ void tracer_ptrace(struct __test_metadata *_metadata, pid_t tracee, EXPECT_EQ(entry ? PTRACE_EVENTMSG_SYSCALL_ENTRY : PTRACE_EVENTMSG_SYSCALL_EXIT, msg); - if (!entry) + if (!entry && !syscall_nr) return; - nr = get_syscall(_metadata, tracee); + if (entry) + nr = get_syscall(_metadata, tracee); + else + nr = *syscall_nr; + if (syscall_nr) + *syscall_nr = nr; if (nr == __NR_getpid) change_syscall(_metadata, tracee, __NR_getppid, 0); @@ -1889,9 +1903,10 @@ TEST_F(TRACE_syscall, ptrace_syscall_redirected) TEST_F(TRACE_syscall, ptrace_syscall_errno) { + int syscall_nr = -1; /* Swap SECCOMP_RET_TRACE tracer for PTRACE_SYSCALL tracer. */ teardown_trace_fixture(_metadata, self->tracer); - self->tracer = setup_trace_fixture(_metadata, tracer_ptrace, NULL, + self->tracer = setup_trace_fixture(_metadata, tracer_ptrace, _nr, true); /* Tracer should skip the open syscall, resulting in ESRCH. */ @@ -1900,9 +1915,10 @@ TEST_F(TRACE_syscall, ptrace_syscall_errno) TEST_F(TRACE_syscall, ptrace_syscall_faked) { + int syscall_nr = -1; /* Swap SECCOMP_RET_TRACE tracer for PTRACE_SYSCALL tracer. */ teardown_trace_fixture(_metadata, self->tracer); - self->tracer = setup_trace_fixture(_metadata, tracer_ptrace, NULL, + self->tracer = setup_trace_fixture(_metadata, tracer_ptrace, _nr, true); /* Tracer should skip the gettid syscall, resulting fake pid. */ -- 2.25.1
[PATCH v2] powerpc/vio: drop bus_type from parent device
Commit df44b479654f62b478c18ee4d8bc4e9f897a9844 ("kobject: return error code if writing /sys/.../uevent fails") started returning failure when writing to /sys/devices/vio/uevent. This causes an early udevadm trigger to fail. On some installer versions of Ubuntu, this will cause init to exit, thus panicing the system very early during boot. Removing the bus_type from the parent device will remove some of the extra empty files from /sys/devices/vio/, but will keep the rest of the layout for vio devices, keeping them under /sys/devices/vio/. It has been tested that uevents for vio devices don't change after this fix, they still contain MODALIAS. Signed-off-by: Thadeu Lima de Souza Cascardo Fixes: df44b479654f ("kobject: return error code if writing /sys/.../uevent fails") --- arch/powerpc/platforms/pseries/vio.c | 1 - 1 file changed, 1 deletion(-) diff --git a/arch/powerpc/platforms/pseries/vio.c b/arch/powerpc/platforms/pseries/vio.c index 37f1f25ba804..a94dab3972a0 100644 --- a/arch/powerpc/platforms/pseries/vio.c +++ b/arch/powerpc/platforms/pseries/vio.c @@ -36,7 +36,6 @@ static struct vio_dev vio_bus_device = { /* fake "parent" device */ .name = "vio", .type = "", .dev.init_name = "vio", - .dev.bus = _bus_type, }; #ifdef CONFIG_PPC_SMLPAR -- 2.20.1
[PATCH v2] powerpc/ptrace: Do not return ENOSYS if invalid syscall
If a tracer sets the syscall number to an invalid one, allow the return value set by the tracer to be returned the tracee. The test for NR_syscalls is already at entry_64.S, and it's at do_syscall_trace_enter only to skip audit and trace. After this, two failures from seccomp_bpf selftests complete just fine, as the failing test was using ptrace to change the syscall to return an error or a fake value, but were failing as it was always returning -ENOSYS. Signed-off-by: Thadeu Lima de Souza Cascardo --- arch/powerpc/kernel/ptrace.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/powerpc/kernel/ptrace.c b/arch/powerpc/kernel/ptrace.c index 25c0424e8868..557ae4bc2331 100644 --- a/arch/powerpc/kernel/ptrace.c +++ b/arch/powerpc/kernel/ptrace.c @@ -3314,7 +3314,7 @@ long do_syscall_trace_enter(struct pt_regs *regs) /* Avoid trace and audit when syscall is invalid. */ if (regs->gpr[0] >= NR_syscalls) - goto skip; + return regs->gpr[0]; if (unlikely(test_thread_flag(TIF_SYSCALL_TRACEPOINT))) trace_sys_enter(regs, regs->gpr[0]); -- 2.17.1
[PATCH] libbpf: Fix readelf output parsing for Fedora
Fedora binutils has been patched to show "other info" for a symbol at the end of the line. This was done in order to support unmaintained scripts that would break with the extra info. [1] [1] https://src.fedoraproject.org/rpms/binutils/c/b8265c46f7ddae23a792ee8306fbaaeacba83bf8 This in turn has been done to fix the build of ruby, because of checksec. [2] Thanks Michael Ellerman for the pointer. [2] https://bugzilla.redhat.com/show_bug.cgi?id=1479302 As libbpf Makefile is not unmaintained, we can simply deal with either output format, by just removing the "other info" field, as it always comes inside brackets. Cc: Aurelien Jarno Fixes: 3464afdf11f9 (libbpf: Fix readelf output parsing on powerpc with recent binutils) Reported-by: Justin Forbes Signed-off-by: Thadeu Lima de Souza Cascardo --- tools/lib/bpf/Makefile | 2 ++ 1 file changed, 2 insertions(+) diff --git a/tools/lib/bpf/Makefile b/tools/lib/bpf/Makefile index defae23a0169..23ae06c43d08 100644 --- a/tools/lib/bpf/Makefile +++ b/tools/lib/bpf/Makefile @@ -147,6 +147,7 @@ TAGS_PROG := $(if $(shell which etags 2>/dev/null),etags,ctags) GLOBAL_SYM_COUNT = $(shell readelf -s --wide $(BPF_IN_SHARED) | \ cut -d "@" -f1 | sed 's/_v[0-9]_[0-9]_[0-9].*//' | \ + sed 's/\[.*\]//' | \ awk '/GLOBAL/ && /DEFAULT/ && !/UND/ {print $$NF}' | \ sort -u | wc -l) VERSIONED_SYM_COUNT = $(shell readelf -s --wide $(OUTPUT)libbpf.so | \ @@ -213,6 +214,7 @@ check_abi: $(OUTPUT)libbpf.so "versioned in $(VERSION_SCRIPT)." >&2; \ readelf -s --wide $(BPF_IN_SHARED) | \ cut -d "@" -f1 | sed 's/_v[0-9]_[0-9]_[0-9].*//' | \ + sed 's/\[.*\]//' | \ awk '/GLOBAL/ && /DEFAULT/ && !/UND/ {print $$NF}'| \ sort -u > $(OUTPUT)libbpf_global_syms.tmp; \ readelf -s --wide $(OUTPUT)libbpf.so | \ -- 2.24.0
Re: [PATCH] libbpf: fix readelf output parsing on powerpc with recent binutils
On Wed, Dec 11, 2019 at 09:33:53AM -0600, Justin Forbes wrote: > On Tue, Dec 10, 2019 at 4:26 PM Thadeu Lima de Souza Cascardo > wrote: > > > > On Tue, Dec 10, 2019 at 12:58:33PM -0600, Justin Forbes wrote: > > > On Mon, Dec 2, 2019 at 3:37 AM Daniel Borkmann > > > wrote: > > > > > > > > On Mon, Dec 02, 2019 at 04:53:26PM +1100, Michael Ellerman wrote: > > > > > Aurelien Jarno writes: > > > > > > On powerpc with recent versions of binutils, readelf outputs an > > > > > > extra > > > > > > field when dumping the symbols of an object file. For example: > > > > > > > > > > > > 35: 083896 FUNCLOCAL DEFAULT > > > > > > [: 8] 1 btf_is_struct > > > > > > > > > > > > The extra "[: 8]" prevents the GLOBAL_SYM_COUNT > > > > > > variable to > > > > > > be computed correctly and causes the checkabi target to fail. > > > > > > > > > > > > Fix that by looking for the symbol name in the last field instead > > > > > > of the > > > > > > 8th one. This way it should also cope with future extra fields. > > > > > > > > > > > > Signed-off-by: Aurelien Jarno > > > > > > --- > > > > > > tools/lib/bpf/Makefile | 4 ++-- > > > > > > 1 file changed, 2 insertions(+), 2 deletions(-) > > > > > > > > > > Thanks for fixing that, it's been on my very long list of test > > > > > failures > > > > > for a while. > > > > > > > > > > Tested-by: Michael Ellerman > > > > > > > > Looks good & also continues to work on x86. Applied, thanks! > > > > > > This actually seems to break horribly on PPC64le with binutils 2.33.1 > > > resulting in: > > > Warning: Num of global symbols in sharedobjs/libbpf-in.o (32) does NOT > > > match with num of versioned symbols in libbpf.so (184). Please make > > > sure all LIBBPF_API symbols are versioned in libbpf.map. > > > > > > This is the only arch that fails, with x86/arm/aarch64/s390 all > > > building fine. Reverting this patch allows successful build across > > > all arches. > > > > > > Justin > > > > Well, I ended up debugging this same issue and had the same fix as Jarno's > > when > > I noticed his fix was already applied. > > > > I just installed a system with the latest binutils, 2.33.1, and it still > > breaks > > without such fix. Can you tell what is the output of the following command > > on > > your system? > > > > readelf -s --wide tools/lib/bpf/sharedobjs/libbpf-in.o | cut -d "@" -f1 | > > sed 's/_v[0-9]_[0-9]_[0-9].*//' | awk '/GLOBAL/ && /DEFAULT/ && !/UND/ > > {print $0}' > > > > readelf -s --wide tools/lib/bpf/sharedobjs/libbpf-in.o | cut -d "@" > -f1 | sed 's/_v[0-9]_[0-9]_[0-9].*//' | awk '/GLOBAL/ && /DEFAULT/ && > !/UND/ {print $0}' >373: 000141bc 1376 FUNCGLOBAL DEFAULT1 > libbpf_num_possible_cpus [: 8] >375: 0001869c 176 FUNCGLOBAL DEFAULT1 btf__free > [: 8] [...] This is a patch on binutils carried by Fedora: https://src.fedoraproject.org/rpms/binutils/c/b8265c46f7ddae23a792ee8306fbaaeacba83bf8 " b8265c Have readelf display extra symbol information at the end of the line. " It has the following comment: # FIXME:The proper fix would be to update the scripts that are expecting # a fixed output from readelf. But it seems that some of them are # no longer being maintained. This commit is from 2017, had it been on binutils upstream, maybe the situation right now would be different. Honestly, it seems the best way out is to filter the other information in the libbpf Makefile. Does the following patch work for you? diff --git a/tools/lib/bpf/Makefile b/tools/lib/bpf/Makefile index 56ce6292071b..e6f99484d7d5 100644 --- a/tools/lib/bpf/Makefile +++ b/tools/lib/bpf/Makefile @@ -145,6 +145,7 @@ PC_FILE := $(addprefix $(OUTPUT),$(PC_FILE)) GLOBAL_SYM_COUNT = $(shell readelf -s --wide $(BPF_IN_SHARED) | \ cut -d "@" -f1 | sed 's/_v[0-9]_[0-9]_[0-9].*//' | \ + sed 's/\[.*\]//' | \ awk '/GLOBAL/ && /DEFAULT/ && !/UND/ {print $$8}' | \ sort -u | wc -l) VERSIONED_SYM_COUNT = $(shell readelf -s --wide $(OUTPUT)libbpf.so | \ @@ -217,6 +218,7 @@ check_abi: $(OUTPUT)libbpf.so "versioned in $(VERSION_SCRIPT)." >&2; \ readelf -s --wide $(OUTPUT)libbpf-in.o | \ cut -d "@" -f1 | sed 's/_v[0-9]_[0-9]_[0-9].*//' | \ + sed 's/\[.*\]//' | \ awk '/GLOBAL/ && /DEFAULT/ && !/UND/ {print $$8}'| \ sort -u > $(OUTPUT)libbpf_global_syms.tmp; \ readelf -s --wide $(OUTPUT)libbpf.so | \
Re: [PATCH] libbpf: fix readelf output parsing on powerpc with recent binutils
On Tue, Dec 10, 2019 at 12:58:33PM -0600, Justin Forbes wrote: > On Mon, Dec 2, 2019 at 3:37 AM Daniel Borkmann wrote: > > > > On Mon, Dec 02, 2019 at 04:53:26PM +1100, Michael Ellerman wrote: > > > Aurelien Jarno writes: > > > > On powerpc with recent versions of binutils, readelf outputs an extra > > > > field when dumping the symbols of an object file. For example: > > > > > > > > 35: 083896 FUNCLOCAL DEFAULT [: 8] > > > > 1 btf_is_struct > > > > > > > > The extra "[: 8]" prevents the GLOBAL_SYM_COUNT variable to > > > > be computed correctly and causes the checkabi target to fail. > > > > > > > > Fix that by looking for the symbol name in the last field instead of the > > > > 8th one. This way it should also cope with future extra fields. > > > > > > > > Signed-off-by: Aurelien Jarno > > > > --- > > > > tools/lib/bpf/Makefile | 4 ++-- > > > > 1 file changed, 2 insertions(+), 2 deletions(-) > > > > > > Thanks for fixing that, it's been on my very long list of test failures > > > for a while. > > > > > > Tested-by: Michael Ellerman > > > > Looks good & also continues to work on x86. Applied, thanks! > > This actually seems to break horribly on PPC64le with binutils 2.33.1 > resulting in: > Warning: Num of global symbols in sharedobjs/libbpf-in.o (32) does NOT > match with num of versioned symbols in libbpf.so (184). Please make > sure all LIBBPF_API symbols are versioned in libbpf.map. > > This is the only arch that fails, with x86/arm/aarch64/s390 all > building fine. Reverting this patch allows successful build across > all arches. > > Justin Well, I ended up debugging this same issue and had the same fix as Jarno's when I noticed his fix was already applied. I just installed a system with the latest binutils, 2.33.1, and it still breaks without such fix. Can you tell what is the output of the following command on your system? readelf -s --wide tools/lib/bpf/sharedobjs/libbpf-in.o | cut -d "@" -f1 | sed 's/_v[0-9]_[0-9]_[0-9].*//' | awk '/GLOBAL/ && /DEFAULT/ && !/UND/ {print $0}' Cascardo.
[PATCH] powerpc/vio: drop bus_type from parent device
Commit df44b479654f62b478c18ee4d8bc4e9f897a9844 ("kobject: return error code if writing /sys/.../uevent fails") started returning failure when writing to /sys/devices/vio/uevent. This causes an early udevadm trigger to fail. On some installer versions of Ubuntu, this will cause init to exit, thus panicing the system very early during boot. Removing the bus_type from the parent device will remove some of the extra empty files from /sys/devices/vio/, but will keep the rest of the layout for vio devices, keeping them under /sys/devices/vio/. It has been tested that uevents for vio devices don't change after this fix, they still contain MODALIAS. Signed-off-by: Thadeu Lima de Souza Cascardo --- arch/powerpc/platforms/pseries/vio.c | 1 - 1 file changed, 1 deletion(-) diff --git a/arch/powerpc/platforms/pseries/vio.c b/arch/powerpc/platforms/pseries/vio.c index 6601b9d404dc..d570d5494f30 100644 --- a/arch/powerpc/platforms/pseries/vio.c +++ b/arch/powerpc/platforms/pseries/vio.c @@ -36,7 +36,6 @@ static struct vio_dev vio_bus_device = { /* fake "parent" device */ .name = "vio", .type = "", .dev.init_name = "vio", - .dev.bus = _bus_type, }; #ifdef CONFIG_PPC_SMLPAR -- 2.20.1
Re: [PATCH] powerpc/ptrace: Do not return ENOSYS if invalid syscall
On Tue, Sep 10, 2019 at 10:01:22PM -0300, Thadeu Lima de Souza Cascardo wrote: > If a tracer sets the syscall number to an invalid one, allow the return > value set by the tracer to be returned the tracee. > > The test for NR_syscalls is already at entry_64.S, and it's at > do_syscall_trace_enter only to skip audit and trace. > > After this, seccomp_bpf selftests complete just fine, as the failing test > was using ptrace to change the syscall to return an error or a fake value, > but were failing as it was always returning -ENOSYS. > > Signed-off-by: Thadeu Lima de Souza Cascardo > --- > arch/powerpc/kernel/ptrace.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/arch/powerpc/kernel/ptrace.c b/arch/powerpc/kernel/ptrace.c > index 8c92febf5f44..87315335f66a 100644 > --- a/arch/powerpc/kernel/ptrace.c > +++ b/arch/powerpc/kernel/ptrace.c > @@ -3316,7 +3316,7 @@ long do_syscall_trace_enter(struct pt_regs *regs) > > /* Avoid trace and audit when syscall is invalid. */ > if (regs->gpr[0] >= NR_syscalls) > - goto skip; > + return regs->gpr[0]; > > if (unlikely(test_thread_flag(TIF_SYSCALL_TRACEPOINT))) > trace_sys_enter(regs, regs->gpr[0]); Ping? Any comments on this? Thanks. Cascardo.
[PATCH] powerpc/ptrace: Do not return ENOSYS if invalid syscall
If a tracer sets the syscall number to an invalid one, allow the return value set by the tracer to be returned the tracee. The test for NR_syscalls is already at entry_64.S, and it's at do_syscall_trace_enter only to skip audit and trace. After this, seccomp_bpf selftests complete just fine, as the failing test was using ptrace to change the syscall to return an error or a fake value, but were failing as it was always returning -ENOSYS. Signed-off-by: Thadeu Lima de Souza Cascardo --- arch/powerpc/kernel/ptrace.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/powerpc/kernel/ptrace.c b/arch/powerpc/kernel/ptrace.c index 8c92febf5f44..87315335f66a 100644 --- a/arch/powerpc/kernel/ptrace.c +++ b/arch/powerpc/kernel/ptrace.c @@ -3316,7 +3316,7 @@ long do_syscall_trace_enter(struct pt_regs *regs) /* Avoid trace and audit when syscall is invalid. */ if (regs->gpr[0] >= NR_syscalls) - goto skip; + return regs->gpr[0]; if (unlikely(test_thread_flag(TIF_SYSCALL_TRACEPOINT))) trace_sys_enter(regs, regs->gpr[0]); -- 2.20.1
Re: [PATCH v3] ppc64 boot: Wait for boot cpu to show up if nr_cpus limit is about to hit.
Ping. Any further reviews/comments? Or any chance of this getting applied soon? Thanks. Cascardo.
Re: [PATCH v3] ppc64 boot: Wait for boot cpu to show up if nr_cpus limit is about to hit.
On Thu, Nov 16, 2017 at 07:26:43PM -0200, Guilherme G. Piccoli wrote: > On 11/06/2017 03:34 PM, Thadeu Lima de Souza Cascardo wrote: > > From: Mahesh Salgaonkar <mah...@linux.vnet.ibm.com> > > > > The kernel boot parameter 'nr_cpus=' allows one to specify number of > > possible cpus in the system. In the normal scenario the first cpu (cpu0) > > that shows up is the boot cpu and hence it gets covered under nr_cpus > > limit. > > > > But this assumption will be broken in kdump scenario where kdump kenrel > > Minor nit: s/kenrel/kernel > > > after a crash can boot up on an non-zero boot cpu. The paca structure > > allocation depends on value of nr_cpus and is indexed using logical cpu > > ids. This definetly will be an issue if boot cpu id > nr_cpus > > Minor nit (2): s/definetly/definitely Hi, Guilherme. Thanks a lot for testing and reviewing the patch. I will ignore those small typos, unless the maintainers tell me to fix them. I hope you don't mind. I would like to see this merged sooner rather than later. Thanks. Cascardo. > > > > This patch modifies allocate_pacas() and smp_setup_cpu_maps() to > > accommodate boot cpu for the case where boot_cpuid > nr_cpu_ids. > > > > This change would help to reduce the memory reservation requirement for > > kdump on ppc64. > > > > Signed-off-by: Mahesh Salgaonkar <mah...@linux.vnet.ibm.com> > > Signed-off-by: Thadeu Lima de Souza Cascardo <casca...@canonical.com> > > Tested-by: Guilherme G. Piccoli <gpicc...@linux.vnet.ibm.com> > > Without this patch, got a crash deterministically when booting with > nr_cpus=1 in P8 bare-metal. The patch fixes the issue... > > Thanks, > > > Guilherme > > > > --- > > > > v3: fixup signedness or nr_cpus to match nr_cpu_ids > > and fix conflict due to change from %d to %u > > > > Resending this as it was not applied, and I can reproduce the issue with > > v4.14-rc8 when booting a kdump kernel after a crash that has been given > > nr_cpus=1 as a parameter. With this patch, I can't reproduce it anymore. > > > > --- > > arch/powerpc/include/asm/paca.h| 3 +++ > > arch/powerpc/include/asm/smp.h | 1 + > > arch/powerpc/kernel/paca.c | 23 +- > > arch/powerpc/kernel/prom.c | 39 > > +- > > arch/powerpc/kernel/setup-common.c | 25 > > 5 files changed, 85 insertions(+), 6 deletions(-) > > > > diff --git a/arch/powerpc/include/asm/paca.h > > b/arch/powerpc/include/asm/paca.h > > index 04b60af027ae..ea0dbf2bbeef 100644 > > --- a/arch/powerpc/include/asm/paca.h > > +++ b/arch/powerpc/include/asm/paca.h > > @@ -49,6 +49,9 @@ extern unsigned int debug_smp_processor_id(void); /* from > > linux/smp.h */ > > #define get_lppaca() (get_paca()->lppaca_ptr) > > #define get_slb_shadow() (get_paca()->slb_shadow_ptr) > > > > +/* Maximum number of threads per core. */ > > +#defineMAX_SMT 8 > > + > > struct task_struct; > > > > /* > > diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h > > index fac963e10d39..553cd22b2ccc 100644 > > --- a/arch/powerpc/include/asm/smp.h > > +++ b/arch/powerpc/include/asm/smp.h > > @@ -30,6 +30,7 @@ > > #include > > > > extern int boot_cpuid; > > +extern int boot_hw_cpuid; > > extern int spinning_secondaries; > > > > extern void cpu_die(void); > > diff --git a/arch/powerpc/kernel/paca.c b/arch/powerpc/kernel/paca.c > > index 2ff2b8a19f71..9c689ee4b6a3 100644 > > --- a/arch/powerpc/kernel/paca.c > > +++ b/arch/powerpc/kernel/paca.c > > @@ -207,6 +207,7 @@ void __init allocate_pacas(void) > > { > > u64 limit; > > int cpu; > > + unsigned int nr_cpus; > > > > limit = ppc64_rma_size; > > > > @@ -219,20 +220,32 @@ void __init allocate_pacas(void) > > limit = min(0x1000ULL, limit); > > #endif > > > > - paca_size = PAGE_ALIGN(sizeof(struct paca_struct) * nr_cpu_ids); > > + /* > > +* Always align up the nr_cpu_ids to SMT threads and allocate > > +* the paca. This will help us to prepare for a situation where > > +* boot cpu id > nr_cpus_id. We will use the last nthreads > > +* slots (nthreads == threads per core) to accommodate a core > > +* that contains boot cpu thread. > > +* > > +* Do not change nr_cpu_ids value here. Let us do that in > &
[PATCH v3] ppc64 boot: Wait for boot cpu to show up if nr_cpus limit is about to hit.
From: Mahesh Salgaonkar <mah...@linux.vnet.ibm.com> The kernel boot parameter 'nr_cpus=' allows one to specify number of possible cpus in the system. In the normal scenario the first cpu (cpu0) that shows up is the boot cpu and hence it gets covered under nr_cpus limit. But this assumption will be broken in kdump scenario where kdump kenrel after a crash can boot up on an non-zero boot cpu. The paca structure allocation depends on value of nr_cpus and is indexed using logical cpu ids. This definetly will be an issue if boot cpu id > nr_cpus This patch modifies allocate_pacas() and smp_setup_cpu_maps() to accommodate boot cpu for the case where boot_cpuid > nr_cpu_ids. This change would help to reduce the memory reservation requirement for kdump on ppc64. Signed-off-by: Mahesh Salgaonkar <mah...@linux.vnet.ibm.com> Signed-off-by: Thadeu Lima de Souza Cascardo <casca...@canonical.com> --- v3: fixup signedness or nr_cpus to match nr_cpu_ids and fix conflict due to change from %d to %u Resending this as it was not applied, and I can reproduce the issue with v4.14-rc8 when booting a kdump kernel after a crash that has been given nr_cpus=1 as a parameter. With this patch, I can't reproduce it anymore. --- arch/powerpc/include/asm/paca.h| 3 +++ arch/powerpc/include/asm/smp.h | 1 + arch/powerpc/kernel/paca.c | 23 +- arch/powerpc/kernel/prom.c | 39 +- arch/powerpc/kernel/setup-common.c | 25 5 files changed, 85 insertions(+), 6 deletions(-) diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h index 04b60af027ae..ea0dbf2bbeef 100644 --- a/arch/powerpc/include/asm/paca.h +++ b/arch/powerpc/include/asm/paca.h @@ -49,6 +49,9 @@ extern unsigned int debug_smp_processor_id(void); /* from linux/smp.h */ #define get_lppaca() (get_paca()->lppaca_ptr) #define get_slb_shadow() (get_paca()->slb_shadow_ptr) +/* Maximum number of threads per core. */ +#defineMAX_SMT 8 + struct task_struct; /* diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h index fac963e10d39..553cd22b2ccc 100644 --- a/arch/powerpc/include/asm/smp.h +++ b/arch/powerpc/include/asm/smp.h @@ -30,6 +30,7 @@ #include extern int boot_cpuid; +extern int boot_hw_cpuid; extern int spinning_secondaries; extern void cpu_die(void); diff --git a/arch/powerpc/kernel/paca.c b/arch/powerpc/kernel/paca.c index 2ff2b8a19f71..9c689ee4b6a3 100644 --- a/arch/powerpc/kernel/paca.c +++ b/arch/powerpc/kernel/paca.c @@ -207,6 +207,7 @@ void __init allocate_pacas(void) { u64 limit; int cpu; + unsigned int nr_cpus; limit = ppc64_rma_size; @@ -219,20 +220,32 @@ void __init allocate_pacas(void) limit = min(0x1000ULL, limit); #endif - paca_size = PAGE_ALIGN(sizeof(struct paca_struct) * nr_cpu_ids); + /* +* Always align up the nr_cpu_ids to SMT threads and allocate +* the paca. This will help us to prepare for a situation where +* boot cpu id > nr_cpus_id. We will use the last nthreads +* slots (nthreads == threads per core) to accommodate a core +* that contains boot cpu thread. +* +* Do not change nr_cpu_ids value here. Let us do that in +* early_init_dt_scan_cpus() where we know exact value +* of threads per core. +*/ + nr_cpus = _ALIGN_UP(nr_cpu_ids, MAX_SMT); + paca_size = PAGE_ALIGN(sizeof(struct paca_struct) * nr_cpus); paca = __va(memblock_alloc_base(paca_size, PAGE_SIZE, limit)); memset(paca, 0, paca_size); printk(KERN_DEBUG "Allocated %u bytes for %u pacas at %p\n", - paca_size, nr_cpu_ids, paca); + paca_size, nr_cpus, paca); - allocate_lppacas(nr_cpu_ids, limit); + allocate_lppacas(nr_cpus, limit); - allocate_slb_shadows(nr_cpu_ids, limit); + allocate_slb_shadows(nr_cpus, limit); /* Can't use for_each_*_cpu, as they aren't functional yet */ - for (cpu = 0; cpu < nr_cpu_ids; cpu++) + for (cpu = 0; cpu < nr_cpus; cpu++) initialise_paca([cpu], cpu); } diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c index f83056297441..93837093c5cb 100644 --- a/arch/powerpc/kernel/prom.c +++ b/arch/powerpc/kernel/prom.c @@ -302,6 +302,29 @@ static void __init check_cpu_feature_properties(unsigned long node) } } +/* + * Adjust the logical id of a boot cpu to fall under nr_cpu_ids. Map it to + * last core slot in the allocated paca array. + * + * e.g. on SMT=8 system, kernel booted with nr_cpus=1 and boot cpu = 33, + * align nr_cpu_ids to MAX_SMT value 8. Allocate paca array to hold up-to + * MAX_SMT=8 cpus. Since boot cpu 33 is greater than nr_cpus (8), adjust + * its logical id so that new id becomes less than nr_cpu_ids. Make s
[PATCH v2] powerpc: make /proc/self/stack always print the current stack
For the current task, the kernel stack would only tell the last time the process was rescheduled, if ever. Use the current stack pointer for the current task. Otherwise, every once in a while, the stacktrace printed when reading /proc/self/stack would look like the process is running in userspace, while it's not, which some may consider as a bug. This is also consistent with some other architectures, like x86 and arm, at least. Signed-off-by: Thadeu Lima de Souza Cascardo <casca...@canonical.com> --- arch/powerpc/kernel/stacktrace.c | 9 - 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/kernel/stacktrace.c b/arch/powerpc/kernel/stacktrace.c index 6671195..d534ed9 100644 --- a/arch/powerpc/kernel/stacktrace.c +++ b/arch/powerpc/kernel/stacktrace.c @@ -59,7 +59,14 @@ EXPORT_SYMBOL_GPL(save_stack_trace); void save_stack_trace_tsk(struct task_struct *tsk, struct stack_trace *trace) { - save_context_stack(trace, tsk->thread.ksp, tsk, 0); + unsigned long sp; + + if (tsk == current) + sp = current_stack_pointer(); + else + sp = tsk->thread.ksp; + + save_context_stack(trace, sp, tsk, 0); } EXPORT_SYMBOL_GPL(save_stack_trace_tsk); -- 2.9.3
[PATCH] powerpc: fix /proc/self/stack
For the current task, the kernel stack would only tell the last time the process was rescheduled, if ever. Use the current stack pointer for the current task. This is also consistent with some other architectures. Signed-off-by: Thadeu Lima de Souza Cascardo <casca...@canonical.com> --- arch/powerpc/kernel/stacktrace.c | 7 ++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/kernel/stacktrace.c b/arch/powerpc/kernel/stacktrace.c index 6671195..2446066 100644 --- a/arch/powerpc/kernel/stacktrace.c +++ b/arch/powerpc/kernel/stacktrace.c @@ -59,7 +59,12 @@ EXPORT_SYMBOL_GPL(save_stack_trace); void save_stack_trace_tsk(struct task_struct *tsk, struct stack_trace *trace) { - save_context_stack(trace, tsk->thread.ksp, tsk, 0); + unsigned long sp = tsk->thread.ksp; + + if (tsk == current) + sp = current_stack_pointer(); + + save_context_stack(trace, sp, tsk, 0); } EXPORT_SYMBOL_GPL(save_stack_trace_tsk); -- 2.9.3
Re: [PATCHv2 1/7] ppc bpf/jit: Disable classic BPF JIT on ppc64le
On Wed, Jun 22, 2016 at 09:55:01PM +0530, Naveen N. Rao wrote: > Classic BPF JIT was never ported completely to work on little endian > powerpc. However, it can be enabled and will crash the system when used. > As such, disable use of BPF JIT on ppc64le. Thanks, Naveen. Acked-by: Thadeu Lima de Souza Cascardo <casca...@redhat.com> > > Cc: sta...@vger.kernel.org > Cc: Matt Evans <m...@ozlabs.org> > Cc: Denis Kirjanov <k...@linux-powerpc.org> > Cc: Michael Ellerman <m...@ellerman.id.au> > Cc: Paul Mackerras <pau...@samba.org> > Cc: Alexei Starovoitov <a...@fb.com> > Cc: Daniel Borkmann <dan...@iogearbox.net> > Cc: "David S. Miller" <da...@davemloft.net> > Cc: Ananth N Mavinakayanahalli <ana...@in.ibm.com> > Cc: Thadeu Lima de Souza Cascardo <casca...@redhat.com> > Reported-by: Thadeu Lima de Souza Cascardo <casca...@redhat.com> > Signed-off-by: Naveen N. Rao <naveen.n@linux.vnet.ibm.com> > --- > arch/powerpc/Kconfig | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig > index 01f7464..0a9d439 100644 > --- a/arch/powerpc/Kconfig > +++ b/arch/powerpc/Kconfig > @@ -128,7 +128,7 @@ config PPC > select IRQ_FORCED_THREADING > select HAVE_RCU_TABLE_FREE if SMP > select HAVE_SYSCALL_TRACEPOINTS > - select HAVE_CBPF_JIT > + select HAVE_CBPF_JIT if CPU_BIG_ENDIAN > select HAVE_ARCH_JUMP_LABEL > select ARCH_HAVE_NMI_SAFE_CMPXCHG > select ARCH_HAS_GCOV_PROFILE_ALL > -- > 2.8.2 > ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH] ppc: Fix BPF JIT for ABIv2
On Tue, Jun 21, 2016 at 09:15:48PM +1000, Michael Ellerman wrote: > On Tue, 2016-06-21 at 14:28 +0530, Naveen N. Rao wrote: > > On 2016/06/20 03:56PM, Thadeu Lima de Souza Cascardo wrote: > > > On Sun, Jun 19, 2016 at 11:19:14PM +0530, Naveen N. Rao wrote: > > > > On 2016/06/17 10:00AM, Thadeu Lima de Souza Cascardo wrote: > > > > > > > > > > Hi, Michael and Naveen. > > > > > > > > > > I noticed independently that there is a problem with BPF JIT and > > > > > ABIv2, and > > > > > worked out the patch below before I noticed Naveen's patchset and the > > > > > latest > > > > > changes in ppc tree for a better way to check for ABI versions. > > > > > > > > > > However, since the issue described below affect mainline and stable > > > > > kernels, > > > > > would you consider applying it before merging your two patchsets, so > > > > > that we can > > > > > more easily backport the fix? > > > > > > > > Hi Cascardo, > > > > Given that this has been broken on ABIv2 since forever, I didn't bother > > > > fixing it. But, I can see why this would be a good thing to have for > > > > -stable and existing distros. However, while your patch below may fix > > > > the crash you're seeing on ppc64le, it is not sufficient -- you'll need > > > > changes in bpf_jit_asm.S as well. > > > > > > Hi, Naveen. > > > > > > Any tips on how to exercise possible issues there? Or what changes you > > > think > > > would be sufficient? > > > > The calling convention is different with ABIv2 and so we'll need changes > > in bpf_slow_path_common() and sk_negative_common(). > > How big would those changes be? Do we know? > > How come no one reported this was broken previously? This is the first I've > heard of it being broken. > I just heard of it less than two weeks ago, and only could investigate it last week, when I realized mainline was also affected. It looks like the little-endian support for classic JIT were done before the conversion to ABIv2. And as JIT is disabled by default, no one seems to have exercised it. > > However, rather than enabling classic JIT for ppc64le, are we better off > > just disabling it? > > > > --- a/arch/powerpc/Kconfig > > +++ b/arch/powerpc/Kconfig > > @@ -128,7 +128,7 @@ config PPC > > select IRQ_FORCED_THREADING > > select HAVE_RCU_TABLE_FREE if SMP > > select HAVE_SYSCALL_TRACEPOINTS > > - select HAVE_CBPF_JIT > > + select HAVE_CBPF_JIT if CPU_BIG_ENDIAN > > select HAVE_ARCH_JUMP_LABEL > > select ARCH_HAVE_NMI_SAFE_CMPXCHG > > select ARCH_HAS_GCOV_PROFILE_ALL > > > > > > Michael, > > Let me know your thoughts on whether you intend to take this patch or > > Cascardo's patch for -stable before the eBPF patches. I can redo my > > patches accordingly. > > This patch sounds like the best option at the moment for something we can > backport. Unless the changes to fix it are minimal. > > cheers > With my patch only, I can run a minimal tcpdump tcp port 22 with success. It correctly filter packets. But as pointed out, slow paths may not be taken. I don't have strong opinions on what to apply to stable, just that it would be nice to have something for the crash before applying all the nice changes by Naveen. Cascardo. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH] ppc: Fix BPF JIT for ABIv2
On Sun, Jun 19, 2016 at 11:19:14PM +0530, Naveen N. Rao wrote: > On 2016/06/17 10:00AM, Thadeu Lima de Souza Cascardo wrote: > > On Fri, Jun 17, 2016 at 10:53:21PM +1000, Michael Ellerman wrote: > > > On Tue, 2016-07-06 at 13:32:23 UTC, "Naveen N. Rao" wrote: > > > > diff --git a/arch/powerpc/net/bpf_jit_comp64.c > > > > b/arch/powerpc/net/bpf_jit_comp64.c > > > > new file mode 100644 > > > > index 000..954ff53 > > > > --- /dev/null > > > > +++ b/arch/powerpc/net/bpf_jit_comp64.c > > > > @@ -0,0 +1,956 @@ > > > ... > > > > + > > > > +static void bpf_jit_fill_ill_insns(void *area, unsigned int size) > > > > +{ > > > > + int *p = area; > > > > + > > > > + /* Fill whole space with trap instructions */ > > > > + while (p < (int *)((char *)area + size)) > > > > + *p++ = BREAKPOINT_INSTRUCTION; > > > > +} > > > > > > This breaks the build for some configs, presumably you're missing a > > > header: > > > > > > arch/powerpc/net/bpf_jit_comp64.c:30:10: error: > > > 'BREAKPOINT_INSTRUCTION' undeclared (first use in this function) > > > > > > http://kisskb.ellerman.id.au/kisskb/buildresult/12720611/ > > > > > > cheers > > > > Hi, Michael and Naveen. > > > > I noticed independently that there is a problem with BPF JIT and ABIv2, and > > worked out the patch below before I noticed Naveen's patchset and the latest > > changes in ppc tree for a better way to check for ABI versions. > > > > However, since the issue described below affect mainline and stable kernels, > > would you consider applying it before merging your two patchsets, so that > > we can > > more easily backport the fix? > > Hi Cascardo, > Given that this has been broken on ABIv2 since forever, I didn't bother > fixing it. But, I can see why this would be a good thing to have for > -stable and existing distros. However, while your patch below may fix > the crash you're seeing on ppc64le, it is not sufficient -- you'll need > changes in bpf_jit_asm.S as well. Hi, Naveen. Any tips on how to exercise possible issues there? Or what changes you think would be sufficient? I will see what I can find by myself, but would appreciate any help. Regards. Cascardo. > > Regards, > Naveen > ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH] ppc: Fix BPF JIT for ABIv2
On Fri, Jun 17, 2016 at 10:53:21PM +1000, Michael Ellerman wrote: > On Tue, 2016-07-06 at 13:32:23 UTC, "Naveen N. Rao" wrote: > > diff --git a/arch/powerpc/net/bpf_jit_comp64.c > > b/arch/powerpc/net/bpf_jit_comp64.c > > new file mode 100644 > > index 000..954ff53 > > --- /dev/null > > +++ b/arch/powerpc/net/bpf_jit_comp64.c > > @@ -0,0 +1,956 @@ > ... > > + > > +static void bpf_jit_fill_ill_insns(void *area, unsigned int size) > > +{ > > + int *p = area; > > + > > + /* Fill whole space with trap instructions */ > > + while (p < (int *)((char *)area + size)) > > + *p++ = BREAKPOINT_INSTRUCTION; > > +} > > This breaks the build for some configs, presumably you're missing a header: > > arch/powerpc/net/bpf_jit_comp64.c:30:10: error: 'BREAKPOINT_INSTRUCTION' > undeclared (first use in this function) > > http://kisskb.ellerman.id.au/kisskb/buildresult/12720611/ > > cheers Hi, Michael and Naveen. I noticed independently that there is a problem with BPF JIT and ABIv2, and worked out the patch below before I noticed Naveen's patchset and the latest changes in ppc tree for a better way to check for ABI versions. However, since the issue described below affect mainline and stable kernels, would you consider applying it before merging your two patchsets, so that we can more easily backport the fix? Thanks. Cascardo. --- From a984dc02b6317a1d3a3c2302385adba5227be5bd Mon Sep 17 00:00:00 2001 From: Thadeu Lima de Souza Cascardo <casca...@redhat.com> Date: Wed, 15 Jun 2016 13:22:12 -0300 Subject: [PATCH] ppc: Fix BPF JIT for ABIv2 ABIv2 used for ppc64le does not use function descriptors. Without this patch, whenever BPF JIT is enabled, we get a crash as below. [root@ibm-p8-kvm-05-guest-02 ~]# echo 2 > /proc/sys/net/core/bpf_jit_enable [root@ibm-p8-kvm-05-guest-02 ~]# tcpdump -n -i eth0 tcp port 22 device eth0 entered promiscuous mode Pass 1: shrink = 0, seen = 0x0 Pass 2: shrink = 0, seen = 0x0 flen=1 proglen=8 pass=3 image=d5bb9018 from=tcpdump pid=11387 JIT code: : 00 00 60 38 20 00 80 4e Pass 1: shrink = 0, seen = 0x3 Pass 2: shrink = 0, seen = 0x3 flen=20 proglen=524 pass=3 image=d5bbd018 from=tcpdump pid=11387 JIT code: : a6 02 08 7c 10 00 01 f8 70 ff c1 f9 78 ff e1 f9 JIT code: 0010: e1 fe 21 f8 7c 00 e3 80 78 00 e3 81 50 78 e7 7d JIT code: 0020: c8 00 c3 e9 00 00 a0 38 00 c0 e0 3c c6 07 e7 78 JIT code: 0030: 08 00 e7 64 54 1b e7 60 a6 03 e8 7c 0c 00 c0 38 JIT code: 0040: 21 00 80 4e b0 01 80 41 00 00 00 60 dd 86 e0 38 JIT code: 0050: 01 00 e7 3c 40 38 04 7c 9c 00 82 40 00 00 00 60 JIT code: 0060: 00 c0 e0 3c c6 07 e7 78 08 00 e7 64 70 1b e7 60 JIT code: 0070: a6 03 e8 7c 14 00 c0 38 21 00 80 4e 78 01 80 41 JIT code: 0080: 00 00 00 60 06 00 04 28 68 01 82 40 00 00 00 60 JIT code: 0090: 00 c0 e0 3c c6 07 e7 78 08 00 e7 64 54 1b e7 60 JIT code: 00a0: a6 03 e8 7c 36 00 c0 38 21 00 80 4e 48 01 80 41 JIT code: 00b0: 00 00 00 60 16 00 04 28 2c 01 82 41 00 00 00 60 JIT code: 00c0: 00 c0 e0 3c c6 07 e7 78 08 00 e7 64 54 1b e7 60 JIT code: 00d0: a6 03 e8 7c 38 00 c0 38 21 00 80 4e 18 01 80 41 JIT code: 00e0: 00 00 00 60 16 00 04 28 fc 00 82 41 00 00 00 60 JIT code: 00f0: 00 01 00 48 00 08 04 28 f8 00 82 40 00 00 00 60 JIT code: 0100: 00 c0 e0 3c c6 07 e7 78 08 00 e7 64 70 1b e7 60 JIT code: 0110: a6 03 e8 7c 17 00 c0 38 21 00 80 4e d8 00 80 41 JIT code: 0120: 00 00 00 60 06 00 04 28 c8 00 82 40 00 00 00 60 JIT code: 0130: 00 c0 e0 3c c6 07 e7 78 08 00 e7 64 54 1b e7 60 JIT code: 0140: a6 03 e8 7c 14 00 c0 38 21 00 80 4e a8 00 80 41 JIT code: 0150: 00 00 00 60 ff 1f 87 70 98 00 82 40 00 00 00 60 JIT code: 0160: 00 c0 e0 3c c6 07 e7 78 08 00 e7 64 88 1b e7 60 JIT code: 0170: a6 03 e8 7c 0e 00 c0 38 21 00 80 4e 78 00 80 41 JIT code: 0180: 00 00 00 60 00 c0 e0 3c c6 07 e7 78 08 00 e7 64 JIT code: 0190: 4c 1b e7 60 a6 03 e8 7c 0e 00 c5 38 21 00 80 4e JIT code: 01a0: 54 00 80 41 00 00 00 60 16 00 04 28 38 00 82 41 JIT code: 01b0: 00 00 00 60 00 c0 e0 3c c6 07 e7 78 08 00 e7 64 JIT code: 01c0: 4c 1b e7 60 a6 03 e8 7c 10 00 c5 38 21 00 80 4e JIT code: 01d0: 24 00 80 41 00 00 00 60 16 00 04 28 14 00 82 40 JIT code: 01e0: 00 00 00 60 ff ff 60 38 01 00 63 3c 08 00 00 48 JIT code: 01f0: 00 00 60 38 20 01 21 38 10 00 01 e8 a6 03 08 7c JIT code: 0200: 70 ff c1 e9 78 ff e1 e9 20 00 80 4e tcpdump: verbose output suppressed, use -v or -vv for full protocol decode listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes Oops: Exception in kernel mode, sig: 4 [#1] SMP NR_CPUS=32 NUMA pSeries Modules linked in: virtio_balloon nfsd ip_tables x_tables autofs4 xfs libcrc32c virtio_console virtio_net virtio_pci virtio_ring virtio CPU: 1 PID: 0 Comm: swapper/1 Not t
Re: Generic IOMMU pooled allocator
On Mon, Mar 23, 2015 at 10:15:08PM -0400, David Miller wrote: From: Benjamin Herrenschmidt b...@kernel.crashing.org Date: Tue, 24 Mar 2015 13:08:10 +1100 For the large pool, we don't keep a hint so we don't know it's wrapped, in fact we purposefully don't use a hint to limit fragmentation on it, but then, it should be used rarely enough that flushing always is, I suspect, a good option. I can't think of any use case where the largepool would be hit a lot at all. Well, until recently, IOMMU_PAGE_SIZE was 4KiB on Power, so every time a driver mapped a whole 64KiB page, it would hit the largepool. I have been suspicious for some time that after Anton's work on the pools, the large mappings optimization would throw away the benefit of using the 4 pools, since some drivers would always hit the largepool. Of course, drivers that map entire pages, when not buggy, are optimized already to avoid calling dma_map all the time. I worked on that for mlx4_en, and I would expect that its receive side would always hit the largepool. So, I decided to experiment and count the number of times that largealloc is true versus false. On the transmit side, or when using ICMP, I didn't notice many large allocations with qlge or cxgb4. However, when using large TCP send/recv (I used uperf with 64KB writes/reads), I noticed that on the transmit side, largealloc is not used, but on the receive side, cxgb4 almost only uses largealloc, while qlge seems to have a 1/1 usage or largealloc/non-largealloc mappings. When turning GRO off, that ratio is closer to 1/10, meaning there is still some fair use of largealloc in that scenario. I confess my experiments are not complete. I would like to test a couple of other drivers as well, including mlx4_en and bnx2x, and test with small packet sizes. I suspected that MTU size could make a difference, but in the case of ICMP, with MTU 9000 and payload of 8000 bytes, I didn't notice any significant hit of largepool with either qlge or cxgb4. Also, we need to keep in mind that IOMMU_PAGE_SIZE is now dynamic in the latest code, with plans on using 64KiB in some situations, Alexey or Ben should have more details. But I believe that on the receive side, all drivers should map entire pages, using some allocation strategy similar to mlx4_en, in order to avoid DMA mapping all the time. Some believe that is bad for latency, and prefer to call something like skb_alloc for every package received, but I haven't seen any hard numbers, and I don't know why we couldn't make such an allocator as good as using something like the SLAB/SLUB allocator. Maybe there is a jitter problem, since the allocator has to go out and get some new pages and map them, once in a while. But I don't see why this would not be a problem with SLAB/SLUB as well. Calling dma_map is even worse with the current implementation. It's just that some architectures do no work at all when dma_map/unmap is called. Hope that helps consider the best strategy for the DMA space allocation as of now. Regards. Cascardo. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v3 1/2] PCI: One more parameter to pci_set_pcie_reset_state()
will affect other functions. We need to either have in written that this is acceptable and that higher layers should take care of these possibilities, if they care about them, or we need to fix this, by making VFIO only accept a whole group of functions to be assigned to a guest, which will allow the problem described by Gavin to be properly handled, by using slot reset functions, or another way of grouping devices that may be specific to the platform. Does it make sense? Cascardo. diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c index daa68a1..cd85c18 100644 --- a/arch/powerpc/kernel/eeh.c +++ b/arch/powerpc/kernel/eeh.c @@ -726,11 +726,14 @@ static void *eeh_restore_dev_state(void *data, void *userdata) * pcibios_set_pcie_slot_reset - Set PCI-E reset state * @dev: pci device struct * @state: reset state to enter + * @probe: check if the device can take the reset * * Return value: * 0 if success */ -int pcibios_set_pcie_reset_state(struct pci_dev *dev, enum pcie_reset_state state) +int pcibios_set_pcie_reset_state(struct pci_dev *dev, + enum pcie_reset_state state, + int probe) { struct eeh_dev *edev = pci_dev_to_eeh_dev(dev); struct eeh_pe *pe = eeh_dev_to_pe(edev); @@ -738,9 +741,12 @@ int pcibios_set_pcie_reset_state(struct pci_dev *dev, enum pcie_reset_state stat if (!pe) { pr_err(%s: No PE found on PCI device %s\n, __func__, pci_name(dev)); - return -EINVAL; + return -ENOTTY; } + if (probe) + return 0; + switch (state) { case pcie_deassert_reset: eeh_ops-reset(pe, EEH_RESET_DEACTIVATE); @@ -762,8 +768,8 @@ int pcibios_set_pcie_reset_state(struct pci_dev *dev, enum pcie_reset_state stat break; default: eeh_pe_state_clear(pe, EEH_PE_CFG_BLOCKED); - return -EINVAL; - }; + return -ENOTTY; + } return 0; } diff --git a/drivers/misc/cxl/pci.c b/drivers/misc/cxl/pci.c index 1ef0164..3a87bfc 100644 --- a/drivers/misc/cxl/pci.c +++ b/drivers/misc/cxl/pci.c @@ -789,7 +789,7 @@ int cxl_reset(struct cxl *adapter) /* pcie_warm_reset requests a fundamental pci reset which includes a * PERST assert/deassert. PERST triggers a loading of the image * if user or factory is selected in sysfs */ - if ((rc = pci_set_pcie_reset_state(dev, pcie_warm_reset))) { + if ((rc = pci_set_pcie_reset_state(dev, pcie_warm_reset, 0))) { dev_err(dev-dev, cxl: pcie_warm_reset failed\n); return rc; } diff --git a/drivers/misc/genwqe/card_base.c b/drivers/misc/genwqe/card_base.c index 4cf8f82..4871f69 100644 --- a/drivers/misc/genwqe/card_base.c +++ b/drivers/misc/genwqe/card_base.c @@ -782,17 +782,22 @@ static int genwqe_pci_fundamental_reset(struct pci_dev *pci_dev) { int rc; + /* Probe if the device can take the reset */ + rc = pci_set_pcie_reset_state(pci_dev, pcie_warm_reset, 1); + if (rc) + return rc; + /* * lock pci config space access from userspace, * save state and issue PCIe fundamental reset */ pci_cfg_access_lock(pci_dev); pci_save_state(pci_dev); - rc = pci_set_pcie_reset_state(pci_dev, pcie_warm_reset); + rc = pci_set_pcie_reset_state(pci_dev, pcie_warm_reset, 0); if (!rc) { /* keep PCIe reset asserted for 250ms */ msleep(250); - pci_set_pcie_reset_state(pci_dev, pcie_deassert_reset); + pci_set_pcie_reset_state(pci_dev, pcie_deassert_reset, 0); /* Wait for 2s to reload flash and train the link */ msleep(2000); } diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c index 81f06e8..8581a5f 100644 --- a/drivers/pci/pci.c +++ b/drivers/pci/pci.c @@ -1558,28 +1558,31 @@ EXPORT_SYMBOL(pci_disable_device); * pcibios_set_pcie_reset_state - set reset state for device dev * @dev: the PCIe device reset * @state: Reset state to enter into - * + * @probe: Check if the device can take the reset * * Sets the PCIe reset state for the device. This is the default * implementation. Architecture implementations can override this. */ int __weak pcibios_set_pcie_reset_state(struct pci_dev *dev, - enum pcie_reset_state state) + enum pcie_reset_state state, + int
Re: [PATCH 0/4] Support registering specific reset handler
On Tue, Feb 17, 2015 at 09:36:47AM +1100, Gavin Shan wrote: On Mon, Feb 16, 2015 at 11:14:27AM -0200, casca...@linux.vnet.ibm.com wrote: On Fri, Feb 13, 2015 at 03:54:55PM +1100, Gavin Shan wrote: VFIO PCI infrastructure depends on pci_reset_function() to do reset on PCI devices so that they would be in clean state when host or guest grabs them. Unfortunately, the function doesn't work (or not well) on some PCI devices that require EEH PE reset. The patchset extends the quirk for PCI device speicific reset methods to allow dynamically registration. With it, we can translate reset requests for those special PCI devcies to EEH PE reset, which is only avaialble on 64-bits PowerPC platforms. Hi, Gavin. I like your approach overall. That allows us to confine these quirks to the platforms where they are relevant. I would make the quirks more specific, though, instead of doing them for all IBM and Mellanox devices. Yeah, we need have more specific vendor/device IDs for PATCH[4/4]. Could you please take a look on PATCH[4/4] and then suggest the specific devices that requries the platform-dependent reset quirk? Especially the device IDs for IBM/Mellanox we need put add quirks for. I wonder if we should not have some form of domain reset, where we would reset all the devices on the same group, and use that on vfio. Grouping the devices would then be made platform-dependent, as well as the reset method. On powernv, we would group by IOMMU group and issue a fundamental reset. I'm assuming domain reset is PE reset, which is the specific reset handler tries to do on PowerNV platform. The reason why we need platform specific reset handler is some adapters can't support function level reset methods (except pci_dev_specific_reset()) in __pci_dev_reset(). Well, in the case of Power servers, this would be the PE reset, I am not sure what this would be on other platforms. No other platform implements pci_set_pcie_reset_state, which I think would be one possible to implement such a reset. What I am saying is that we should consider doing this on all platforms and for all adapters, because this is not specific to Power and this is not specific to this particular set of adapters. Otherwise, one can simply program one adapter in the guest to write to host memory, shutdown, and since no reset will take place, the card will simply write to host memory when the guest is finished with and IOMMU is off. Regards. Cascardo. Thanks, Gavin Cascardo. Gavin Shan (4): PCI: Rename struct pci_dev_reset_methods PCI: Introduce list for device reset methods PCI: Allow registering reset method powerpc/powernv: Register PCI dev specific reset handlers arch/powerpc/platforms/powernv/pci.c | 61 +++ drivers/pci/pci.h| 3 +- drivers/pci/quirks.c | 139 ++- include/linux/pci.h | 9 +++ 4 files changed, 192 insertions(+), 20 deletions(-) -- 1.8.3.2 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 0/4] Support registering specific reset handler
On Fri, Feb 13, 2015 at 03:54:55PM +1100, Gavin Shan wrote: VFIO PCI infrastructure depends on pci_reset_function() to do reset on PCI devices so that they would be in clean state when host or guest grabs them. Unfortunately, the function doesn't work (or not well) on some PCI devices that require EEH PE reset. The patchset extends the quirk for PCI device speicific reset methods to allow dynamically registration. With it, we can translate reset requests for those special PCI devcies to EEH PE reset, which is only avaialble on 64-bits PowerPC platforms. Hi, Gavin. I like your approach overall. That allows us to confine these quirks to the platforms where they are relevant. I would make the quirks more specific, though, instead of doing them for all IBM and Mellanox devices. I wonder if we should not have some form of domain reset, where we would reset all the devices on the same group, and use that on vfio. Grouping the devices would then be made platform-dependent, as well as the reset method. On powernv, we would group by IOMMU group and issue a fundamental reset. Cascardo. Gavin Shan (4): PCI: Rename struct pci_dev_reset_methods PCI: Introduce list for device reset methods PCI: Allow registering reset method powerpc/powernv: Register PCI dev specific reset handlers arch/powerpc/platforms/powernv/pci.c | 61 +++ drivers/pci/pci.h| 3 +- drivers/pci/quirks.c | 139 ++- include/linux/pci.h | 9 +++ 4 files changed, 192 insertions(+), 20 deletions(-) -- 1.8.3.2 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH resend] powernv/iommu: disable IOMMU bypass with param iommu=nobypass
On Thu, Jan 22, 2015 at 12:05:31PM +1100, Michael Ellerman wrote: On Wed, 2015-01-21 at 13:23 -0200, Thadeu Lima de Souza Cascardo wrote: When IOMMU bypass is enabled, a PCI device can read and write memory that was not mapped by the driver without causing an EEH. That might cause memory corruption, for example. When we disable bypass, DMA reads and writes to addresses not mapped by the IOMMU will cause an EEH, allowing us to debug such issues. Signed-off-by: Thadeu Lima de Souza Cascardo casca...@linux.vnet.ibm.com Reviewed-by: Gavin Shan gws...@linux.vnet.ibm.com --- Documentation/kernel-parameters.txt |2 ++ arch/powerpc/platforms/powernv/pci-ioda.c | 24 +++- 2 files changed, 25 insertions(+), 1 deletion(-) Is this unchanged since v2? Yes, it's unchanged, just rebased. Cascardo. It's marked as under review in patchwork which means I'm looking at it: http://patchwork.ozlabs.org/patch/402682/ Please check patchwork in future. cheers ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH resend] powernv/iommu: disable IOMMU bypass with param iommu=nobypass
When IOMMU bypass is enabled, a PCI device can read and write memory that was not mapped by the driver without causing an EEH. That might cause memory corruption, for example. When we disable bypass, DMA reads and writes to addresses not mapped by the IOMMU will cause an EEH, allowing us to debug such issues. Signed-off-by: Thadeu Lima de Souza Cascardo casca...@linux.vnet.ibm.com Reviewed-by: Gavin Shan gws...@linux.vnet.ibm.com --- Documentation/kernel-parameters.txt |2 ++ arch/powerpc/platforms/powernv/pci-ioda.c | 24 +++- 2 files changed, 25 insertions(+), 1 deletion(-) diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index 4df73da..7dedfe5 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -1493,6 +1493,8 @@ bytes respectively. Such letter suffixes can also be entirely omitted. forcesac soft pt [x86, IA-64] + nobypass[PPC/POWERNV] + Disable IOMMU bypass, using IOMMU for PCI devices. io7=[HW] IO7 for Marvel based alpha systems diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index fac88ed..f942a19 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -75,6 +75,27 @@ static void pe_level_printk(const struct pnv_ioda_pe *pe, const char *level, #define pe_info(pe, fmt, ...) \ pe_level_printk(pe, KERN_INFO, fmt, ##__VA_ARGS__) +static bool pnv_iommu_bypass_disabled __read_mostly; + +static int __init iommu_setup(char *str) +{ + if (!str) + return -EINVAL; + while (*str) { + if (!strncmp(str, nobypass, 8)) { + pnv_iommu_bypass_disabled = true; + pr_info(PowerNV: IOMMU bypass window disabled.\n); + } + str += strcspn(str, ,); + if (*str == ',') + str++; + } + + return 0; +} + +early_param(iommu, iommu_setup); + /* * stdcix is only supposed to be used in hypervisor real mode as per * the architecture spec @@ -1348,7 +1369,8 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb, pnv_ioda_setup_bus_dma(pe, pe-pbus, true); /* Also create a bypass window */ - pnv_pci_ioda2_setup_bypass_pe(phb, pe); + if (!pnv_iommu_bypass_disabled) + pnv_pci_ioda2_setup_bypass_pe(phb, pe); return; fail: if (pe-tce32_seg = 0) -- 1.7.10.4 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH v2] powernv/iommu: disable IOMMU bypass with param iommu=nobypass
When IOMMU bypass is enabled, a PCI device can read and write memory that was not mapped by the driver without causing an EEH. That might cause memory corruption, for example. When we disable bypass, DMA reads and writes to addresses not mapped by the IOMMU will cause an EEH, allowing us to debug such issues. Signed-off-by: Thadeu Lima de Souza Cascardo casca...@linux.vnet.ibm.com Reviewed-by: Gavin Shan gws...@linux.vnet.ibm.com --- Documentation/kernel-parameters.txt |2 ++ arch/powerpc/platforms/powernv/pci-ioda.c | 24 +++- 2 files changed, 25 insertions(+), 1 deletions(-) diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index 988160a..b03322a 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -1454,6 +1454,8 @@ bytes respectively. Such letter suffixes can also be entirely omitted. forcesac soft pt [x86, IA-64] + nobypass[PPC/POWERNV] + Disable IOMMU bypass, using IOMMU for PCI devices. io7=[HW] IO7 for Marvel based alpha systems diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index 468a0f2..a7d2f32 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -75,6 +75,27 @@ static void pe_level_printk(const struct pnv_ioda_pe *pe, const char *level, #define pe_info(pe, fmt, ...) \ pe_level_printk(pe, KERN_INFO, fmt, ##__VA_ARGS__) +static bool pnv_iommu_bypass_disabled __read_mostly; + +static int __init iommu_setup(char *str) +{ + if (!str) + return -EINVAL; + while (*str) { + if (!strncmp(str, nobypass, 8)) { + pnv_iommu_bypass_disabled = true; + pr_info(PowerNV: IOMMU bypass window disabled.\n); + } + str += strcspn(str, ,); + if (*str == ',') + str++; + } + + return 0; +} + +early_param(iommu, iommu_setup); + /* * stdcix is only supposed to be used in hypervisor real mode as per * the architecture spec @@ -1243,7 +1264,8 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb, pnv_ioda_setup_bus_dma(pe, pe-pbus, true); /* Also create a bypass window */ - pnv_pci_ioda2_setup_bypass_pe(phb, pe); + if (!pnv_iommu_bypass_disabled) + pnv_pci_ioda2_setup_bypass_pe(phb, pe); return; fail: if (pe-tce32_seg = 0) -- 1.7.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH] powernv/iommu: disable IOMMU bypass with param iommu=nobypass
When IOMMU bypass is enabled, a PCI device can read and write memory that was not mapped by the driver without causing an EEH. That might cause memory corruption, for example. When we disable bypass, DMA reads and writes to addresses not mapped by the IOMMU will cause an EEH, allowing us to debug such issues. Signed-off-by: Thadeu Lima de Souza Cascardo casca...@linux.vnet.ibm.com --- Documentation/kernel-parameters.txt |2 ++ arch/powerpc/platforms/powernv/pci-ioda.c | 24 +++- 2 files changed, 25 insertions(+), 1 deletions(-) diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index 988160a..b03322a 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -1454,6 +1454,8 @@ bytes respectively. Such letter suffixes can also be entirely omitted. forcesac soft pt [x86, IA-64] + nobypass[PPC/POWERNV] + Disable IOMMU bypass, using IOMMU for PCI devices. io7=[HW] IO7 for Marvel based alpha systems diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index 468a0f2..58a5a27 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -75,6 +75,27 @@ static void pe_level_printk(const struct pnv_ioda_pe *pe, const char *level, #define pe_info(pe, fmt, ...) \ pe_level_printk(pe, KERN_INFO, fmt, ##__VA_ARGS__) +static int __read_mostly disable_bypass; + +static int __init iommu_setup(char *str) +{ + if (!*str) + return -EINVAL; + while (*str) { + if (!strncmp(str, nobypass, 8)) { + disable_bypass = 1; + pr_info(ppc iommu: disabling bypass.\n); + } + str += strcspn(str, ,); + if (*str == ',') + str++; + } + + return 0; +} + +early_param(iommu, iommu_setup); + /* * stdcix is only supposed to be used in hypervisor real mode as per * the architecture spec @@ -1243,7 +1264,8 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb, pnv_ioda_setup_bus_dma(pe, pe-pbus, true); /* Also create a bypass window */ - pnv_pci_ioda2_setup_bypass_pe(phb, pe); + if (!disable_bypass) + pnv_pci_ioda2_setup_bypass_pe(phb, pe); return; fail: if (pe-tce32_seg = 0) -- 1.7.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v2] deb-pkg: Add support for powerpc little endian
On Fri, Sep 05, 2014 at 05:55:18PM +1000, Michael Neuling wrote: On Fri, 2014-09-05 at 09:13 +0200, Gabriel Paubert wrote: On Fri, Sep 05, 2014 at 03:28:47PM +1000, Michael Neuling wrote: The Debian powerpc little endian architecture is called ppc64le. This Huh? ppc64le or ppc64el? ppc64el. Commit message is wrong. Fixed below. Mikey What about ppc64? Also, I sent that already a month ago. Both linuxppc-dev and Michal Marek were on cc. http://marc.info/?l=linux-kernelm=140744360328562w=2 Cascardo. From: Michael Neuling mi...@neuling.org deb-pkg: Add support for powerpc little endian The Debian powerpc little endian architecture is called ppc64el. This is the default architecture used by Ubuntu for powerpc. The below checks the kernel config to see if we are compiling little endian and sets the Debian arch appropriately. Signed-off-by: Michael Neuling mi...@neuling.org diff --git a/scripts/package/builddeb b/scripts/package/builddeb index 35d5a58..6f4a1af 100644 --- a/scripts/package/builddeb +++ b/scripts/package/builddeb @@ -37,7 +37,7 @@ create_package() { s390*) debarch=s390$(grep -q CONFIG_64BIT=y $KCONFIG_CONFIG echo x || true) ;; ppc*) - debarch=powerpc ;; + debarch=$(grep -q CPU_LITTLE_ENDIAN=y $KCONFIG_CONFIG echo ppc64el || echo powerpc) ;; parisc*) debarch=hppa ;; mips*) ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH] deb-pkg: support ppc64 and ppc64el Debian archs
Signed-off-by: Thadeu Lima de Souza Cascardo casca...@linux.vnet.ibm.com --- scripts/package/builddeb |2 ++ 1 files changed, 2 insertions(+), 0 deletions(-) diff --git a/scripts/package/builddeb b/scripts/package/builddeb index 35d5a58..c26c28b 100644 --- a/scripts/package/builddeb +++ b/scripts/package/builddeb @@ -36,6 +36,8 @@ create_package() { debarch=sparc ;; s390*) debarch=s390$(grep -q CONFIG_64BIT=y $KCONFIG_CONFIG echo x || true) ;; + ppc64*) + debarch=ppc64$(grep -q CPU_LITTLE_ENDIAN=y $KCONFIG_CONFIG echo el || true) ;; ppc*) debarch=powerpc ;; parisc*) -- 1.7.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[RFC PATCH] vfio-pci: avoid deadlock between unbind and VFIO_DEVICE_RESET
When we unbind vfio-pci from a device, while running a guest, we might have a deadlock when such a guest reboots. Unbind takes device_lock at device_release_driver, and waits for release_q at vfio_del_group_dev. release_q will only be woken up when all references to vfio_device are gone, and that includes open file descriptors, like the ones a guest on qemu will hold. If you try to reboot the guest, it will call VFIO_DEVICE_RESET, which calls pci_reset_function, which now grabs the device_lock, and we are deadlocked. Using device_trylock allow us to handle the case when the lock is already taken, and avoid this situation. Signed-off-by: Thadeu Lima de Souza Cascardo casca...@linux.vnet.ibm.com --- Not tested yet, but I would like some comments now, like would it be better to have a pci_try_reset_function, or do trylock on pci_reset_function itself? --- drivers/vfio/pci/vfio_pci.c | 14 -- 1 files changed, 12 insertions(+), 2 deletions(-) diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c index 3b76dc8..d1d2242 100644 --- a/drivers/vfio/pci/vfio_pci.c +++ b/drivers/vfio/pci/vfio_pci.c @@ -513,8 +513,18 @@ static long vfio_pci_ioctl(void *device_data, return ret; } else if (cmd == VFIO_DEVICE_RESET) { - return vdev-reset_works ? - pci_reset_function(vdev-pdev) : -EINVAL; + struct pci_dev *pdev = vdev-pdev; + int ret = -EBUSY; + if (!vdev-reset_works) + return -EINVAL; + if (pci_cfg_access_trylock(pdev)) { + if (device_trylock(pdev-dev)) { + ret = __pci_reset_function_locked(pdev); + device_unlock(pdev-dev); + } + pci_cfg_access_unlock(pdev); + } + return ret; } else if (cmd == VFIO_DEVICE_GET_PCI_HOT_RESET_INFO) { struct vfio_pci_hot_reset_info hdr; -- 1.7.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [RFC PATCH] vfio-pci: avoid deadlock between unbind and VFIO_DEVICE_RESET
On Mon, Mar 03, 2014 at 08:09:22AM -0700, Alex Williamson wrote: On Mon, 2014-03-03 at 11:33 -0300, Thadeu Lima de Souza Cascardo wrote: When we unbind vfio-pci from a device, while running a guest, we might have a deadlock when such a guest reboots. Unbind takes device_lock at device_release_driver, and waits for release_q at vfio_del_group_dev. release_q will only be woken up when all references to vfio_device are gone, and that includes open file descriptors, like the ones a guest on qemu will hold. If you try to reboot the guest, it will call VFIO_DEVICE_RESET, which calls pci_reset_function, which now grabs the device_lock, and we are deadlocked. Using device_trylock allow us to handle the case when the lock is already taken, and avoid this situation. Signed-off-by: Thadeu Lima de Souza Cascardo casca...@linux.vnet.ibm.com --- Not tested yet, but I would like some comments now, like would it be better to have a pci_try_reset_function, or do trylock on pci_reset_function itself? We already have it: http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=61cf16d8bd38c3dc52033ea75d5b1f8368514a17 http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=890ed578df82f5b7b5a874f9f2fa4f117305df5f Is there something insufficient about these or are you testing on and older kernel? Thanks, Alex Sorry I missed it. On the rush to report and fix it, I looked only on my local branch. Should we backport those two patches to long term stable 3.10? I can reproduce the issue there. Thanks. Cascardo. --- drivers/vfio/pci/vfio_pci.c | 14 -- 1 files changed, 12 insertions(+), 2 deletions(-) diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c index 3b76dc8..d1d2242 100644 --- a/drivers/vfio/pci/vfio_pci.c +++ b/drivers/vfio/pci/vfio_pci.c @@ -513,8 +513,18 @@ static long vfio_pci_ioctl(void *device_data, return ret; } else if (cmd == VFIO_DEVICE_RESET) { - return vdev-reset_works ? - pci_reset_function(vdev-pdev) : -EINVAL; + struct pci_dev *pdev = vdev-pdev; + int ret = -EBUSY; + if (!vdev-reset_works) + return -EINVAL; + if (pci_cfg_access_trylock(pdev)) { + if (device_trylock(pdev-dev)) { + ret = __pci_reset_function_locked(pdev); + device_unlock(pdev-dev); + } + pci_cfg_access_unlock(pdev); + } + return ret; } else if (cmd == VFIO_DEVICE_GET_PCI_HOT_RESET_INFO) { struct vfio_pci_hot_reset_info hdr; ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH v2] powerpc/eeh: drop taken reference to driver on eeh_rmv_device
Commit f5c57710dd62dd06f176934a8b4b8accbf00f9f8 (powerpc/eeh: Use partial hotplug for EEH unaware drivers) introduces eeh_rmv_device, which may grab a reference to a driver, but not release it. That prevents a driver from being removed after it has gone through EEH recovery. This patch drops the reference if it was taken. Signed-off-by: Thadeu Lima de Souza Cascardo casca...@linux.vnet.ibm.com Acked-by: Gavin Shan sha...@linux.vnet.ibm.com --- arch/powerpc/kernel/eeh_driver.c |8 ++-- 1 files changed, 6 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c index 7bb30dc..fdc679d 100644 --- a/arch/powerpc/kernel/eeh_driver.c +++ b/arch/powerpc/kernel/eeh_driver.c @@ -362,9 +362,13 @@ static void *eeh_rmv_device(void *data, void *userdata) */ if (!dev || (dev-hdr_type PCI_HEADER_TYPE_BRIDGE)) return NULL; + driver = eeh_pcid_get(dev); - if (driver driver-err_handler) - return NULL; + if (driver) { + eeh_pcid_put(dev); + if (driver-err_handler) + return NULL; + } /* Remove it from PCI subsystem */ pr_debug(EEH: Removing %s without EEH sensitive driver\n, -- 1.7.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v2] powerpc/eeh: drop taken reference to driver on eeh_rmv_device
On Wed, Feb 05, 2014 at 10:43:38AM -0800, Nishanth Aravamudan wrote: On 05.02.2014 [16:20:45 -0200], Thadeu Lima de Souza Cascardo wrote: Commit f5c57710dd62dd06f176934a8b4b8accbf00f9f8 (powerpc/eeh: Use partial hotplug for EEH unaware drivers) introduces eeh_rmv_device, which may grab a reference to a driver, but not release it. That prevents a driver from being removed after it has gone through EEH recovery. This patch drops the reference if it was taken. Signed-off-by: Thadeu Lima de Souza Cascardo casca...@linux.vnet.ibm.com Acked-by: Gavin Shan sha...@linux.vnet.ibm.com --- arch/powerpc/kernel/eeh_driver.c |8 ++-- 1 files changed, 6 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c index 7bb30dc..fdc679d 100644 --- a/arch/powerpc/kernel/eeh_driver.c +++ b/arch/powerpc/kernel/eeh_driver.c @@ -362,9 +362,13 @@ static void *eeh_rmv_device(void *data, void *userdata) */ if (!dev || (dev-hdr_type PCI_HEADER_TYPE_BRIDGE)) return NULL; + This appears to be unnecessary whitespace change? -Nish Hi, Nish. I originally add it there for readability, giving both more evidence to the code below, where a driver reference is get and put, and to the code above and its respective comment. Leaving those together could give the impression the comment also applies to the code below. But I have no strong feelings about that. Ben, do you want me to send a new version? Regards. Cascardo. driver = eeh_pcid_get(dev); - if (driver driver-err_handler) - return NULL; + if (driver) { + eeh_pcid_put(dev); + if (driver-err_handler) + return NULL; + } /* Remove it from PCI subsystem */ pr_debug(EEH: Removing %s without EEH sensitive driver\n, -- 1.7.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: Oops on shutdown : 3.11.0-5-powerpc-e500mc Ubuntu
On Tue, Feb 04, 2014 at 08:47:14AM -0600, John Donnelly wrote: Hi, Where is a appropriate place to file a bug for this : This is the second occurrence I have seen. If this is a Ubuntu kernel, probably report it at Canonical's Launchpad. If it is provided by Freescale, report it to Freescale. If you can run a mainline kernel (best now would be 3.14-rc1) and reproduce that, then best place would be linux-...@vger.kernel.org. Hope that helps. Cascardo. saucy)root@(none):/work/svy/tools# * Stopping MD monitoring service mdadm --monitor [ OK ] * Asking all remaining processes to terminate...[ OK ] * All processes ended within 1 seconds... [ OK ] [65426.692405] Oops: Exception in kernel mode, sig: 11 [#1] [65426.697721] SMP NR_CPUS=8 P4080 DS [65426.701122] Modules linked in: btrfs(F) raid6_pq(F) zlib_deflate(F) xor(F) ufs(F) qnx4(F) hfsplus(F) hfs(F) minix(F) ntfs(F) msdos(F) jfs(F) xfs(F) libcrc32c(F) reiserfs(F) nls_iso8859_1(F) nls_cp437(F) vfat(F) fat(F) dm_multipath(F) scsi_dh(F) usb_storage(F) ext2(F) rpcsec_gss_krb5(F) nfsv4(F) nfsd(F) auth_rpcgss(F) nfs_acl(F) nfs(F) lockd(F) sunrpc(F) fscache(F) ofpart(F) redboot(F) cmdlinepart(F) cfi_cmdset_0002(F) cfi_probe(F) cfi_util(F) gen_probe(F) physmap_of(F) map_funcs(F) chipreg(F) mtd(F) lm90(F) at24(F) caam(F) i2c_mpc(F) mpc85xx_edac(F) edac_core(F) uio_pdrv_genirq(F) uio(F) ata_generic(F) ahci(F) libahci(F) pata_jmicron(F) fsl_pq_mdio(F) [65426.759065] CPU: 6 PID: 19006 Comm: umount.nfs Tainted: GF 3.11.0-5-powerpc-e500mc #8-Ubuntu [65426.768461] task: ea0f3900 ti: ea05c000 task.ti: ea05c000 [65426.773858] NIP: f99a0f50 LR: f998250c CTR: c07f85c0 [65426.778820] REGS: ea05dda0 TRAP: 0700 Tainted: GF (3.11.0-5-powerpc-e500mc) [65426.786998] MSR: 00029002 CE,EE,ME CR: 24000422 XER: 2000 [65426.793107] GPR00: f9982d20 ea05de50 ea0f3900 e7e66ee0 00029002 0001 GPR08: eadc2504 0001 c07f85c0 1002e3ac 1001 8000 GPR16: 100112e4 1000f3c4 1000f3e0 1000f594 1000f59c 1000f440 c0b4fb6c 1001 GPR24: 1001 c0089180 ea05de84 f99ab064 ea2bf500 c0b41380 ea54ef00 [65426.822949] NIP [f99a0f50] rpc_remove_client_dir+0x0/0x30 [sunrpc] [65426.829143] LR [f998250c] __rpc_clnt_remove_pipedir+0x5c/0x80 [sunrpc] [65426.835668] Call Trace: [65426.838128] [ea05de50] [f999f9f0] rpc_get_sb_net+0x70/0xa0 [sunrpc] (unreliable) [65426.845542] [ea05de60] [f9982d20] rpc_free_client+0x50/0xa0 [sunrpc] [65426.851910] [ea05de70] [f9982c58] rpc_shutdown_client+0x68/0xe0 [sunrpc] [65426.858647] [ea05dec0] [f9a3fd04] nfs_free_server+0x124/0x190 [nfs] [65426.864920] [ea05dee0] [c01c5f34] deactivate_locked_super+0x74/0xa0 [65426.871193] [ea05def0] [c01e9184] SyS_umount+0x94/0x3d0 [65426.876422] [ea05df40] [c0010604] ret_from_syscall+0x0/0x3c [65426.882002] --- Exception: c01 at 0xfec55dc [65426.882002] LR = 0xff88d54 [65426.889225] Instruction dump: [65426.892191] 1b50 aa32 0706 0120 aa36 0706 1af8 aa3a [65426.899964] 0706 1aa0 aa46 0704 0058 aa4a 0704 1b74 [65426.907949] udlfb: /dev/fb0 FB_BLANK mode 1 -- 0 [65426.912666] udlfb: set_par mode 1024x768 [65426.927196] ---[ end trace b25b3039b99e0391 ]--- [65426.931816] umount.nfs: /work: not mounted [65426.960296] init: idmapd main process (2394) killed by TERM signal [65426.983998] init: systemd-logind main process (544) killed by TERM signal [65426.994070] Oops: Exception in kernel mode, sig: 11 [#2] [65426.999386] SMP NR_CPUS=8 P4080 DS [65427.002787] Modules linked in: btrfs(F) raid6_pq(F) zlib_deflate(F) xor(F) ufs(F) qnx4(F) hfsplus(F) hfs(F) minix(F) ntfs(F) msdos(F) jfs(F) xfs(F) libcrc32c(F) reiserfs(F) nls_iso8859_1(F) nls_cp437(F) vfat(F) fat(F) dm_multipath(F) scsi_dh(F) usb_storage(F) ext2(F) rpcsec_gss_krb5(F) nfsv4(F) nfsd(F) auth_rpcgss(F) nfs_acl(F) nfs(F) lockd(F) sunrpc(F) fscache(F) ofpart(F) redboot(F) cmdlinepart(F) cfi_cmdset_0002(F) cfi_probe(F) cfi_util(F) gen_probe(F) physmap_of(F) map_funcs(F) chipreg(F) mtd(F) lm90(F) at24(F) caam(F) i2c_mpc(F) mpc85xx_edac(F) edac_core(F) uio_pdrv_genirq(F) uio(F) ata_generic(F) ahci(F) libahci(F) pata_jmicron(F) fsl_pq_mdio(F) [65427.060734] CPU: 7 PID: 19014 Comm: umount Tainted: GF D 3.11.0-5-powerpc-e500mc #8-Ubuntu [65427.069783] task: e5d55580 ti: ea2f task.ti: ea2f [65427.075180] NIP: f99a0270 LR: c01c5f34 CTR: f99a0270 [65427.080142] REGS: ea2f1e30 TRAP: 0700 Tainted: GF D (3.11.0-5-powerpc-e500mc) [65427.088320] MSR: 00029002 CE,EE,ME CR: 24004424 XER: [65427.094430] GPR00: c01c5f24 ea2f1ee0 e5d55580 ea1e7800 c1d404e0 ea027a80 c01e7b78 0001 GPR08: f99a0270 000d 003fe0f9 100192e0 1001 8000 GPR16: 100112e4 1000f3c4 1000f3e0 1000f594 1000f59c ea027aa0 c0b4fb6c ea2f1ef8 GPR24: ea027a80 ea027a80 ea027aa0
Re: [PATCH] powerpc/eeh: drop taken reference to driver on eeh_rmv_device
On Fri, Jan 31, 2014 at 08:46:11AM +0800, Gavin Shan wrote: On Thu, Jan 30, 2014 at 11:00:48AM -0200, Thadeu Lima de Souza Cascardo wrote: Commit f5c57710dd62dd06f176934a8b4b8accbf00f9f8 (powerpc/eeh: Use partial hotplug for EEH unaware drivers) introduces eeh_rmv_device, which may grab a reference to a driver, but not release it. That prevents a driver from being removed after it has gone through EEH recovery. This patch drops the reference in either exit path if it was taken. Signed-off-by: Thadeu Lima de Souza Cascardo casca...@linux.vnet.ibm.com --- arch/powerpc/kernel/eeh_driver.c |5 - 1 files changed, 4 insertions(+), 1 deletions(-) diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c index 7bb30dc..afe7337 100644 --- a/arch/powerpc/kernel/eeh_driver.c +++ b/arch/powerpc/kernel/eeh_driver.c @@ -364,7 +364,7 @@ static void *eeh_rmv_device(void *data, void *userdata) return NULL; driver = eeh_pcid_get(dev); if (driver driver-err_handler) -return NULL; +goto out; /* Remove it from PCI subsystem */ pr_debug(EEH: Removing %s without EEH sensitive driver\n, @@ -377,6 +377,9 @@ static void *eeh_rmv_device(void *data, void *userdata) For normal case (driver without EEH support), we probably release the reference to the driver before pci_stop_and_remove_bus_device(). You are right, we need to call it before we call pci_stop_and_remove_bus_device, otherwise dev-driver will be NULL, and eeh_pcid_put will not do module_put. On the other hand, we could change the call to eeh_pcid_put to accept struct pci_driver instead. pci_stop_and_remove_bus_device(dev); pci_unlock_rescan_remove(); +out: +if (driver) +eeh_pcid_put(dev); return NULL; We needn't if (driver) here as eeh_pcid_put() already had the check. What if try_module_get returned false on eeh_pcid_get? How about something like the patch below? } Thanks, Gavin --- diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c index 7bb30dc..3a397fa 100644 --- a/arch/powerpc/kernel/eeh_driver.c +++ b/arch/powerpc/kernel/eeh_driver.c @@ -352,6 +352,7 @@ static void *eeh_rmv_device(void *data, void *userdata) struct eeh_dev *edev = (struct eeh_dev *)data; struct pci_dev *dev = eeh_dev_to_pci_dev(edev); int *removed = (int *)userdata; + bool has_err_handler; /* * Actually, we should remove the PCI bridges as well. @@ -362,8 +363,12 @@ static void *eeh_rmv_device(void *data, void *userdata) */ if (!dev || (dev-hdr_type PCI_HEADER_TYPE_BRIDGE)) return NULL; + driver = eeh_pcid_get(dev); - if (driver driver-err_handler) + has_err_handler = driver driver-err_handler; + if (driver) + eeh_pcid_put(dev); + if (has_err_handler) return NULL; /* Remove it from PCI subsystem */ --- ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH] powerpc/eeh: drop taken reference to driver on eeh_rmv_device
Commit f5c57710dd62dd06f176934a8b4b8accbf00f9f8 (powerpc/eeh: Use partial hotplug for EEH unaware drivers) introduces eeh_rmv_device, which may grab a reference to a driver, but not release it. That prevents a driver from being removed after it has gone through EEH recovery. This patch drops the reference in either exit path if it was taken. Signed-off-by: Thadeu Lima de Souza Cascardo casca...@linux.vnet.ibm.com --- arch/powerpc/kernel/eeh_driver.c |5 - 1 files changed, 4 insertions(+), 1 deletions(-) diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c index 7bb30dc..afe7337 100644 --- a/arch/powerpc/kernel/eeh_driver.c +++ b/arch/powerpc/kernel/eeh_driver.c @@ -364,7 +364,7 @@ static void *eeh_rmv_device(void *data, void *userdata) return NULL; driver = eeh_pcid_get(dev); if (driver driver-err_handler) - return NULL; + goto out; /* Remove it from PCI subsystem */ pr_debug(EEH: Removing %s without EEH sensitive driver\n, @@ -377,6 +377,9 @@ static void *eeh_rmv_device(void *data, void *userdata) pci_stop_and_remove_bus_device(dev); pci_unlock_rescan_remove(); +out: + if (driver) + eeh_pcid_put(dev); return NULL; } -- 1.7.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH] powernv: fix VFIO support with PHB3
I have recently found out that no iommu_groups could be found under /sys/ on a P8. That prevents PCI passthrough from working. During my investigation, I found out there seems to be a missing iommu_register_group for PHB3. The following patch seems to fix the problem. After applying it, I see iommu_groups under /sys/kernel/iommu_groups/, and can also bind vfio-pci to an adapter, which gives me a device at /dev/vfio/. Signed-off-by: Thadeu Lima de Souza Cascardo casca...@linux.vnet.ibm.com --- This is now applied on top of benh's tree, branch next. Alexey, is this now OK for you? Thanks. Cascardo. --- arch/powerpc/platforms/powernv/pci-ioda.c |1 + 1 files changed, 1 insertions(+), 0 deletions(-) diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index 614356c..f0e6871 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -720,6 +720,7 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb, tbl-it_type = TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE; } iommu_init_table(tbl, phb-hose-node); + iommu_register_group(tbl, pci_domain_nr(pe-pbus), pe-pe_number); if (pe-pdev) set_iommu_table_base_and_group(pe-pdev-dev, tbl); -- 1.7.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH] powernv: fix VFIO support with PHB3
On Sat, Dec 07, 2013 at 11:58:44AM +1100, Alexey Kardashevskiy wrote: On 12/06/2013 11:21 PM, Thadeu Lima de Souza Cascardo wrote: I have recently found out that no iommu_groups could be found under /sys/ on a P8. That prevents PCI passthrough from working. During my investigation, I found out there seems to be a missing iommu_register_group for PHB3. The following patch seems to fix the problem. After applying it, I see iommu_groups under /sys/kernel/iommu_groups/, and can also bind vfio-pci to an adapter, which gives me a device at /dev/vfio/. Signed-off-by: Thadeu Lima de Souza Cascardo casca...@linux.vnet.ibm.com --- arch/powerpc/platforms/powernv/pci-ioda.c |1 + 1 files changed, 1 insertions(+), 0 deletions(-) diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index 084cdfa..2c6d173 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -720,6 +720,7 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb, tbl-it_type = TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE; } iommu_init_table(tbl, phb-hose-node); + iommu_register_group(tbl, pci_domain_nr(pe-pbus), pe-pe_number); if (pe-pdev) set_iommu_table_base(pe-pdev-dev, tbl); This does not seem absolutely right - normally set_iommu_table_base() is replaced with set_iommu_table_base_and_group() or you will not see some devices in a group and this may make VFIO unhappy. Alexey, Your patch [PATCH v9] PPC: POWERNV: move iommu_add_device earlier is not upstream yet. It calls set_iommu_table_base_and_group properly. However, without calling iommu_register_group, there is no group to add the device to. Compare to pnv_pci_ioda_setup_dma_pe which calls iommu_register_group too. My patch fixes this discrepancy between IODA and IODA2. Regards. Cascardo. But - if every single device gets assigned to some group, then we can push this to Frobisher and let people test in on power8. -- Alexey ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH] powernv: fix VFIO support with PHB3
I have recently found out that no iommu_groups could be found under /sys/ on a P8. That prevents PCI passthrough from working. During my investigation, I found out there seems to be a missing iommu_register_group for PHB3. The following patch seems to fix the problem. After applying it, I see iommu_groups under /sys/kernel/iommu_groups/, and can also bind vfio-pci to an adapter, which gives me a device at /dev/vfio/. Signed-off-by: Thadeu Lima de Souza Cascardo casca...@linux.vnet.ibm.com --- arch/powerpc/platforms/powernv/pci-ioda.c |1 + 1 files changed, 1 insertions(+), 0 deletions(-) diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index 084cdfa..2c6d173 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -720,6 +720,7 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb, tbl-it_type = TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE; } iommu_init_table(tbl, phb-hose-node); + iommu_register_group(tbl, pci_domain_nr(pe-pbus), pe-pe_number); if (pe-pdev) set_iommu_table_base(pe-pdev-dev, tbl); -- 1.7.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH] powerpc/iommu: use GFP_KERNEL instead of GFP_ATOMIC in iommu_init_table()
On Tue, Oct 01, 2013 at 02:04:53PM -0700, Nishanth Aravamudan wrote: Under heavy (DLPAR?) stress, we tripped this panic() in arch/powerpc/kernel/iommu.c::iommu_init_table(): page = alloc_pages_node(nid, GFP_ATOMIC, get_order(sz)); if (!page) panic(iommu_init_table: Can't allocate %ld bytes\n, sz); Before the panic() we got a page allocation failure for an order-2 allocation. There appears to be memory free, but perhaps not in the ATOMIC context. I looked through all the call-sites of iommu_init_table() and didn't see any obvious reason to need an ATOMIC allocation. Most call-sites in fact have an explicit GFP_KERNEL allocation shortly before the call to iommu_init_table(), indicating we are not in an atomic context. There is some indirection for some paths, but I didn't see any locks indicating that GFP_KERNEL is inappropriate. With this change under the same conditions, we have not been able to reproduce the panic. Signed-off-by: Nishanth Aravamudan n...@us.ibm.com diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c index 0adab06..572bb5b 100644 --- a/arch/powerpc/kernel/iommu.c +++ b/arch/powerpc/kernel/iommu.c @@ -661,7 +661,7 @@ struct iommu_table *iommu_init_table(struct iommu_table *tbl, int nid) /* number of bytes needed for the bitmap */ sz = BITS_TO_LONGS(tbl-it_size) * sizeof(unsigned long); - page = alloc_pages_node(nid, GFP_ATOMIC, get_order(sz)); + page = alloc_pages_node(nid, GFP_KERNEL, get_order(sz)); if (!page) panic(iommu_init_table: Can't allocate %ld bytes\n, sz); tbl-it_map = page_address(page); I didn't respond to the previous message, but also checked if there were any history on the logs, and found this was as it is from the start. I also found no other reasons why it needs to be atomic. Therefore, Acked-by: Thadeu Lima de Souza Cascardo casca...@linux.vnet.ibm.com ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH] KVM: PPC: Book3S PR: return appropriate error when allocation fails
err was overwritten by a previous function call, and checked to be 0. If the following page allocation fails, 0 is going to be returned instead of -ENOMEM. Signed-off-by: Thadeu Lima de Souza Cascardo casca...@linux.vnet.ibm.com --- arch/powerpc/kvm/book3s_pr.c |5 +++-- 1 files changed, 3 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/kvm/book3s_pr.c b/arch/powerpc/kvm/book3s_pr.c index 19498a5..c6e13d9 100644 --- a/arch/powerpc/kvm/book3s_pr.c +++ b/arch/powerpc/kvm/book3s_pr.c @@ -1047,11 +1047,12 @@ struct kvm_vcpu *kvmppc_core_vcpu_create(struct kvm *kvm, unsigned int id) if (err) goto free_shadow_vcpu; + err = -ENOMEM; p = __get_free_page(GFP_KERNEL|__GFP_ZERO); - /* the real shared page fills the last 4k of our page */ - vcpu-arch.shared = (void*)(p + PAGE_SIZE - 4096); if (!p) goto uninit_vcpu; + /* the real shared page fills the last 4k of our page */ + vcpu-arch.shared = (void *)(p + PAGE_SIZE - 4096); #ifdef CONFIG_PPC_BOOK3S_64 /* default to book3s_64 (970fx) */ -- 1.7.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH] eeh: add eeh_dev to the cache during boot
commit f8f7d63fd96ead101415a1302035137a866f8998 (powerpc/eeh: Trace eeh device from I/O cache) broke EEH on pseries for devices that were present during boot and have not been hotplugged/DLPARed. eeh_check_failure will get the eeh_dev from the cache, and will get NULL. eeh_addr_cache_build adds the addresses to the cache, but eeh_dev for the giving pci_device is not set yet. Just reordering the call to eeh_addr_cache_insert_dev works fine. The ordering is similar to the one in eeh_add_device_late. Signed-off-by: Thadeu Lima de Souza Cascardo casca...@linux.vnet.ibm.com --- Note that this is broken since v3.7-rc1 and this patch applies on top of v3.10-rc7. Just changing the patches will apply it on top of next-20130626, where it has moved to arch/powerpc/kernel/. I also realized when writing the log that maybe calling eeh_add_device_late instead could be a better option. Please, share your comments. --- arch/powerpc/platforms/pseries/eeh_cache.c |4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/platforms/pseries/eeh_cache.c b/arch/powerpc/platforms/pseries/eeh_cache.c index 5a4c879..5ce3ba7 100644 --- a/arch/powerpc/platforms/pseries/eeh_cache.c +++ b/arch/powerpc/platforms/pseries/eeh_cache.c @@ -294,8 +294,6 @@ void __init eeh_addr_cache_build(void) spin_lock_init(pci_io_addr_cache_root.piar_lock); for_each_pci_dev(dev) { - eeh_addr_cache_insert_dev(dev); - dn = pci_device_to_OF_node(dev); if (!dn) continue; @@ -308,6 +306,8 @@ void __init eeh_addr_cache_build(void) dev-dev.archdata.edev = edev; edev-pdev = dev; + eeh_addr_cache_insert_dev(dev); + eeh_sysfs_add_device(dev); } -- 1.7.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 2/2] ppc64: Add arch-specific pcie_get_speed_cap_mask
On Thu, Mar 14, 2013 at 02:45:47PM -0300, luca...@linux.vnet.ibm.com wrote: From: Lucas Kannebley Tavares luca...@linux.vnet.ibm.com Betters support for gen2 speed detections on PCI buses on ppc64 architectures. Signed-off-by: Lucas Kannebley Tavares luca...@linux.vnet.ibm.com --- arch/powerpc/platforms/pseries/pci.c | 32 1 files changed, 32 insertions(+), 0 deletions(-) diff --git a/arch/powerpc/platforms/pseries/pci.c b/arch/powerpc/platforms/pseries/pci.c index 0b580f4..58469fe 100644 --- a/arch/powerpc/platforms/pseries/pci.c +++ b/arch/powerpc/platforms/pseries/pci.c @@ -24,6 +24,7 @@ #include linux/kernel.h #include linux/pci.h #include linux/string.h +#include linux/device.h #include asm/eeh.h #include asm/pci-bridge.h @@ -108,3 +109,34 @@ static void fixup_winbond_82c105(struct pci_dev* dev) } DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_WINBOND, PCI_DEVICE_ID_WINBOND_82C105, fixup_winbond_82c105); + +int pcibios_get_speed_cap_mask(struct pci_dev *dev, u32 *mask) +{ + struct device_node *dn, *pdn; + const uint32_t *pcie_link_speed_stats = NULL; + + *mask = 0; + dn = pci_bus_to_OF_node(dev-bus); + + /* Find nearest ibm,pcie-link-speed-stats, walking up the device tree */ + for (pdn = dn; pdn != NULL; pdn = pdn-parent) { + pcie_link_speed_stats = (const uint32_t *) of_get_property(pdn, + ibm,pcie-link-speed-stats, NULL); + if (pcie_link_speed_stats != NULL) + break; + } + + if (pcie_link_speed_stats == NULL) { + dev_info(dev-dev, no ibm,pcie-link-speed-stats property\n); + return -EINVAL; + } + + switch (pcie_link_speed_stats[0]) { + case 0x02: + *mask |= PCIE_SPEED_50; + case 0x01: + *mask |= PCIE_SPEED_25; + } I recall seeing this returns 0x00 as well. Maybe you should include both 0x00 and a default case and return EINVAL. Regards. Cascardo. + + return 0; +} -- 1.7.4.4 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v2] ppc/iommu: use find_first_bit to look up entries in the iommu table
On Tue, Jan 29, 2013 at 11:35:56AM +1100, Benjamin Herrenschmidt wrote: On Thu, 2013-01-10 at 17:33 -0200, Thadeu Lima de Souza Cascardo wrote: Signed-off-by: Thadeu Lima de Souza Cascardo casca...@linux.vnet.ibm.com --- v2: Remove the unneeded extra variable i, which caused build failure. I believe something equivalent is already in -next, can you dbl check ? Cheers, Ben. There is, and it's using bitmap_empty, which is even more clear. Thanks. Cascardo. --- arch/powerpc/kernel/iommu.c |9 ++--- 1 files changed, 2 insertions(+), 7 deletions(-) diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c index 6d48ff8..0fc44d2 100644 --- a/arch/powerpc/kernel/iommu.c +++ b/arch/powerpc/kernel/iommu.c @@ -708,7 +708,7 @@ struct iommu_table *iommu_init_table(struct iommu_table *tbl, int nid) void iommu_free_table(struct iommu_table *tbl, const char *node_name) { - unsigned long bitmap_sz, i; + unsigned long bitmap_sz; unsigned int order; if (!tbl || !tbl-it_map) { @@ -725,14 +725,9 @@ void iommu_free_table(struct iommu_table *tbl, const char *node_name) clear_bit(0, tbl-it_map); /* verify that table contains no entries */ - /* it_size is in entries, and we're examining 64 at a time */ - for (i = 0; i (tbl-it_size/64); i++) { - if (tbl-it_map[i] != 0) { + if (find_first_bit(tbl-it_map, tbl-it_size) tbl-it_size) printk(KERN_WARNING %s: Unexpected TCEs for %s\n, __func__, node_name); - break; - } - } /* calculate bitmap size in bytes */ bitmap_sz = (tbl-it_size + 7) / 8; ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH v2] ppc/iommu: use find_first_bit to look up entries in the iommu table
Signed-off-by: Thadeu Lima de Souza Cascardo casca...@linux.vnet.ibm.com --- v2: Remove the unneeded extra variable i, which caused build failure. --- arch/powerpc/kernel/iommu.c |9 ++--- 1 files changed, 2 insertions(+), 7 deletions(-) diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c index 6d48ff8..0fc44d2 100644 --- a/arch/powerpc/kernel/iommu.c +++ b/arch/powerpc/kernel/iommu.c @@ -708,7 +708,7 @@ struct iommu_table *iommu_init_table(struct iommu_table *tbl, int nid) void iommu_free_table(struct iommu_table *tbl, const char *node_name) { - unsigned long bitmap_sz, i; + unsigned long bitmap_sz; unsigned int order; if (!tbl || !tbl-it_map) { @@ -725,14 +725,9 @@ void iommu_free_table(struct iommu_table *tbl, const char *node_name) clear_bit(0, tbl-it_map); /* verify that table contains no entries */ - /* it_size is in entries, and we're examining 64 at a time */ - for (i = 0; i (tbl-it_size/64); i++) { - if (tbl-it_map[i] != 0) { + if (find_first_bit(tbl-it_map, tbl-it_size) tbl-it_size) printk(KERN_WARNING %s: Unexpected TCEs for %s\n, __func__, node_name); - break; - } - } /* calculate bitmap size in bytes */ bitmap_sz = (tbl-it_size + 7) / 8; -- 1.7.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH] ppc/iommu: prevent false TCE leak message
On Fri, Dec 28, 2012 at 01:21:35PM +0800, Gavin Shan wrote: On Thu, Dec 27, 2012 at 02:28:06PM -0200, Thadeu Lima de Souza Cascardo wrote: When a device DMA window includes the address 0, it's reserved in the TCE bitmap to avoid returning that address to drivers. When the device is removed, the bitmap is checked for any mappings not removed by the driver, indicating a possible DMA mapping leak. Since the reserved address is not cleared, a message is printed, warning of such a leak. Check for the reservation, and clear it before checking for any other standing mappings. Signed-off-by: Thadeu Lima de Souza Cascardo casca...@linux.vnet.ibm.com --- arch/powerpc/kernel/iommu.c |5 + 1 files changed, 5 insertions(+), 0 deletions(-) diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c index 8226c6c..226e9e5 100644 --- a/arch/powerpc/kernel/iommu.c +++ b/arch/powerpc/kernel/iommu.c @@ -717,6 +717,11 @@ void iommu_free_table(struct iommu_table *tbl, const char *node_name) return; } +/* In case we have reserved the first bit, we should not emit + * the warning below. */ At least, the comment would look like: /* * xxx */ Sure. New code should always follow coding style. :-) How do you suggest merging with the comment below? I think it's closer to the code it comments about, so I'd rather keep it where it is. +if (tbl-it_offset == 0) +clear_bit(0, tbl-it_map); + /* verify that table contains no entries */ /* it_size is in entries, and we're examining 64 at a time */ The comment would be merged as well? :-) I also considered replacing this code by this: /* verify that table contains no entries */ - /* it_size is in entries, and we're examining 64 at a time */ - for (i = 0; i (tbl-it_size/64); i++) { - if (tbl-it_map[i] != 0) { + if (find_first_bit(tbl-it_map, tbl-it_size) tbl-it_size) printk(KERN_WARNING %s: Unexpected TCEs for %s\n, __func__, node_name); - break; - } - } I'll resend the unreserving patch with the fixed comment styling, but without merging comments. And I will send this other patch for comments. Regards. Thadeu Cascardo. for (i = 0; i (tbl-it_size/64); i++) { Thanks, Gavin ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH 1/2] ppc/iommu: prevent false TCE leak message
When a device DMA window includes the address 0, it's reserved in the TCE bitmap to avoid returning that address to drivers. When the device is removed, the bitmap is checked for any mappings not removed by the driver, indicating a possible DMA mapping leak. Since the reserved address is not cleared, a message is printed, warning of such a leak. Check for the reservation, and clear it before checking for any other standing mappings. Signed-off-by: Thadeu Lima de Souza Cascardo casca...@linux.vnet.ibm.com --- arch/powerpc/kernel/iommu.c |7 +++ 1 files changed, 7 insertions(+), 0 deletions(-) diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c index 8226c6c..6d48ff8 100644 --- a/arch/powerpc/kernel/iommu.c +++ b/arch/powerpc/kernel/iommu.c @@ -717,6 +717,13 @@ void iommu_free_table(struct iommu_table *tbl, const char *node_name) return; } + /* +* In case we have reserved the first bit, we should not emit +* the warning below. +*/ + if (tbl-it_offset == 0) + clear_bit(0, tbl-it_map); + /* verify that table contains no entries */ /* it_size is in entries, and we're examining 64 at a time */ for (i = 0; i (tbl-it_size/64); i++) { -- 1.7.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH 2/2] ppc/iommu: use find_first_bit to look up entries in the iommu table
Signed-off-by: Thadeu Lima de Souza Cascardo casca...@linux.vnet.ibm.com --- arch/powerpc/kernel/iommu.c |7 +-- 1 files changed, 1 insertions(+), 6 deletions(-) diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c index 6d48ff8..91e2b99 100644 --- a/arch/powerpc/kernel/iommu.c +++ b/arch/powerpc/kernel/iommu.c @@ -725,14 +725,9 @@ void iommu_free_table(struct iommu_table *tbl, const char *node_name) clear_bit(0, tbl-it_map); /* verify that table contains no entries */ - /* it_size is in entries, and we're examining 64 at a time */ - for (i = 0; i (tbl-it_size/64); i++) { - if (tbl-it_map[i] != 0) { + if (find_first_bit(tbl-it_map, tbl-it_size) tbl-it_size) printk(KERN_WARNING %s: Unexpected TCEs for %s\n, __func__, node_name); - break; - } - } /* calculate bitmap size in bytes */ bitmap_sz = (tbl-it_size + 7) / 8; -- 1.7.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH 1/2] EEH/OF: checking for CONFIG_EEH is not needed
The functions used are already defined as empty inline functions for the case where EEH is disabled. Signed-off-by: Thadeu Lima de Souza Cascardo casca...@linux.vnet.ibm.com --- arch/powerpc/kernel/of_platform.c |4 1 files changed, 0 insertions(+), 4 deletions(-) diff --git a/arch/powerpc/kernel/of_platform.c b/arch/powerpc/kernel/of_platform.c index 2049f2d..c5fc6b2 100644 --- a/arch/powerpc/kernel/of_platform.c +++ b/arch/powerpc/kernel/of_platform.c @@ -71,10 +71,8 @@ static int __devinit of_pci_phb_probe(struct platform_device *dev) eeh_dev_phb_init_dynamic(phb); /* Register devices with EEH */ -#ifdef CONFIG_EEH if (dev-dev.of_node-child) eeh_add_device_tree_early(dev-dev.of_node); -#endif /* CONFIG_EEH */ /* Scan the bus */ pcibios_scan_phb(phb); @@ -88,9 +86,7 @@ static int __devinit of_pci_phb_probe(struct platform_device *dev) pcibios_claim_one_bus(phb-bus); /* Finish EEH setup */ -#ifdef CONFIG_EEH eeh_add_device_tree_late(phb-bus); -#endif /* Add probed PCI devices to the device model */ pci_bus_add_devices(phb-bus); -- 1.7.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH 2/2] ppc/EEH: fix crash when adding a device in a slot with DDW
The DDW code uses a eeh_dev struct from the pci_dev. However, this is not set until eeh_add_device_late is called. Since pci_bus_add_devices is called before eeh_add_device_late, the PCI devices are added to the bus, making drivers' probe hooks to be called. These will call set_dma_mask, which will call the DDW code, which will require the eeh_dev struct from pci_dev. This would result in a crash, due to a NULL dereference. Calling eeh_add_device_late after pci_bus_add_devices would make the system BUG, because device files shouldn't be added to devices there were not added to the system. So, a new function is needed to add such files only after pci_bus_add_devices have been called. Cc: sta...@vger.kernel.org Signed-off-by: Thadeu Lima de Souza Cascardo casca...@linux.vnet.ibm.com --- arch/powerpc/include/asm/eeh.h |3 +++ arch/powerpc/kernel/of_platform.c|3 +++ arch/powerpc/kernel/pci-common.c |7 +-- arch/powerpc/platforms/pseries/eeh.c | 24 +++- 4 files changed, 34 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h index b0ef738..0f816da 100644 --- a/arch/powerpc/include/asm/eeh.h +++ b/arch/powerpc/include/asm/eeh.h @@ -201,6 +201,7 @@ int eeh_dev_check_failure(struct eeh_dev *edev); void __init eeh_addr_cache_build(void); void eeh_add_device_tree_early(struct device_node *); void eeh_add_device_tree_late(struct pci_bus *); +void eeh_add_sysfs_files(struct pci_bus *); void eeh_remove_bus_device(struct pci_dev *, int); /** @@ -240,6 +241,8 @@ static inline void eeh_add_device_tree_early(struct device_node *dn) { } static inline void eeh_add_device_tree_late(struct pci_bus *bus) { } +static inline void eeh_add_sysfs_files(struct pci_bus *bus) { } + static inline void eeh_remove_bus_device(struct pci_dev *dev, int purge_pe) { } static inline void eeh_lock(void) { } diff --git a/arch/powerpc/kernel/of_platform.c b/arch/powerpc/kernel/of_platform.c index c5fc6b2..500dd32 100644 --- a/arch/powerpc/kernel/of_platform.c +++ b/arch/powerpc/kernel/of_platform.c @@ -91,6 +91,9 @@ static int __devinit of_pci_phb_probe(struct platform_device *dev) /* Add probed PCI devices to the device model */ pci_bus_add_devices(phb-bus); + /* sysfs files should only be added after devices are added */ + eeh_add_sysfs_files(phb-bus); + return 0; } diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c index 7f94f76..4d3de7e 100644 --- a/arch/powerpc/kernel/pci-common.c +++ b/arch/powerpc/kernel/pci-common.c @@ -1480,11 +1480,14 @@ void pcibios_finish_adding_to_bus(struct pci_bus *bus) pcibios_allocate_bus_resources(bus); pcibios_claim_one_bus(bus); + /* Fixup EEH */ + eeh_add_device_tree_late(bus); + /* Add new devices to global lists. Register in proc, sysfs. */ pci_bus_add_devices(bus); - /* Fixup EEH */ - eeh_add_device_tree_late(bus); + /* sysfs files should only be added after devices are added */ + eeh_add_sysfs_files(bus); } EXPORT_SYMBOL_GPL(pcibios_finish_adding_to_bus); diff --git a/arch/powerpc/platforms/pseries/eeh.c b/arch/powerpc/platforms/pseries/eeh.c index 9a04322..6b73d6c 100644 --- a/arch/powerpc/platforms/pseries/eeh.c +++ b/arch/powerpc/platforms/pseries/eeh.c @@ -788,7 +788,6 @@ static void eeh_add_device_late(struct pci_dev *dev) dev-dev.archdata.edev = edev; eeh_addr_cache_insert_dev(dev); - eeh_sysfs_add_device(dev); } /** @@ -815,6 +814,29 @@ void eeh_add_device_tree_late(struct pci_bus *bus) EXPORT_SYMBOL_GPL(eeh_add_device_tree_late); /** + * eeh_add_sysfs_files - Add EEH sysfs files for the indicated PCI bus + * @bus: PCI bus + * + * This routine must be used to add EEH sysfs files for PCI + * devices which are attached to the indicated PCI bus. The PCI bus + * is added after system boot through hotplug or dlpar. + */ +void eeh_add_sysfs_files(struct pci_bus *bus) +{ + struct pci_dev *dev; + + list_for_each_entry(dev, bus-devices, bus_list) { + eeh_sysfs_add_device(dev); + if (dev-hdr_type == PCI_HEADER_TYPE_BRIDGE) { + struct pci_bus *subbus = dev-subordinate; + if (subbus) + eeh_add_sysfs_files(subbus); + } + } +} +EXPORT_SYMBOL_GPL(eeh_add_sysfs_files); + +/** * eeh_remove_device - Undo EEH setup for the indicated pci device * @dev: pci device to be removed * @purge_pe: remove the PE or not -- 1.7.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH] ppc/iommu: prevent false TCE leak message
When a device DMA window includes the address 0, it's reserved in the TCE bitmap to avoid returning that address to drivers. When the device is removed, the bitmap is checked for any mappings not removed by the driver, indicating a possible DMA mapping leak. Since the reserved address is not cleared, a message is printed, warning of such a leak. Check for the reservation, and clear it before checking for any other standing mappings. Signed-off-by: Thadeu Lima de Souza Cascardo casca...@linux.vnet.ibm.com --- arch/powerpc/kernel/iommu.c |5 + 1 files changed, 5 insertions(+), 0 deletions(-) diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c index 8226c6c..226e9e5 100644 --- a/arch/powerpc/kernel/iommu.c +++ b/arch/powerpc/kernel/iommu.c @@ -717,6 +717,11 @@ void iommu_free_table(struct iommu_table *tbl, const char *node_name) return; } + /* In case we have reserved the first bit, we should not emit +* the warning below. */ + if (tbl-it_offset == 0) + clear_bit(0, tbl-it_map); + /* verify that table contains no entries */ /* it_size is in entries, and we're examining 64 at a time */ for (i = 0; i (tbl-it_size/64); i++) { -- 1.7.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH] ppc/EEH: fix crash when adding a device in a slot with DDW
The DDW code uses a eeh_dev struct from the pci_dev. However, this is not set until eeh_add_device_late is called. Since pci_bus_add_devices is called before eeh_add_device_late, the PCI devices are added to the bus, making drivers' probe hooks to be called. These will call set_dma_mask, which will call the DDW code, which will require the eeh_dev struct from pci_dev. This would result in a crash, due to a NULL dereference. Calling eeh_add_device_late after pci_bus_add_devices would make the system BUG, because device files shouldn't be added to devices there were not added to the system. So, a new function is needed to add such files only after pci_bus_add_devices have been called. Cc: sta...@vger.kernel.org Signed-off-by: Thadeu Lima de Souza Cascardo casca...@linux.vnet.ibm.com --- arch/powerpc/include/asm/eeh.h |3 +++ arch/powerpc/kernel/pci-common.c |7 +-- arch/powerpc/platforms/pseries/eeh.c | 24 +++- 3 files changed, 31 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h index b0ef738..71aac19 100644 --- a/arch/powerpc/include/asm/eeh.h +++ b/arch/powerpc/include/asm/eeh.h @@ -201,6 +201,7 @@ int eeh_dev_check_failure(struct eeh_dev *edev); void __init eeh_addr_cache_build(void); void eeh_add_device_tree_early(struct device_node *); void eeh_add_device_tree_late(struct pci_bus *); +void eeh_add_device_tree_files(struct pci_bus *); void eeh_remove_bus_device(struct pci_dev *, int); /** @@ -240,6 +241,8 @@ static inline void eeh_add_device_tree_early(struct device_node *dn) { } static inline void eeh_add_device_tree_late(struct pci_bus *bus) { } +static inline void eeh_add_device_tree_files(struct pci_bus *bus) { } + static inline void eeh_remove_bus_device(struct pci_dev *dev, int purge_pe) { } static inline void eeh_lock(void) { } diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c index 7f94f76..7b1f14c 100644 --- a/arch/powerpc/kernel/pci-common.c +++ b/arch/powerpc/kernel/pci-common.c @@ -1480,11 +1480,14 @@ void pcibios_finish_adding_to_bus(struct pci_bus *bus) pcibios_allocate_bus_resources(bus); pcibios_claim_one_bus(bus); + /* Fixup EEH */ + eeh_add_device_tree_late(bus); + /* Add new devices to global lists. Register in proc, sysfs. */ pci_bus_add_devices(bus); - /* Fixup EEH */ - eeh_add_device_tree_late(bus); + /* Add EEH sysfs files */ + eeh_add_device_tree_files(bus); } EXPORT_SYMBOL_GPL(pcibios_finish_adding_to_bus); diff --git a/arch/powerpc/platforms/pseries/eeh.c b/arch/powerpc/platforms/pseries/eeh.c index 9a04322..a667a34 100644 --- a/arch/powerpc/platforms/pseries/eeh.c +++ b/arch/powerpc/platforms/pseries/eeh.c @@ -788,7 +788,6 @@ static void eeh_add_device_late(struct pci_dev *dev) dev-dev.archdata.edev = edev; eeh_addr_cache_insert_dev(dev); - eeh_sysfs_add_device(dev); } /** @@ -815,6 +814,29 @@ void eeh_add_device_tree_late(struct pci_bus *bus) EXPORT_SYMBOL_GPL(eeh_add_device_tree_late); /** + * eeh_add_device_tree_files - Add EEH sysfs files for the indicated PCI bus + * @bus: PCI bus + * + * This routine must be used to add EEH sysfs files for PCI + * devices which are attached to the indicated PCI bus. The PCI bus + * is added after system boot through hotplug or dlpar. + */ +void eeh_add_device_tree_files(struct pci_bus *bus) +{ + struct pci_dev *dev; + + list_for_each_entry(dev, bus-devices, bus_list) { + eeh_sysfs_add_device(dev); + if (dev-hdr_type == PCI_HEADER_TYPE_BRIDGE) { + struct pci_bus *subbus = dev-subordinate; + if (subbus) + eeh_add_device_tree_files(subbus); + } + } +} +EXPORT_SYMBOL_GPL(eeh_add_device_tree_files); + +/** * eeh_remove_device - Undo EEH setup for the indicated pci device * @dev: pci device to be removed * @purge_pe: remove the PE or not -- 1.7.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH] mlx4_en: map entire pages to increase throughput
On Mon, Jul 16, 2012 at 10:27:57AM -0700, Rick Jones wrote: On 07/16/2012 10:01 AM, Thadeu Lima de Souza Cascardo wrote: In its receive path, mlx4_en driver maps each page chunk that it pushes to the hardware and unmaps it when pushing it up the stack. This limits throughput to about 3Gbps on a Power7 8-core machine. That seems rather extraordinarily low - Power7 is supposed to be a rather high performance CPU. The last time I noticed O(3Gbit/s) on 10G for bulk transfer was before the advent of LRO/GRO - that was in the x86 space though. Is mapping really that expensive with Power7? Copying linuxppc-dev and Anton here. But I can tell you that we have lock contention when doing the mapping on the same adapter (map table per device). Anton has sent some patches that improves that *a lot*. However, for 1500 MTU, mlx4_en was doing two unmaps and two maps per packet. The problem is not the CPU power needed to do the mappings, but that we find the lock contention and end up with the CPUs more than 30% of the time spent on spin locking. One solution is to map the entire allocated page at once. However, this requires that we keep track of every page fragment we give to a descriptor. We also need to work with the discipline that all fragments will be released (in the sense that it will not be reused by the driver anymore) in the order they are allocated to the driver. This requires that we don't reuse any fragments, every single one of them must be reallocated. We do that by releasing all the fragments that are processed and only after finished processing the descriptors, we start the refill. We also must somehow guarantee that we either refill all fragments in a descriptor or none at all, without resorting to giving up a page fragment that we would have already given. Otherwise, we would break the discipline of only releasing the fragments in the order they were allocated. This has passed page allocation fault injections (restricted to the driver by using required-start and required-end) and device hotplug while 16 TCP streams were able to deliver more than 9Gbps. What is the effect on packet-per-second performance? (eg aggregate, burst-mode netperf TCP_RR with TCP_NODELAY set or perhaps UDP_RR) I used uperf with TCP_NODELAY and 16 threads sending from another machine 64000-sized writes for 60 seconds. I get 5898op/s (3.02Gb/s) without the patch against 18022ops/s (9.23Gb/s) with the patch. Best regards. Cascardo. rick jones -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH] mlx4_en: map entire pages to increase throughput
On Mon, Jul 16, 2012 at 12:42:41PM -0700, Rick Jones wrote: On 07/16/2012 12:06 PM, Thadeu Lima de Souza Cascardo wrote: On Mon, Jul 16, 2012 at 10:27:57AM -0700, Rick Jones wrote: What is the effect on packet-per-second performance? (eg aggregate, burst-mode netperf TCP_RR with TCP_NODELAY set or perhaps UDP_RR) I used uperf with TCP_NODELAY and 16 threads sending from another machine 64000-sized writes for 60 seconds. I get 5898op/s (3.02Gb/s) without the patch against 18022ops/s (9.23Gb/s) with the patch. I was thinking more along the lines of an additional comparison, explicitly using netperf TCP_RR or something like it, not just the packets per second from a bulk transfer test. rick -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html I used a uperf profile that is similar to TCP_RR. It writes, then reads some bytes. I kept the TCP_NODELAY flag. Without the patch, I saw the following: packet size ops/s Gb/s 1 337024 0.0027 90 276620 0.199 900 190455 1.37 400068863 2.20 900045638 3.29 6 94094.52 With the patch: packet size ops/s Gb/s 1 451738 0.0036 90 345682 0.248 900 272258 1.96 4000127055 4.07 9000106614 7.68 6 30671 14.72 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH] mlx4_en: map entire pages to increase throughput
On Mon, Jul 16, 2012 at 11:43:33PM +0300, Or Gerlitz wrote: On Mon, Jul 16, 2012 at 10:42 PM, Rick Jones rick.jon...@hp.com wrote: I was thinking more along the lines of an additional comparison, explicitly using netperf TCP_RR or something like it, not just the packets per second from a bulk transfer test. TCP_STREAM from this setup before the patch would be good to know as well Hi, Or. Does the stream test that I did with uperf using messages of 64000 bytes fit? TCP_NODELAY does not make a difference in this case. I get something around 3Gbps before the patch and something around 9Gbps after the patch. Before the patch: # ./uperf-1.0.3-beta/src/uperf -m tcp.xml Starting 16 threads running profile:tcp_stream ... 0.00 seconds Txn1 0 /1.00(s) =0 16op/s Txn220.81GB /59.26(s) = 3.02Gb/s5914op/s Txn3 0 /0.00(s) =0 128295op/s --- Total 20.81GB /61.37(s) = 2.91Gb/s5712op/s Netstat statistics for this run --- Nic opkts/s ipkts/s obits/s ibits/s eth6 252459 31694 3.06Gb/s 16.74Mb/s eth02 18 3.87Kb/s 14.28Kb/s --- Run Statistics Hostname TimeData Throughput Operations Errors --- 10.0.0.2 61.47s 20.81GB 2.91Gb/s 350528 0.00 master 61.37s 20.81GB 2.91Gb/s 350528 0.00 --- Difference(%) -0.16% 0.00%0.16%0.00% 0.00% After the patch: # ./uperf-1.0.3-beta/src/uperf -m tcp.xml Starting 16 threads running profile:tcp_stream ... 0.00 seconds Txn1 0 /1.00(s) =0 16op/s Txn264.50GB /60.27(s) = 9.19Gb/s 17975op/s Txn3 0 /0.00(s) =0 --- Total 64.50GB /62.27(s) = 8.90Gb/s 17397op/s Netstat statistics for this run --- Nic opkts/s ipkts/s obits/s ibits/s eth6 769428 96018 9.31Gb/s 50.72Mb/s eth01 15 2.48Kb/s 13.59Kb/s --- Run Statistics Hostname TimeData Throughput Operations Errors --- 10.0.0.2 62.27s 64.36GB 8.88Gb/s 1081096 0.00 master 62.27s 64.50GB 8.90Gb/s 1083325 0.00 --- Difference(%) -0.00% 0.21%0.21%0.21% 0.00% Profile tcp.xml: ?xml version=1.0? profile name=TCP_STREAM group nthreads=16 transaction iterations=1 flowop type=connect options=remotehost=10.0.0.2 protocol=tcp tcp_nodelay/ /transaction transaction duration=60 flowop type=write options=count=160 size=64000/ /transaction transaction iterations=1 flowop type=disconnect / /transaction /group /profile ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH] ppc/eeh: fix crash when error happens during device probe
EEH may happen during a PCI driver probe. If the driver is trying to access some register in a loop, the EEH code will try to print the driver name. But the driver pointer in struct pci_dev is not set until probe returns successfully. Use a function to test if the device and the driver pointer is NULL before accessing the driver's name. Signed-off-by: Thadeu Lima de Souza Cascardo casca...@linux.vnet.ibm.com --- arch/powerpc/include/asm/ppc-pci.h |5 + arch/powerpc/platforms/pseries/eeh.c |4 ++-- 2 files changed, 7 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/include/asm/ppc-pci.h b/arch/powerpc/include/asm/ppc-pci.h index 43268f1..6d42297 100644 --- a/arch/powerpc/include/asm/ppc-pci.h +++ b/arch/powerpc/include/asm/ppc-pci.h @@ -142,6 +142,11 @@ static inline const char *eeh_pci_name(struct pci_dev *pdev) return pdev ? pci_name(pdev) : null; } +static inline const char *eeh_driver_name(struct pci_dev *pdev) +{ + return (pdev pdev-driver) ? pdev-driver-name : null; +} + #endif /* CONFIG_EEH */ #else /* CONFIG_PCI */ diff --git a/arch/powerpc/platforms/pseries/eeh.c b/arch/powerpc/platforms/pseries/eeh.c index 5658690..c0b40af 100644 --- a/arch/powerpc/platforms/pseries/eeh.c +++ b/arch/powerpc/platforms/pseries/eeh.c @@ -551,9 +551,9 @@ int eeh_dn_check_failure(struct device_node *dn, struct pci_dev *dev) printk (KERN_ERR EEH: %d reads ignored for recovering device at location=%s driver=%s pci addr=%s\n, pdn-eeh_check_count, location, - dev-driver-name, eeh_pci_name(dev)); + eeh_driver_name(dev), eeh_pci_name(dev)); printk (KERN_ERR EEH: Might be infinite loop in %s driver\n, - dev-driver-name); + eeh_driver_name(dev)); dump_stack(); } goto dn_unlock; -- 1.7.4.4 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: cpuidle: Default y for pseries
On Tue, Jan 10, 2012 at 03:05:35PM -, Benjamin Herrenschmidt wrote: We just replaced the pseries platform idle loops with a cpuidle backend, however that means that you won't get any power saving and won't return any unused idle time to the hypervisor unless cpuidle is enabled. Thus is should default to y when pseries is enabled. I prefer that to a select so we can still make it modular if we want to. Signed-off-by: Benjamin Herrenschmidt b...@kernel.crashing.org --- Linus, do you want to just pick that up or should I put it into powerpc.git and ask you to pull ? I will have 2 or 3 other fixes there later today, but I wanted to make sure you were ok with the approach with this specific one. Hi, Ben. Note that building with CONFIG_PSERIES_IDLE=m fails. CC [M] arch/powerpc/platforms/pseries/processor_idle.o arch/powerpc/platforms/pseries/processor_idle.c:35: error: redefinition of ‘update_smt_snooze_delay’ /root/linux/arch/powerpc/include/asm/system.h:230: note: previous definition of ‘update_smt_snooze_delay’ was here arch/powerpc/platforms/pseries/processor_idle.c:175: error: redefinition of ‘pseries_notify_cpuidle_add_cpu’ /root/linux/arch/powerpc/include/asm/system.h:231: note: previous definition of ‘pseries_notify_cpuidle_add_cpu’ was here make[2]: *** [arch/powerpc/platforms/pseries/processor_idle.o] Error 1 make[1]: *** [arch/powerpc/platforms/pseries] Error 2 make: *** [arch/powerpc/platforms] Error 2 asm/system.h has empty inline implementations for update_smt_snooze_delay and pseries_notify_cpuidle_add_cpu, which are used when CONFIG_PSERIES_IDLE is undefined. Since those two functions are used in core power architecture functions (store_smt_snooze_delay at kernel/sysfs.c and smp_xics_setup_cpu at platforms/pseries/smp.c), this requires some rework in these interactions or we should simply disable PSERIES_IDLE to be built as a module for now. Regards. Cascardo. diff --git a/drivers/cpuidle/Kconfig b/drivers/cpuidle/Kconfig index 7dbc4a8..62ca70d 100644 --- a/drivers/cpuidle/Kconfig +++ b/drivers/cpuidle/Kconfig @@ -1,7 +1,8 @@ config CPU_IDLE bool CPU idle PM support - default ACPI + default y if ACPI + default y if PPC_PSERIES help CPU idle is a generic framework for supporting software-controlled idle processor power management. It includes modular cross-platform ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH] mlx4_en: fix endianness with blue frame support
The doorbell register was being unconditionally swapped. In x86, that meant it was being swapped to BE and written to the descriptor and to memory, depending on the case of blue frame support or writing to doorbell register. On PPC, this meant it was being swapped to LE and then swapped back to BE while writing to the register. But in the blue frame case, it was being written as LE to the descriptor. The fix is not to swap doorbell unconditionally, write it to the register as BE and convert it to BE when writing it to the descriptor. Signed-off-by: Thadeu Lima de Souza Cascardo casca...@linux.vnet.ibm.com Reported-by: Richard Hendrickson richh...@us.ibm.com Cc: Eli Cohen e...@dev.mellanox.co.il Cc: Yevgeny Petrilin yevge...@mellanox.co.il Cc: Benjamin Herrenschmidt b...@kernel.crashing.org --- drivers/net/mlx4/en_tx.c |6 +++--- 1 files changed, 3 insertions(+), 3 deletions(-) diff --git a/drivers/net/mlx4/en_tx.c b/drivers/net/mlx4/en_tx.c index 6e03de0..f76ab6b 100644 --- a/drivers/net/mlx4/en_tx.c +++ b/drivers/net/mlx4/en_tx.c @@ -172,7 +172,7 @@ int mlx4_en_activate_tx_ring(struct mlx4_en_priv *priv, memset(ring-buf, 0, ring-buf_size); ring-qp_state = MLX4_QP_STATE_RST; - ring-doorbell_qpn = swab32(ring-qp.qpn 8); + ring-doorbell_qpn = ring-qp.qpn 8; mlx4_en_fill_qp_context(priv, ring-size, ring-stride, 1, 0, ring-qpn, ring-cqn, ring-context); @@ -791,7 +791,7 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev) skb_orphan(skb); if (ring-bf_enabled desc_size = MAX_BF !bounce !vlan_tag) { - *(u32 *) (tx_desc-ctrl.vlan_tag) |= ring-doorbell_qpn; + *(__be32 *) (tx_desc-ctrl.vlan_tag) |= cpu_to_be32(ring-doorbell_qpn); op_own |= htonl((bf_index 0x) 8); /* Ensure new descirptor hits memory * before setting ownership of this descriptor to HW */ @@ -812,7 +812,7 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev) wmb(); tx_desc-ctrl.owner_opcode = op_own; wmb(); - writel(ring-doorbell_qpn, ring-bf.uar-map + MLX4_SEND_DOORBELL); + iowrite32be(ring-doorbell_qpn, ring-bf.uar-map + MLX4_SEND_DOORBELL); } /* Poll CQ here */ -- 1.7.4.4 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH] mlx4_en: fix endianness with blue frame support
On Mon, Oct 10, 2011 at 01:42:23PM -0300, Thadeu Lima de Souza Cascardo wrote: The doorbell register was being unconditionally swapped. In x86, that meant it was being swapped to BE and written to the descriptor and to memory, depending on the case of blue frame support or writing to doorbell register. On PPC, this meant it was being swapped to LE and then swapped back to BE while writing to the register. But in the blue frame case, it was being written as LE to the descriptor. The fix is not to swap doorbell unconditionally, write it to the register as BE and convert it to BE when writing it to the descriptor. Signed-off-by: Thadeu Lima de Souza Cascardo casca...@linux.vnet.ibm.com Reported-by: Richard Hendrickson richh...@us.ibm.com Cc: Eli Cohen e...@dev.mellanox.co.il Cc: Yevgeny Petrilin yevge...@mellanox.co.il Cc: Benjamin Herrenschmidt b...@kernel.crashing.org --- So I tested this patch and it works for me. Thanks Ben and Eli for finding out the problem with doorbell in the descriptor. Regards, Cascardo. drivers/net/mlx4/en_tx.c |6 +++--- 1 files changed, 3 insertions(+), 3 deletions(-) diff --git a/drivers/net/mlx4/en_tx.c b/drivers/net/mlx4/en_tx.c index 6e03de0..f76ab6b 100644 --- a/drivers/net/mlx4/en_tx.c +++ b/drivers/net/mlx4/en_tx.c @@ -172,7 +172,7 @@ int mlx4_en_activate_tx_ring(struct mlx4_en_priv *priv, memset(ring-buf, 0, ring-buf_size); ring-qp_state = MLX4_QP_STATE_RST; - ring-doorbell_qpn = swab32(ring-qp.qpn 8); + ring-doorbell_qpn = ring-qp.qpn 8; mlx4_en_fill_qp_context(priv, ring-size, ring-stride, 1, 0, ring-qpn, ring-cqn, ring-context); @@ -791,7 +791,7 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev) skb_orphan(skb); if (ring-bf_enabled desc_size = MAX_BF !bounce !vlan_tag) { - *(u32 *) (tx_desc-ctrl.vlan_tag) |= ring-doorbell_qpn; + *(__be32 *) (tx_desc-ctrl.vlan_tag) |= cpu_to_be32(ring-doorbell_qpn); op_own |= htonl((bf_index 0x) 8); /* Ensure new descirptor hits memory * before setting ownership of this descriptor to HW */ @@ -812,7 +812,7 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev) wmb(); tx_desc-ctrl.owner_opcode = op_own; wmb(); - writel(ring-doorbell_qpn, ring-bf.uar-map + MLX4_SEND_DOORBELL); + iowrite32be(ring-doorbell_qpn, ring-bf.uar-map + MLX4_SEND_DOORBELL); } /* Poll CQ here */ -- 1.7.4.4 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH] mlx4_en: fix transmit of packages when blue frame isenabled
On Thu, Oct 06, 2011 at 03:10:28PM +0100, David Laight wrote: static void mlx4_bf_copy(unsigned long *dst, unsigned long *src, unsigned bytecnt) { + int i; + __le32 *psrc = (__le32 *)src; + + /* +* the buffer is already in big endian. For little endian machines that's +* fine. For big endain machines we must swap since the chipset swaps again +*/ + for (i = 0; i bytecnt / 4; ++i) + psrc[i] = le32_to_cpu(psrc[i]); + __iowrite64_copy(dst, src, bytecnt / 8); } That code looks horrid... 1) I'm not sure the caller expects the buffer to be corrupted. 2) It contains a lot of memory cycles. 3) It looked from the calls that this code is copying descriptors, so the transfer length is probably 1 or 2 words - so the loop is inefficient. 4) ppc doesn't have a fast byteswap instruction (very new gcc might use the byteswapping memery access for the le32_to_cpu() though), so it would be better getting the byteswap done inside __iowrite64_copy() - since that is probably requesting a byteswap anyway. OTOH I'm not at all clear about the 64bit xfers Plus, it doesn't work. :-\ After doing some more investigation, I decided to add the ARP entries manually and test for different packet sizes. Packets larger than 256 should use the non-BF path and work. Smaller packets should fail. To my surprise, the smaller packets were fine. Tried many different sizes and they all succeeded. Then, tried using broadcast and it worked too (after setting the proper sysctl). Broadcast with 0, 8 and 16 ping size, all OK. Removed the ARP entries, the ARP responses did not get to the other side. So, I did a pretty nasty hack to not use BF for ARP packets and after cleaning up the ARP table, everything seems to be working fine. What follows is my patch, just for reference. Regards, Cascardo. --- --- drivers/net/mlx4/en_tx.c.orig 2011-10-03 15:42:15.736104623 -0500 +++ drivers/net/mlx4/en_tx.c2011-10-07 17:12:38.023792483 -0500 @@ -627,6 +627,7 @@ int lso_header_size; void *fragptr; bool bounce = false; + bool is_arp = false; if (!priv-port_up) goto tx_drop; @@ -698,10 +699,11 @@ priv-port_stats.tx_chksum_offload++; } + skb_reset_mac_header(skb); + ethh = eth_hdr(skb); + is_arp = ethh ethh-h_proto == ETH_P_ARP; if (unlikely(priv-validate_loopback)) { /* Copy dst mac address to wqe */ - skb_reset_mac_header(skb); - ethh = eth_hdr(skb); if (ethh ethh-h_dest) { mac = mlx4_en_mac_to_u64(ethh-h_dest); mac_h = (u32) ((mac 0xULL) 16); @@ -790,7 +792,7 @@ if (likely(!skb_shared(skb))) skb_orphan(skb); - if (ring-bf_enabled desc_size = MAX_BF !bounce !vlan_tag) { + if (ring-bf_enabled desc_size = MAX_BF !bounce !vlan_tag !is_arp) { *(u32 *) (tx_desc-ctrl.vlan_tag) |= ring-doorbell_qpn; op_own |= htonl((bf_index 0x) 8); /* Ensure new descirptor hits memory --- ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH] mlx4_en: fix transmit of packages when blue frame is enabled
On Tue, Oct 04, 2011 at 08:02:12AM +0200, Benjamin Herrenschmidt wrote: On Mon, 2011-10-03 at 17:53 -0300, Thadeu Lima de Souza Cascardo wrote: .../... Can you also send me the output of ethtool -i? It seems that there is a problem with write combining on Power processors, we will check this issue. Yevgeny Hello, Yevgeny. You will find the output of ethtool -i below. I am copying Ben and powerpc list, in case this is an issue with Power processors. They can provide us some more insight into this. May I get some background please ? :-) I'm not aware of a specific issue with write combining but I'd need to know more about what you are doing and the code to do it to comment on whether it should work or not. Cheers, Ben. Hello, Ben. Sorry for that. I am testing mlx4_en driver on a POWER. Yevgeny has added blue frame support, that does not require writing to the device memory to indicate a new packet (the doorbell register as it is called). Well, the ring is getting full with no interrupt or packets transmitted. I simply added a write to the doorbell register and it works for me. Yevgeny says this is not the right fix, claiming there is a problem with write combining on POWER. The code uses memory barriers, so I don't know why there is any problem. I am posting the code here to show better what the situation is. Yevgeny can tell more about the device and the driver. The code below is the driver as of now, including a diff with what I changed and had resulted OK for me. Before the blue frame support, the only code executed was the else part. I can't tell much what the device should be seeing and doing after the blue frame part of the code is executed. But it does send the packet if I write to the doorbell register. Yevgeny, can you tell us what the device should be doing and why you think this is a problem on POWER? Is it possible that this is simply a problem with the firmware version? Thanks, Cascardo. --- if (ring-bf_enabled desc_size = MAX_BF !bounce !vlan_tag) { *(u32 *) (tx_desc-ctrl.vlan_tag) |= ring-doorbell_qpn; op_own |= htonl((bf_index 0x) 8); /* Ensure new descirptor hits memory * before setting ownership of this descriptor to HW */ wmb(); tx_desc-ctrl.owner_opcode = op_own; wmb(); mlx4_bf_copy(ring-bf.reg + ring-bf.offset, (unsigned long *) tx_desc-ctrl, desc_size); wmb(); ring-bf.offset ^= ring-bf.buf_size; } else { /* Ensure new descirptor hits memory * before setting ownership of this descriptor to HW */ wmb(); tx_desc-ctrl.owner_opcode = op_own; - wmb(); - writel(ring-doorbell_qpn, ring-bf.uar-map + MLX4_SEND_DOORBELL); } + wmb(); + writel(ring-doorbell_qpn, ring-bf.uar-map + MLX4_SEND_DOORBELL); + --- ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH] mlx4_en: fix transmit of packages when blue frame is enabled
On Mon, Oct 03, 2011 at 02:56:08PM +, Yevgeny Petrilin wrote: Hello, Yevgeny. We use a MT26448 (lspci -v output bellow) on a POWER7. Any other information, tests or debug patches you want me to try, just tell me. I expected this was really not the proper fix, but thought it would be better to send it than just guess. Any clues what the problem might be? Perhaps, we would have to disable blue frame for this particular device? Regards, Cascardo. --- lspci output 0006:01:00.0 Ethernet controller: Mellanox Technologies MT26448 [ConnectX EN 10GigE, PCIe 2.0 5GT/s] (rev b0) --- lspci -v output 0006:01:00.0 0200: 15b3:6750 (rev b0) Subsystem: 1014:0416 Flags: bus master, fast devsel, latency 0, IRQ 31 Memory at 3da47be0 (64-bit, non-prefetchable) [size=1M] Memory at 3da47c00 (64-bit, prefetchable) [size=32M] Expansion ROM at 3da47bf0 [disabled] [size=1M] Capabilities: [40] Power Management version 3 Capabilities: [48] Vital Product Data Capabilities: [9c] MSI-X: Enable+ Count=128 Masked- Capabilities: [60] Express Endpoint, MSI 00 Capabilities: [100] Alternative Routing-ID Interpretation (ARI) Capabilities: [148] Device Serial Number 00-02-c9-03-00-0f-50-be Kernel driver in use: mlx4_core Kernel modules: mlx4_en, mlx4_core --- Cascardo, Can you also send me the output of ethtool -i? It seems that there is a problem with write combining on Power processors, we will check this issue. Yevgeny Hello, Yevgeny. You will find the output of ethtool -i below. I am copying Ben and powerpc list, in case this is an issue with Power processors. They can provide us some more insight into this. Thanks, Cascardo. --- # ethtool -i eth10 driver: mlx4_en version: 1.5.4.1 (March 2011) firmware-version: 2.9.4130 bus-info: 0006:01:00.0 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH v2] powerpc: reserve iommu page 0
Some devices have a dma-window that starts at the address 0. This allows DMA addresses to be mapped to this address and returned to drivers as a valid DMA address. Some drivers may not behave well in this case, since the address 0 is considered an error or not allocated. The solution to avoid this kind of error from happening is reserve the page addressed as 0 so it cannot be allocated for a DMA mapping. Ben Herrenschmidt deserves the credit for this patch. He pointed out the solution and what code would do the job. v2: Add a comment in the code. Signed-off-by: Thadeu Lima de Souza Cascardo casca...@linux.vnet.ibm.com Cc: Benjamin Herrenschmidt b...@kernel.crashing.org Cc: Paul Mackerras pau...@samba.org Cc: linuxppc-dev@lists.ozlabs.org --- Resending, since it was recommended to add a comment about this particular code. --- arch/powerpc/kernel/iommu.c |8 1 files changed, 8 insertions(+), 0 deletions(-) diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c index 961bb03..8395301 100644 --- a/arch/powerpc/kernel/iommu.c +++ b/arch/powerpc/kernel/iommu.c @@ -501,6 +501,14 @@ struct iommu_table *iommu_init_table(struct iommu_table *tbl, int nid) tbl-it_map = page_address(page); memset(tbl-it_map, 0, sz); + /* +* Reserve page 0 so it will not be used for any mappings. +* This avoids buggy drivers that consider page 0 to be invalid +* to crash the machine or even lose data. + */ + if (tbl-it_offset == 0) + set_bit(0, tbl-it_map); + tbl-it_hint = 0; tbl-it_largehint = tbl-it_halfpoint; spin_lock_init(tbl-it_lock); -- 1.7.4.4 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH] powerpc: reserve iommu page 0
Some devices have a dma-window that starts at the address 0. This allows DMA addresses to be mapped to this address and returned to drivers as a valid DMA address. Some drivers may not behave well in this case, since the address 0 is considered an error or not allocated. The solution to avoid this kind of error from happening is reserve the page addressed as 0 so it cannot be allocated for a DMA mapping. Ben Herrenschmidt deserves the credit for this patch. He pointed out the solution and what code would do the job. Signed-off-by: Thadeu Lima de Souza Cascardo casca...@linux.vnet.ibm.com Cc: Benjamin Herrenschmidt b...@kernel.crashing.org Cc: Paul Mackerras pau...@samba.org Cc: linuxppc-dev@lists.ozlabs.org --- arch/powerpc/kernel/iommu.c |3 +++ 1 files changed, 3 insertions(+), 0 deletions(-) diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c index 961bb03..53411bc 100644 --- a/arch/powerpc/kernel/iommu.c +++ b/arch/powerpc/kernel/iommu.c @@ -501,6 +501,9 @@ struct iommu_table *iommu_init_table(struct iommu_table *tbl, int nid) tbl-it_map = page_address(page); memset(tbl-it_map, 0, sz); + if (tbl-it_offset == 0) + set_bit(0, tbl-it_map); + tbl-it_hint = 0; tbl-it_largehint = tbl-it_halfpoint; spin_lock_init(tbl-it_lock); -- 1.7.4.4 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH] powerpc/eeh: fix /proc/ppc64/eeh creation
Since commit 188917e183cf9ad0374b571006d0fc6d48a7f447, /proc/ppc64 is a symlink to /proc/powerpc/. That means that creating /proc/ppc64/eeh will end up with a unaccessible file, that is not listed under /proc/powerpc/ and, then, not listed under /proc/ppc64/. Creating /proc/powerpc/eeh fixes that problem and maintain the compatibility intended with the ppc64 symlink. Signed-off-by: Thadeu Lima de Souza Cascardo casca...@linux.vnet.ibm.com --- arch/powerpc/platforms/pseries/eeh.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/arch/powerpc/platforms/pseries/eeh.c b/arch/powerpc/platforms/pseries/eeh.c index ada6e07..d42f37d 100644 --- a/arch/powerpc/platforms/pseries/eeh.c +++ b/arch/powerpc/platforms/pseries/eeh.c @@ -1338,7 +1338,7 @@ static const struct file_operations proc_eeh_operations = { static int __init eeh_init_proc(void) { if (machine_is(pseries)) - proc_create(ppc64/eeh, 0, NULL, proc_eeh_operations); + proc_create(powerpc/eeh, 0, NULL, proc_eeh_operations); return 0; } __initcall(eeh_init_proc); -- 1.7.4.4 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH] trivial: fix typo s/leve/level/ in powerpc arch
Signed-off-by: Thadeu Lima de Souza Cascardo casca...@holoscopio.com --- arch/powerpc/mm/tlb_low_64e.S |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/arch/powerpc/mm/tlb_low_64e.S b/arch/powerpc/mm/tlb_low_64e.S index f288279..8b04c54 100644 --- a/arch/powerpc/mm/tlb_low_64e.S +++ b/arch/powerpc/mm/tlb_low_64e.S @@ -1,5 +1,5 @@ /* - * Low leve TLB miss handlers for Book3E + * Low level TLB miss handlers for Book3E * * Copyright (C) 2008-2009 * Ben. Herrenschmidt (b...@kernel.crashing.org), IBM Corp. -- 1.6.6 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev