Re: [PATCH] scripts/faddr2line: Fix regression in name resolution on ppc64le

2022-09-27 Thread Thadeu Lima de Souza Cascardo
On Tue, Sep 27, 2022 at 01:22:11PM +0530, Srikar Dronamraju wrote:
> Commit 1d1a0e7c5100 ("scripts/faddr2line: Fix overlapping text section 
> failures")
> can cause scripts/faddr2line to fail on ppc64le machines on few
> distributions, while working on other distributions. The failure can be
> attributed to difference in readelf output on various distributions.
> 
> $ ./scripts/faddr2line vmlinux find_busiest_group+0x00
> no match for find_busiest_group+0x00
> 
> Expected output was:
> $ ./scripts/faddr2line vmlinux find_busiest_group+0x00
> find_busiest_group+0x00/0x3d0:
> find_busiest_group at kernel/sched/fair.c:9595
> 
> On ppc64le, readelf adds localentry tag before the symbol name on few
> distributions and adds the localentry tag after the symbol name on few
> other distributions. This problem has been discussed in the reference
> URL given below. This problem can be overcome by filtering out
> localentry tags in readelf output. Similar fixes are already present in
> kernel by way of commits:
> 
> 1fd6cee127e2 ("libbpf: Fix VERSIONED_SYM_COUNT number parsing")
> aa915931ac3e ("libbpf: Fix readelf output parsing for Fedora")
> 
> Fixes: 1d1a0e7c5100 ("scripts/faddr2line: Fix overlapping text section 
> failures")
> Reference: https://lore.kernel.org/bpf/20191211160133.GB4580@calabresa/
> Cc: "Naveen N. Rao" 
> Cc: Jiri Olsa 
> Cc: Thadeu Lima de Souza Cascardo 

Reviewed-by: Thadeu Lima de Souza Cascardo 

The other instances of readelf --wide on faddr2line use --section-headers and
should not required the same mangling.

Cascardo.

> Cc: Josh Poimboeuf 
> Cc: Peter Zijlstra (Intel) 
> Cc: linuxppc-dev 
> Cc: LKML 
> Signed-off-by: Srikar Dronamraju 
> ---
>  scripts/faddr2line | 7 ---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/scripts/faddr2line b/scripts/faddr2line
> index 5514c23f45c2..0e73aca4f908 100755
> --- a/scripts/faddr2line
> +++ b/scripts/faddr2line
> @@ -74,7 +74,8 @@ command -v ${ADDR2LINE} >/dev/null 2>&1 || die 
> "${ADDR2LINE} isn't installed"
>  find_dir_prefix() {
>   local objfile=$1
>  
> - local start_kernel_addr=$(${READELF} --symbols --wide $objfile | ${AWK} 
> '$8 == "start_kernel" {printf "0x%s", $2}')
> + local start_kernel_addr=$(${READELF} --symbols --wide $objfile | sed 
> 's/\[.*\]//' |
> + ${AWK} '$8 == "start_kernel" {printf "0x%s", $2}')
>   [[ -z $start_kernel_addr ]] && return
>  
>   local file_line=$(${ADDR2LINE} -e $objfile $start_kernel_addr)
> @@ -178,7 +179,7 @@ __faddr2line() {
>   found=2
>   break
>   fi
> - done < <(${READELF} --symbols --wide $objfile | ${AWK} -v 
> sec=$sym_sec '$7 == sec' | sort --key=2)
> + done < <(${READELF} --symbols --wide $objfile | sed 
> 's/\[.*\]//' | ${AWK} -v sec=$sym_sec '$7 == sec' | sort --key=2)
>  
>   if [[ $found = 0 ]]; then
>   warn "can't find symbol: sym_name: $sym_name sym_sec: 
> $sym_sec sym_addr: $sym_addr sym_elf_size: $sym_elf_size"
> @@ -259,7 +260,7 @@ __faddr2line() {
>  
>   DONE=1
>  
> - done < <(${READELF} --symbols --wide $objfile | ${AWK} -v fn=$sym_name 
> '$4 == "FUNC" && $8 == fn')
> + done < <(${READELF} --symbols --wide $objfile | sed 's/\[.*\]//' | 
> ${AWK} -v fn=$sym_name '$4 == "FUNC" && $8 == fn')
>  }
>  
>  [[ $# -lt 2 ]] && usage
> 
> base-commit: bf682942cd26ce9cd5e87f73ae099b383041e782
> -- 
> 2.31.1
> 


[PATCH] selftests/powerpc/spectre_v2: Return skip code when miss_percent is high

2021-12-07 Thread Thadeu Lima de Souza Cascardo
A mis-match between reported and actual mitigation is not restricted to the
Vulnerable case. The guest might also report the mitigation as "Software
count cache flush" and the host will still mitigate with branch cache
disabled.

So, instead of skipping depending on the detected mitigation, simply skip
whenever the detected miss_percent is the expected one for a fully
mitigated system, that is, above 95%.

Signed-off-by: Thadeu Lima de Souza Cascardo 
---
 tools/testing/selftests/powerpc/security/spectre_v2.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/powerpc/security/spectre_v2.c 
b/tools/testing/selftests/powerpc/security/spectre_v2.c
index adc2b7294e5f..83647b8277e7 100644
--- a/tools/testing/selftests/powerpc/security/spectre_v2.c
+++ b/tools/testing/selftests/powerpc/security/spectre_v2.c
@@ -193,7 +193,7 @@ int spectre_v2_test(void)
 * We are not vulnerable and reporting otherwise, so
 * missing such a mismatch is safe.
 */
-   if (state == VULNERABLE)
+   if (miss_percent > 95)
return 4;
 
return 1;
-- 
2.32.0



coherency issue observed after hotplug on POWER8

2021-09-21 Thread Thadeu Lima de Souza Cascardo
Hi, there.

We have been investigating an issue we have observed on POWER8 POWERNV systems.
When running the kernel selftests reuseport_bpf_cpu after a CPU hotplug, we see
crashes, in different forms. [1]

I managed to get xmon on that trap, and did some debugging. [2] I tried to dump
the BPF JIT code, and it looks different when dumped from CPU#0 and CPU#0x9f
(the one that was hotplugged, offlined, then onlined).

Here is my partial analysis [3]. Basically, the BPF JIT fills a page with
invalid instructions (traps, in ppc64 case), and puts the BPF program in a
random offset of the page. In the case of the hotplugged CPU, which was the one
that compiled the program, the page had the expected contents (BPF program
started at the offset used to run the program). On the other CPU (in many
cases, CPU #0), the same memory address/page had different contents, with the
program starting at a different offset.

Is this a case of a bug in the micro-architecture or the firmware when doing
the hotplug? Can someone chime in?

Notice that we can't reproduce the same issue on a POWER9 system.

Thanks.
Cascardo.

[1] https://bugs.launchpad.net/ubuntu-kernel-tests/+bug/1927076
[2] https://bugs.launchpad.net/ubuntu-kernel-tests/+bug/1927076/comments/29
[3] https://bugs.launchpad.net/ubuntu-kernel-tests/+bug/1927076/comments/30


[PATCH] powerpc/perf: prevent mixed EBB and non-EBB events

2021-02-24 Thread Thadeu Lima de Souza Cascardo
EBB events must be under exclusive groups, so there is no mix of EBB and
non-EBB events on the same PMU. This requirement worked fine as perf core
would not allow other pinned events to be scheduled together with exclusive
events.

This assumption was broken by commit 1908dc911792 ("perf: Tweak
perf_event_attr::exclusive semantics").

After that, the test cpu_event_pinned_vs_ebb_test started succeeding after
read_events, but worse, the task would not have given access to PMC1, so
when it tried to write to it, it was killed with "illegal instruction".

Preventing mixed EBB and non-EBB events from being add to the same PMU will
just revert to the previous behavior and the test will succeed.

Fixes: 1908dc911792 (perf: Tweak perf_event_attr::exclusive semantics)
Signed-off-by: Thadeu Lima de Souza Cascardo 
---
 arch/powerpc/perf/core-book3s.c | 20 
 1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
index 43599e671d38..d767f7944f85 100644
--- a/arch/powerpc/perf/core-book3s.c
+++ b/arch/powerpc/perf/core-book3s.c
@@ -1010,9 +1010,25 @@ static int check_excludes(struct perf_event **ctrs, 
unsigned int cflags[],
  int n_prev, int n_new)
 {
int eu = 0, ek = 0, eh = 0;
+   bool ebb = false;
int i, n, first;
struct perf_event *event;
 
+   n = n_prev + n_new;
+   if (n <= 1)
+   return 0;
+
+   first = 1;
+   for (i = 0; i < n; ++i) {
+   event = ctrs[i];
+   if (first) {
+   ebb = is_ebb_event(event);
+   first = 0;
+   } else if (is_ebb_event(event) != ebb) {
+   return -EAGAIN;
+   }
+   }
+
/*
 * If the PMU we're on supports per event exclude settings then we
 * don't need to do any of this logic. NB. This assumes no PMU has both
@@ -1021,10 +1037,6 @@ static int check_excludes(struct perf_event **ctrs, 
unsigned int cflags[],
if (ppmu->flags & PPMU_ARCH_207S)
return 0;
 
-   n = n_prev + n_new;
-   if (n <= 1)
-   return 0;
-
first = 1;
for (i = 0; i < n; ++i) {
if (cflags[i] & PPMU_LIMITED_PMC_OK) {
-- 
2.27.0



Re: [PATCH] selftests/seccomp: fix ptrace tests on powerpc

2020-09-17 Thread Thadeu Lima de Souza Cascardo
On Thu, Sep 17, 2020 at 03:37:16PM -0700, Kees Cook wrote:
> On Sun, Sep 13, 2020 at 10:34:23PM +1000, Michael Ellerman wrote:
> > Thadeu Lima de Souza Cascardo  writes:
> > > On Tue, Sep 08, 2020 at 04:18:17PM -0700, Kees Cook wrote:
> > >> On Tue, Jun 30, 2020 at 01:47:39PM -0300, Thadeu Lima de Souza Cascardo 
> > >> wrote:
> > ...
> > >> > @@ -1809,10 +1818,15 @@ void tracer_ptrace(struct __test_metadata 
> > >> > *_metadata, pid_t tracee,
> > >> >EXPECT_EQ(entry ? PTRACE_EVENTMSG_SYSCALL_ENTRY
> > >> >: PTRACE_EVENTMSG_SYSCALL_EXIT, msg);
> > >> >  
> > >> > -  if (!entry)
> > >> > +  if (!entry && !syscall_nr)
> > >> >return;
> > >> >  
> > >> > -  nr = get_syscall(_metadata, tracee);
> > >> > +  if (entry)
> > >> > +  nr = get_syscall(_metadata, tracee);
> > >> > +  else
> > >> > +  nr = *syscall_nr;
> > >> 
> > >> This is weird? Shouldn't get_syscall() be modified to do the right thing
> > >> here instead of depending on the extra arg?
> > >> 
> > >
> > > R0 might be clobered. Same documentation mentions it as volatile. So, 
> > > during
> > > syscall exit, we can't tell for sure that R0 will have the original 
> > > syscall
> > > number. So, we need to grab it during syscall enter, save it somewhere and
> > > reuse it. I used the test context/args for that.
> > 
> > The user r0 (in regs->gpr[0]) shouldn't be clobbered.
> > 
> > But it is modified if the tracer skips the syscall, by setting the
> > syscall number to -1. Or if the tracer changes the syscall number.
> > 
> > So if you need the original syscall number in the exit path then I think
> > you do need to save it at entry.
> 
> ... the selftest code wants to test the updated syscall (-1 or
> whatever), so this sounds correct. Was this test actually failing on
> powerpc? (I'd really rather not split entry/exit if I don't have to.)
> 
> -- 
> Kees Cook

Yes, it started failing when the return code started being changed as well.
Though ptrace can change the syscall at entry (IIRC), it can't change the
return code. And that is part of the ABI. If ppc is changed so it allows
changing the return code during ptrace entry, some strace uses will break. So
that is not an option.

Cascardo.


[PATCH v2] selftests/seccomp: fix ptrace tests on powerpc

2020-09-11 Thread Thadeu Lima de Souza Cascardo
As pointed out by Michael Ellerman, the ptrace ABI on powerpc does not
allow or require the return code to be set on syscall entry when
skipping the syscall. It will always return ENOSYS and the return code
must be set on syscall exit.

This code does that, behaving more similarly to strace. It still sets
the return code on entry, which is overridden on powerpc, and it will
always repeat the same on exit. Also, on powerpc, the errno is not
inverted, and depends on ccr.so being set.

This has been tested on powerpc and amd64.

Cc: Michael Ellerman 
Cc: Kees Cook 
Signed-off-by: Thadeu Lima de Souza Cascardo 
---
 tools/testing/selftests/seccomp/seccomp_bpf.c | 81 ---
 1 file changed, 53 insertions(+), 28 deletions(-)

diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c 
b/tools/testing/selftests/seccomp/seccomp_bpf.c
index 7a6d40286a42..0ddc0846e9c0 100644
--- a/tools/testing/selftests/seccomp/seccomp_bpf.c
+++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
@@ -1837,15 +1837,24 @@ void change_syscall(struct __test_metadata *_metadata,
 #endif
 
/* If syscall is skipped, change return value. */
-   if (syscall == -1)
+   if (syscall == -1) {
 #ifdef SYSCALL_NUM_RET_SHARE_REG
TH_LOG("Can't modify syscall return on this architecture");
-
 #elif defined(__xtensa__)
regs.SYSCALL_RET(regs) = result;
+#elif defined(__powerpc__)
+   /* Error is signaled by CR0 SO bit and error code is positive. 
*/
+   if (result < 0) {
+   regs.SYSCALL_RET = -result;
+   regs.ccr |= 0x1000;
+   } else {
+   regs.SYSCALL_RET = result;
+   regs.ccr &= ~0x1000;
+   }
 #else
regs.SYSCALL_RET = result;
 #endif
+   }
 
 #ifdef HAVE_GETREGS
ret = ptrace(PTRACE_SETREGS, tracee, 0, );
@@ -1897,12 +1906,44 @@ void tracer_seccomp(struct __test_metadata *_metadata, 
pid_t tracee,
 
 }
 
+FIXTURE(TRACE_syscall) {
+   struct sock_fprog prog;
+   pid_t tracer, mytid, mypid, parent;
+};
+
+FIXTURE_VARIANT(TRACE_syscall) {
+   /*
+* All of the SECCOMP_RET_TRACE behaviors can be tested with either
+* SECCOMP_RET_TRACE+PTRACE_CONT or plain ptrace()+PTRACE_SYSCALL.
+* This indicates if we should use SECCOMP_RET_TRACE (false), or
+* ptrace (true).
+*/
+   bool use_ptrace;
+
+   /*
+* Some archs (like ppc) only support changing the return code during
+* syscall exit when ptrace is used.  As the syscall number might not
+* be available anymore during syscall exit, it needs to be saved
+* during syscall enter.
+*/
+   int syscall_nr;
+};
+
+FIXTURE_VARIANT_ADD(TRACE_syscall, ptrace) {
+   .use_ptrace = true,
+};
+
+FIXTURE_VARIANT_ADD(TRACE_syscall, seccomp) {
+   .use_ptrace = false,
+};
+
 void tracer_ptrace(struct __test_metadata *_metadata, pid_t tracee,
   int status, void *args)
 {
int ret, nr;
unsigned long msg;
static bool entry;
+   FIXTURE_VARIANT(TRACE_syscall) * variant = args;
 
/*
 * The traditional way to tell PTRACE_SYSCALL entry/exit
@@ -1916,10 +1957,15 @@ void tracer_ptrace(struct __test_metadata *_metadata, 
pid_t tracee,
EXPECT_EQ(entry ? PTRACE_EVENTMSG_SYSCALL_ENTRY
: PTRACE_EVENTMSG_SYSCALL_EXIT, msg);
 
-   if (!entry)
+   if (!entry && !variant)
return;
 
-   nr = get_syscall(_metadata, tracee);
+   if (entry)
+   nr = get_syscall(_metadata, tracee);
+   else if (variant)
+   nr = variant->syscall_nr;
+   if (variant)
+   variant->syscall_nr = nr;
 
if (nr == __NR_getpid)
change_syscall(_metadata, tracee, __NR_getppid, 0);
@@ -1929,29 +1975,6 @@ void tracer_ptrace(struct __test_metadata *_metadata, 
pid_t tracee,
change_syscall(_metadata, tracee, -1, -ESRCH);
 }
 
-FIXTURE(TRACE_syscall) {
-   struct sock_fprog prog;
-   pid_t tracer, mytid, mypid, parent;
-};
-
-FIXTURE_VARIANT(TRACE_syscall) {
-   /*
-* All of the SECCOMP_RET_TRACE behaviors can be tested with either
-* SECCOMP_RET_TRACE+PTRACE_CONT or plain ptrace()+PTRACE_SYSCALL.
-* This indicates if we should use SECCOMP_RET_TRACE (false), or
-* ptrace (true).
-*/
-   bool use_ptrace;
-};
-
-FIXTURE_VARIANT_ADD(TRACE_syscall, ptrace) {
-   .use_ptrace = true,
-};
-
-FIXTURE_VARIANT_ADD(TRACE_syscall, seccomp) {
-   .use_ptrace = false,
-};
-
 FIXTURE_SETUP(TRACE_syscall)
 {
struct sock_filter filter[] = {
@@ -1992,7 +2015,9 @@ FIXTURE_SETUP(TRACE_syscall)
self->tracer = setup_trace_fixture(_metadata,
   v

Re: [PATCH] selftests/seccomp: fix ptrace tests on powerpc

2020-09-11 Thread Thadeu Lima de Souza Cascardo
On Tue, Sep 08, 2020 at 04:18:17PM -0700, Kees Cook wrote:
> On Tue, Jun 30, 2020 at 01:47:39PM -0300, Thadeu Lima de Souza Cascardo wrote:
> > As pointed out by Michael Ellerman, the ptrace ABI on powerpc does not
> > allow or require the return code to be set on syscall entry when
> > skipping the syscall. It will always return ENOSYS and the return code
> > must be set on syscall exit.
> > 
> > This code does that, behaving more similarly to strace. It still sets
> > the return code on entry, which is overridden on powerpc, and it will
> > always repeat the same on exit. Also, on powerpc, the errno is not
> > inverted, and depends on ccr.so being set.
> > 
> > This has been tested on powerpc and amd64.
> > 
> > Cc: Michael Ellerman 
> > Cc: Kees Cook 
> > Signed-off-by: Thadeu Lima de Souza Cascardo 
> 
> Yikes, I missed this from a while ago. I apologize for responding so
> late!
> 
> This appears still unfixed; is that correct?
> 

Yes. I will send a v2 on top of recent changes to the test.

> > ---
> >  tools/testing/selftests/seccomp/seccomp_bpf.c | 24 +++
> >  1 file changed, 20 insertions(+), 4 deletions(-)
> > 
> > diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c 
> > b/tools/testing/selftests/seccomp/seccomp_bpf.c
> > index 252140a52553..b90a9190ba88 100644
> > --- a/tools/testing/selftests/seccomp/seccomp_bpf.c
> > +++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
> > @@ -1738,6 +1738,14 @@ void change_syscall(struct __test_metadata 
> > *_metadata,
> > TH_LOG("Can't modify syscall return on this architecture");
> >  #else
> > regs.SYSCALL_RET = result;
> > +# if defined(__powerpc__)
> > +   if (result < 0) {
> > +   regs.SYSCALL_RET = -result;
> > +   regs.ccr |= 0x1000;
> > +   } else {
> > +   regs.ccr &= ~0x1000;
> > +   }
> > +# endif
> >  #endif
> 
> Just so I understand correctly: for ppc to "see" this result, it needs
> to be both negative AND have this specific register set?
> 

Yes. According to Documentation/powerpc/syscall64-abi.rst:

"
- For the sc instruction, both a value and an error condition are returned.
  cr0.SO is the error condition, and r3 is the return value. When cr0.SO is
  clear, the syscall succeeded and r3 is the return value. When cr0.SO is set,
  the syscall failed and r3 is the error value (that normally corresponds to
  errno).
"

So, while some other arches will return -EINVAL, ppc returns EINVAL. And that
is distinguished from, say, read(2) returning 22 bytes read, by using CR.SO.

> >  
> >  #ifdef HAVE_GETREGS
> > @@ -1796,6 +1804,7 @@ void tracer_ptrace(struct __test_metadata *_metadata, 
> > pid_t tracee,
> > int ret, nr;
> > unsigned long msg;
> > static bool entry;
> > +   int *syscall_nr = args;
> >  
> > /*
> >  * The traditional way to tell PTRACE_SYSCALL entry/exit
> > @@ -1809,10 +1818,15 @@ void tracer_ptrace(struct __test_metadata 
> > *_metadata, pid_t tracee,
> > EXPECT_EQ(entry ? PTRACE_EVENTMSG_SYSCALL_ENTRY
> > : PTRACE_EVENTMSG_SYSCALL_EXIT, msg);
> >  
> > -   if (!entry)
> > +   if (!entry && !syscall_nr)
> > return;
> >  
> > -   nr = get_syscall(_metadata, tracee);
> > +   if (entry)
> > +   nr = get_syscall(_metadata, tracee);
> > +   else
> > +   nr = *syscall_nr;
> 
> This is weird? Shouldn't get_syscall() be modified to do the right thing
> here instead of depending on the extra arg?
> 

R0 might be clobered. Same documentation mentions it as volatile. So, during
syscall exit, we can't tell for sure that R0 will have the original syscall
number. So, we need to grab it during syscall enter, save it somewhere and
reuse it. I used the test context/args for that. That's the main change I had
to deal with after recent changes to the test. I used the variant struct now.

I only saw the need to do this under tracer_ptrace, as that was the only one
changing syscall return values using ptrace. And that can only be done during
syscall exit on ppc (ptrace ABI we can't break). So, changing get_syscall did
not seem necessary.

Thanks.
Cascardo.

> > +   if (syscall_nr)
> > +   *syscall_nr = nr;
> >  
> > if (nr == __NR_getpid)
> > change_syscall(_metadata, tracee, __NR_getppid, 0);
> > @@ -1889,9 +1903,10 @@ TEST_F(TRACE_syscall, ptrace_syscall_redirected)
> >  
> >  TEST_F(TRACE_syscall, ptrace_syscall_errno

Re: [PATCH] KVM: PPC: Don't return -ENOTSUPP to userspace in ioctls

2020-09-11 Thread Thadeu Lima de Souza Cascardo
On Fri, Sep 11, 2020 at 12:53:45PM +0200, Greg Kurz wrote:
> ENOTSUPP is a linux only thingy, the value of which is unknown to
> userspace, not to be confused with ENOTSUP which linux maps to
> EOPNOTSUPP, as permitted by POSIX [1]:
> 
> [EOPNOTSUPP]
> Operation not supported on socket. The type of socket (address family
> or protocol) does not support the requested operation. A conforming
> implementation may assign the same values for [EOPNOTSUPP] and [ENOTSUP].
> 
> Return -EOPNOTSUPP instead of -ENOTSUPP for the following ioctls:
> - KVM_GET_FPU for Book3s and BookE
> - KVM_SET_FPU for Book3s and BookE
> - KVM_GET_DIRTY_LOG for BookE
> 
> This doesn't affect QEMU which doesn't call the KVM_GET_FPU and
> KVM_SET_FPU ioctls on POWER anyway since they are not supported,
> and _buggily_ ignores anything but -EPERM for KVM_GET_DIRTY_LOG.
> 
> [1] https://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html
> 
> Signed-off-by: Greg Kurz 

Agreed. ENOTSUPP should never be returned to userspace.

Acked-by: Thadeu Lima de Souza Cascardo 

> ---
>  arch/powerpc/kvm/book3s.c |4 ++--
>  arch/powerpc/kvm/booke.c  |6 +++---
>  2 files changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
> index 1fce9777af1c..44bf567b6589 100644
> --- a/arch/powerpc/kvm/book3s.c
> +++ b/arch/powerpc/kvm/book3s.c
> @@ -558,12 +558,12 @@ int kvm_arch_vcpu_ioctl_set_regs(struct kvm_vcpu *vcpu, 
> struct kvm_regs *regs)
>  
>  int kvm_arch_vcpu_ioctl_get_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu)
>  {
> - return -ENOTSUPP;
> + return -EOPNOTSUPP;
>  }
>  
>  int kvm_arch_vcpu_ioctl_set_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu)
>  {
> - return -ENOTSUPP;
> + return -EOPNOTSUPP;
>  }
>  
>  int kvmppc_get_one_reg(struct kvm_vcpu *vcpu, u64 id,
> diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
> index 3e1c9f08e302..b1abcb816439 100644
> --- a/arch/powerpc/kvm/booke.c
> +++ b/arch/powerpc/kvm/booke.c
> @@ -1747,12 +1747,12 @@ int kvmppc_set_one_reg(struct kvm_vcpu *vcpu, u64 id,
>  
>  int kvm_arch_vcpu_ioctl_get_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu)
>  {
> - return -ENOTSUPP;
> + return -EOPNOTSUPP;
>  }
>  
>  int kvm_arch_vcpu_ioctl_set_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu)
>  {
> - return -ENOTSUPP;
> + return -EOPNOTSUPP;
>  }
>  
>  int kvm_arch_vcpu_ioctl_translate(struct kvm_vcpu *vcpu,
> @@ -1773,7 +1773,7 @@ void kvm_arch_sync_dirty_log(struct kvm *kvm, struct 
> kvm_memory_slot *memslot)
>  
>  int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log)
>  {
> - return -ENOTSUPP;
> + return -EOPNOTSUPP;
>  }
>  
>  void kvmppc_core_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
> 
> 


Please apply commit 0828137e8f16 ("powerpc/64s: Don't init FSCR_DSCR in __init_FSCR()") to v4.14.y, v4.19.y, v5.4.y, v5.7.y

2020-08-25 Thread Thadeu Lima de Souza Cascardo
After commit 912c0a7f2b5daa3cbb2bc10f303981e493de73bd ("powerpc/64s: Save FSCR
to init_task.thread.fscr after feature init"), which has been applied to the
referred branches, when userspace sets the user DSCR MSR, it won't be inherited
or restored during context switch, because the facility unavailable interrupt
won't trigger.

Applying 0828137e8f16721842468e33df0460044a0c588b ("powerpc/64s: Don't init
FSCR_DSCR in __init_FSCR()") will fix it.

Cascardo.


Re: [PATCH v2] powerpc/vio: drop bus_type from parent device

2020-07-30 Thread Thadeu Lima de Souza Cascardo
On Thu, Jul 30, 2020 at 07:37:16AM +0200, Greg KH wrote:
> On Thu, Jul 30, 2020 at 11:28:38AM +1000, Michael Ellerman wrote:
> > [ Added Peter & Greg to Cc ]
> > 
> > Thadeu Lima de Souza Cascardo  writes:
> > > Commit df44b479654f62b478c18ee4d8bc4e9f897a9844 ("kobject: return error
> > > code if writing /sys/.../uevent fails") started returning failure when
> > > writing to /sys/devices/vio/uevent.
> > >
> > > This causes an early udevadm trigger to fail. On some installer versions 
> > > of
> > > Ubuntu, this will cause init to exit, thus panicing the system very early
> > > during boot.
> > >
> > > Removing the bus_type from the parent device will remove some of the extra
> > > empty files from /sys/devices/vio/, but will keep the rest of the layout
> > > for vio devices, keeping them under /sys/devices/vio/.
> > 
> > What exactly does it change?
> > 
> > I'm finding it hard to evaluate if this change is going to cause a
> > regression somehow.
> > 
> > I'm also not clear on why removing the bus type is correct, apart from
> > whether it fixes the bug you're seeing.
> > 
> > > It has been tested that uevents for vio devices don't change after this
> > > fix, they still contain MODALIAS.
> > >
> > > Signed-off-by: Thadeu Lima de Souza Cascardo 
> > > Fixes: df44b479654f ("kobject: return error code if writing 
> > > /sys/.../uevent fails")
> > 
> > AFAICS there haven't been any other fixes for that commit. Do we know
> > why it is only vio that was affected? (possibly because it's a fake bus
> > to begin with?)
> 
> So there was an error previously, the core was ignoring it, and now it
> isn't and to fix that you want to remove describing what bus a device is
> on?
> 
> Huh???
> 
> > 
> > cheers
> > 
> > > diff --git a/arch/powerpc/platforms/pseries/vio.c 
> > > b/arch/powerpc/platforms/pseries/vio.c
> > > index 37f1f25ba804..a94dab3972a0 100644
> > > --- a/arch/powerpc/platforms/pseries/vio.c
> > > +++ b/arch/powerpc/platforms/pseries/vio.c
> > > @@ -36,7 +36,6 @@ static struct vio_dev vio_bus_device  = { /* fake 
> > > "parent" device */
> > >   .name = "vio",
> > >   .type = "",
> > >   .dev.init_name = "vio",
> > > - .dev.bus = _bus_type,
> > >  };
> 
> Wait, a static 'struct device'?  You all are playing with fire there.
> That's a reference counted object, and should never be declared like
> that at all.
> 
> I see you register it, but never unregister it, why?  Why is it even
> needed?
> 
> And if you remove the bus type of it, it will show up in a different
> part of sysfs, so I think this patch will show a user-visable change,
> right?
> 
> thanks,
> 
> greg k-h

As the comment says, it's a "fake parent device". There is a user-visible
change, which is removing some attributes from the object, but it's still
showing up on the same path.

Returning an error code like df44b479654f does is also a user visible change
and it breaks installer images that panic early on boot.

I could investigate an alternative here, which would be not fail when writing
to uevent for this specific fake device.

Cascardo.


[PATCH] selftests/powerpc: return skip code for spectre_v2

2020-07-28 Thread Thadeu Lima de Souza Cascardo
When running under older versions of qemu of under newer versions with old
machine types, some security features will not be reported to the guest.
This will lead the guest OS to consider itself Vulnerable to spectre_v2.

So, spectre_v2 test fails in such cases when the host is mitigated and miss
predictions cannot be detected as expected by the test.

Make it return the skip code instead, for this particular case. We don't
want to miss the case when the test fails and the system reports as
mitigated or not affected. But it is not a problem to miss failures when
the system reports as Vulnerable.

Signed-off-by: Thadeu Lima de Souza Cascardo 
---
 tools/testing/selftests/powerpc/security/spectre_v2.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/tools/testing/selftests/powerpc/security/spectre_v2.c 
b/tools/testing/selftests/powerpc/security/spectre_v2.c
index 8c6b982af2a8..d5445bfd63ed 100644
--- a/tools/testing/selftests/powerpc/security/spectre_v2.c
+++ b/tools/testing/selftests/powerpc/security/spectre_v2.c
@@ -183,6 +183,14 @@ int spectre_v2_test(void)
if (miss_percent > 15) {
printf("Branch misses > 15%% unexpected in this 
configuration!\n");
printf("Possible mis-match between reported & actual 
mitigation\n");
+   /* Such a mismatch may be caused by a guest system
+* reporting as vulnerable when the host is mitigated.
+* Return skip code to avoid detecting this as an
+* error. We are not vulnerable and reporting otherwise,
+* so missing such a mismatch is safe.
+*/
+   if (state == VULNERABLE)
+   return 4;
return 1;
}
break;
-- 
2.25.1



[PATCH] selftests/seccomp: fix ptrace tests on powerpc

2020-06-30 Thread Thadeu Lima de Souza Cascardo
As pointed out by Michael Ellerman, the ptrace ABI on powerpc does not
allow or require the return code to be set on syscall entry when
skipping the syscall. It will always return ENOSYS and the return code
must be set on syscall exit.

This code does that, behaving more similarly to strace. It still sets
the return code on entry, which is overridden on powerpc, and it will
always repeat the same on exit. Also, on powerpc, the errno is not
inverted, and depends on ccr.so being set.

This has been tested on powerpc and amd64.

Cc: Michael Ellerman 
Cc: Kees Cook 
Signed-off-by: Thadeu Lima de Souza Cascardo 
---
 tools/testing/selftests/seccomp/seccomp_bpf.c | 24 +++
 1 file changed, 20 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c 
b/tools/testing/selftests/seccomp/seccomp_bpf.c
index 252140a52553..b90a9190ba88 100644
--- a/tools/testing/selftests/seccomp/seccomp_bpf.c
+++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
@@ -1738,6 +1738,14 @@ void change_syscall(struct __test_metadata *_metadata,
TH_LOG("Can't modify syscall return on this architecture");
 #else
regs.SYSCALL_RET = result;
+# if defined(__powerpc__)
+   if (result < 0) {
+   regs.SYSCALL_RET = -result;
+   regs.ccr |= 0x1000;
+   } else {
+   regs.ccr &= ~0x1000;
+   }
+# endif
 #endif
 
 #ifdef HAVE_GETREGS
@@ -1796,6 +1804,7 @@ void tracer_ptrace(struct __test_metadata *_metadata, 
pid_t tracee,
int ret, nr;
unsigned long msg;
static bool entry;
+   int *syscall_nr = args;
 
/*
 * The traditional way to tell PTRACE_SYSCALL entry/exit
@@ -1809,10 +1818,15 @@ void tracer_ptrace(struct __test_metadata *_metadata, 
pid_t tracee,
EXPECT_EQ(entry ? PTRACE_EVENTMSG_SYSCALL_ENTRY
: PTRACE_EVENTMSG_SYSCALL_EXIT, msg);
 
-   if (!entry)
+   if (!entry && !syscall_nr)
return;
 
-   nr = get_syscall(_metadata, tracee);
+   if (entry)
+   nr = get_syscall(_metadata, tracee);
+   else
+   nr = *syscall_nr;
+   if (syscall_nr)
+   *syscall_nr = nr;
 
if (nr == __NR_getpid)
change_syscall(_metadata, tracee, __NR_getppid, 0);
@@ -1889,9 +1903,10 @@ TEST_F(TRACE_syscall, ptrace_syscall_redirected)
 
 TEST_F(TRACE_syscall, ptrace_syscall_errno)
 {
+   int syscall_nr = -1;
/* Swap SECCOMP_RET_TRACE tracer for PTRACE_SYSCALL tracer. */
teardown_trace_fixture(_metadata, self->tracer);
-   self->tracer = setup_trace_fixture(_metadata, tracer_ptrace, NULL,
+   self->tracer = setup_trace_fixture(_metadata, tracer_ptrace, 
_nr,
   true);
 
/* Tracer should skip the open syscall, resulting in ESRCH. */
@@ -1900,9 +1915,10 @@ TEST_F(TRACE_syscall, ptrace_syscall_errno)
 
 TEST_F(TRACE_syscall, ptrace_syscall_faked)
 {
+   int syscall_nr = -1;
/* Swap SECCOMP_RET_TRACE tracer for PTRACE_SYSCALL tracer. */
teardown_trace_fixture(_metadata, self->tracer);
-   self->tracer = setup_trace_fixture(_metadata, tracer_ptrace, NULL,
+   self->tracer = setup_trace_fixture(_metadata, tracer_ptrace, 
_nr,
   true);
 
/* Tracer should skip the gettid syscall, resulting fake pid. */
-- 
2.25.1



[PATCH v2] powerpc/vio: drop bus_type from parent device

2020-04-06 Thread Thadeu Lima de Souza Cascardo
Commit df44b479654f62b478c18ee4d8bc4e9f897a9844 ("kobject: return error
code if writing /sys/.../uevent fails") started returning failure when
writing to /sys/devices/vio/uevent.

This causes an early udevadm trigger to fail. On some installer versions of
Ubuntu, this will cause init to exit, thus panicing the system very early
during boot.

Removing the bus_type from the parent device will remove some of the extra
empty files from /sys/devices/vio/, but will keep the rest of the layout
for vio devices, keeping them under /sys/devices/vio/.

It has been tested that uevents for vio devices don't change after this
fix, they still contain MODALIAS.

Signed-off-by: Thadeu Lima de Souza Cascardo 
Fixes: df44b479654f ("kobject: return error code if writing /sys/.../uevent 
fails")
---
 arch/powerpc/platforms/pseries/vio.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/vio.c 
b/arch/powerpc/platforms/pseries/vio.c
index 37f1f25ba804..a94dab3972a0 100644
--- a/arch/powerpc/platforms/pseries/vio.c
+++ b/arch/powerpc/platforms/pseries/vio.c
@@ -36,7 +36,6 @@ static struct vio_dev vio_bus_device  = { /* fake "parent" 
device */
.name = "vio",
.type = "",
.dev.init_name = "vio",
-   .dev.bus = _bus_type,
 };
 
 #ifdef CONFIG_PPC_SMLPAR
-- 
2.20.1



[PATCH v2] powerpc/ptrace: Do not return ENOSYS if invalid syscall

2020-03-29 Thread Thadeu Lima de Souza Cascardo
If a tracer sets the syscall number to an invalid one, allow the return
value set by the tracer to be returned the tracee.

The test for NR_syscalls is already at entry_64.S, and it's at
do_syscall_trace_enter only to skip audit and trace.

After this, two failures from seccomp_bpf selftests complete just fine,
as the failing test was using ptrace to change the syscall to return an
error or a fake value, but were failing as it was always returning
-ENOSYS.

Signed-off-by: Thadeu Lima de Souza Cascardo 
---
 arch/powerpc/kernel/ptrace.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/ptrace.c b/arch/powerpc/kernel/ptrace.c
index 25c0424e8868..557ae4bc2331 100644
--- a/arch/powerpc/kernel/ptrace.c
+++ b/arch/powerpc/kernel/ptrace.c
@@ -3314,7 +3314,7 @@ long do_syscall_trace_enter(struct pt_regs *regs)
 
/* Avoid trace and audit when syscall is invalid. */
if (regs->gpr[0] >= NR_syscalls)
-   goto skip;
+   return regs->gpr[0];
 
if (unlikely(test_thread_flag(TIF_SYSCALL_TRACEPOINT)))
trace_sys_enter(regs, regs->gpr[0]);
-- 
2.17.1



[PATCH] libbpf: Fix readelf output parsing for Fedora

2019-12-13 Thread Thadeu Lima de Souza Cascardo
Fedora binutils has been patched to show "other info" for a symbol at the
end of the line. This was done in order to support unmaintained scripts
that would break with the extra info. [1]

[1] 
https://src.fedoraproject.org/rpms/binutils/c/b8265c46f7ddae23a792ee8306fbaaeacba83bf8

This in turn has been done to fix the build of ruby, because of checksec.
[2] Thanks Michael Ellerman for the pointer.

[2] https://bugzilla.redhat.com/show_bug.cgi?id=1479302

As libbpf Makefile is not unmaintained, we can simply deal with either
output format, by just removing the "other info" field, as it always comes
inside brackets.

Cc: Aurelien Jarno 
Fixes: 3464afdf11f9 (libbpf: Fix readelf output parsing on powerpc with recent 
binutils)
Reported-by: Justin Forbes 
Signed-off-by: Thadeu Lima de Souza Cascardo 
---
 tools/lib/bpf/Makefile | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/tools/lib/bpf/Makefile b/tools/lib/bpf/Makefile
index defae23a0169..23ae06c43d08 100644
--- a/tools/lib/bpf/Makefile
+++ b/tools/lib/bpf/Makefile
@@ -147,6 +147,7 @@ TAGS_PROG := $(if $(shell which etags 
2>/dev/null),etags,ctags)
 
 GLOBAL_SYM_COUNT = $(shell readelf -s --wide $(BPF_IN_SHARED) | \
   cut -d "@" -f1 | sed 's/_v[0-9]_[0-9]_[0-9].*//' | \
+  sed 's/\[.*\]//' | \
   awk '/GLOBAL/ && /DEFAULT/ && !/UND/ {print $$NF}' | 
\
   sort -u | wc -l)
 VERSIONED_SYM_COUNT = $(shell readelf -s --wide $(OUTPUT)libbpf.so | \
@@ -213,6 +214,7 @@ check_abi: $(OUTPUT)libbpf.so
 "versioned in $(VERSION_SCRIPT)." >&2;  \
readelf -s --wide $(BPF_IN_SHARED) | \
cut -d "@" -f1 | sed 's/_v[0-9]_[0-9]_[0-9].*//' |   \
+   sed 's/\[.*\]//' |   \
awk '/GLOBAL/ && /DEFAULT/ && !/UND/ {print $$NF}'|  \
sort -u > $(OUTPUT)libbpf_global_syms.tmp;   \
readelf -s --wide $(OUTPUT)libbpf.so |   \
-- 
2.24.0



Re: [PATCH] libbpf: fix readelf output parsing on powerpc with recent binutils

2019-12-11 Thread Thadeu Lima de Souza Cascardo
On Wed, Dec 11, 2019 at 09:33:53AM -0600, Justin Forbes wrote:
> On Tue, Dec 10, 2019 at 4:26 PM Thadeu Lima de Souza Cascardo
>  wrote:
> >
> > On Tue, Dec 10, 2019 at 12:58:33PM -0600, Justin Forbes wrote:
> > > On Mon, Dec 2, 2019 at 3:37 AM Daniel Borkmann  
> > > wrote:
> > > >
> > > > On Mon, Dec 02, 2019 at 04:53:26PM +1100, Michael Ellerman wrote:
> > > > > Aurelien Jarno  writes:
> > > > > > On powerpc with recent versions of binutils, readelf outputs an 
> > > > > > extra
> > > > > > field when dumping the symbols of an object file. For example:
> > > > > >
> > > > > > 35: 083896 FUNCLOCAL  DEFAULT 
> > > > > > [: 8] 1 btf_is_struct
> > > > > >
> > > > > > The extra "[: 8]" prevents the GLOBAL_SYM_COUNT 
> > > > > > variable to
> > > > > > be computed correctly and causes the checkabi target to fail.
> > > > > >
> > > > > > Fix that by looking for the symbol name in the last field instead 
> > > > > > of the
> > > > > > 8th one. This way it should also cope with future extra fields.
> > > > > >
> > > > > > Signed-off-by: Aurelien Jarno 
> > > > > > ---
> > > > > >  tools/lib/bpf/Makefile | 4 ++--
> > > > > >  1 file changed, 2 insertions(+), 2 deletions(-)
> > > > >
> > > > > Thanks for fixing that, it's been on my very long list of test 
> > > > > failures
> > > > > for a while.
> > > > >
> > > > > Tested-by: Michael Ellerman 
> > > >
> > > > Looks good & also continues to work on x86. Applied, thanks!
> > >
> > > This actually seems to break horribly on PPC64le with binutils 2.33.1
> > > resulting in:
> > > Warning: Num of global symbols in sharedobjs/libbpf-in.o (32) does NOT
> > > match with num of versioned symbols in libbpf.so (184). Please make
> > > sure all LIBBPF_API symbols are versioned in libbpf.map.
> > >
> > > This is the only arch that fails, with x86/arm/aarch64/s390 all
> > > building fine.  Reverting this patch allows successful build across
> > > all arches.
> > >
> > > Justin
> >
> > Well, I ended up debugging this same issue and had the same fix as Jarno's 
> > when
> > I noticed his fix was already applied.
> >
> > I just installed a system with the latest binutils, 2.33.1, and it still 
> > breaks
> > without such fix. Can you tell what is the output of the following command 
> > on
> > your system?
> >
> > readelf -s --wide tools/lib/bpf/sharedobjs/libbpf-in.o | cut -d "@" -f1 | 
> > sed 's/_v[0-9]_[0-9]_[0-9].*//' | awk '/GLOBAL/ && /DEFAULT/ && !/UND/ 
> > {print $0}'
> >
> 
> readelf -s --wide tools/lib/bpf/sharedobjs/libbpf-in.o | cut -d "@"
> -f1 | sed 's/_v[0-9]_[0-9]_[0-9].*//' | awk '/GLOBAL/ && /DEFAULT/ &&
> !/UND/ {print $0}'
>373: 000141bc  1376 FUNCGLOBAL DEFAULT1
> libbpf_num_possible_cpus [: 8]
>375: 0001869c   176 FUNCGLOBAL DEFAULT1 btf__free
> [: 8]
[...]

This is a patch on binutils carried by Fedora:

https://src.fedoraproject.org/rpms/binutils/c/b8265c46f7ddae23a792ee8306fbaaeacba83bf8

" b8265c Have readelf display extra symbol information at the end of the line. "

It has the following comment:

# FIXME:The proper fix would be to update the scripts that are expecting
#   a fixed output from readelf.  But it seems that some of them are
#   no longer being maintained.

This commit is from 2017, had it been on binutils upstream, maybe the situation
right now would be different.

Honestly, it seems the best way out is to filter the other information in the
libbpf Makefile.

Does the following patch work for you?


diff --git a/tools/lib/bpf/Makefile b/tools/lib/bpf/Makefile
index 56ce6292071b..e6f99484d7d5 100644
--- a/tools/lib/bpf/Makefile
+++ b/tools/lib/bpf/Makefile
@@ -145,6 +145,7 @@ PC_FILE := $(addprefix $(OUTPUT),$(PC_FILE))
 
 GLOBAL_SYM_COUNT = $(shell readelf -s --wide $(BPF_IN_SHARED) | \
   cut -d "@" -f1 | sed 's/_v[0-9]_[0-9]_[0-9].*//' | \
+  sed 's/\[.*\]//' | \
   awk '/GLOBAL/ && /DEFAULT/ && !/UND/ {print $$8}' | \
   sort -u | wc -l)
 VERSIONED_SYM_COUNT = $(shell readelf -s --wide $(OUTPUT)libbpf.so | \
@@ -217,6 +218,7 @@ check_abi: $(OUTPUT)libbpf.so
 "versioned in $(VERSION_SCRIPT)." >&2;  \
readelf -s --wide $(OUTPUT)libbpf-in.o | \
cut -d "@" -f1 | sed 's/_v[0-9]_[0-9]_[0-9].*//' |   \
+   sed 's/\[.*\]//' |   \
awk '/GLOBAL/ && /DEFAULT/ && !/UND/ {print $$8}'|   \
sort -u > $(OUTPUT)libbpf_global_syms.tmp;   \
readelf -s --wide $(OUTPUT)libbpf.so |   \


Re: [PATCH] libbpf: fix readelf output parsing on powerpc with recent binutils

2019-12-10 Thread Thadeu Lima de Souza Cascardo
On Tue, Dec 10, 2019 at 12:58:33PM -0600, Justin Forbes wrote:
> On Mon, Dec 2, 2019 at 3:37 AM Daniel Borkmann  wrote:
> >
> > On Mon, Dec 02, 2019 at 04:53:26PM +1100, Michael Ellerman wrote:
> > > Aurelien Jarno  writes:
> > > > On powerpc with recent versions of binutils, readelf outputs an extra
> > > > field when dumping the symbols of an object file. For example:
> > > >
> > > > 35: 083896 FUNCLOCAL  DEFAULT [: 8] 
> > > > 1 btf_is_struct
> > > >
> > > > The extra "[: 8]" prevents the GLOBAL_SYM_COUNT variable to
> > > > be computed correctly and causes the checkabi target to fail.
> > > >
> > > > Fix that by looking for the symbol name in the last field instead of the
> > > > 8th one. This way it should also cope with future extra fields.
> > > >
> > > > Signed-off-by: Aurelien Jarno 
> > > > ---
> > > >  tools/lib/bpf/Makefile | 4 ++--
> > > >  1 file changed, 2 insertions(+), 2 deletions(-)
> > >
> > > Thanks for fixing that, it's been on my very long list of test failures
> > > for a while.
> > >
> > > Tested-by: Michael Ellerman 
> >
> > Looks good & also continues to work on x86. Applied, thanks!
> 
> This actually seems to break horribly on PPC64le with binutils 2.33.1
> resulting in:
> Warning: Num of global symbols in sharedobjs/libbpf-in.o (32) does NOT
> match with num of versioned symbols in libbpf.so (184). Please make
> sure all LIBBPF_API symbols are versioned in libbpf.map.
> 
> This is the only arch that fails, with x86/arm/aarch64/s390 all
> building fine.  Reverting this patch allows successful build across
> all arches.
> 
> Justin

Well, I ended up debugging this same issue and had the same fix as Jarno's when
I noticed his fix was already applied.

I just installed a system with the latest binutils, 2.33.1, and it still breaks
without such fix. Can you tell what is the output of the following command on
your system?

readelf -s --wide tools/lib/bpf/sharedobjs/libbpf-in.o | cut -d "@" -f1 | sed 
's/_v[0-9]_[0-9]_[0-9].*//' | awk '/GLOBAL/ && /DEFAULT/ && !/UND/ {print $0}' 

Cascardo.


[PATCH] powerpc/vio: drop bus_type from parent device

2019-09-27 Thread Thadeu Lima de Souza Cascardo
Commit df44b479654f62b478c18ee4d8bc4e9f897a9844 ("kobject: return error code if
writing /sys/.../uevent fails") started returning failure when writing to
/sys/devices/vio/uevent.

This causes an early udevadm trigger to fail. On some installer versions of
Ubuntu, this will cause init to exit, thus panicing the system very early
during boot.

Removing the bus_type from the parent device will remove some of the extra
empty files from /sys/devices/vio/, but will keep the rest of the layout for
vio devices, keeping them under /sys/devices/vio/.

It has been tested that uevents for vio devices don't change after this fix,
they still contain MODALIAS.

Signed-off-by: Thadeu Lima de Souza Cascardo 
---
 arch/powerpc/platforms/pseries/vio.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/vio.c 
b/arch/powerpc/platforms/pseries/vio.c
index 6601b9d404dc..d570d5494f30 100644
--- a/arch/powerpc/platforms/pseries/vio.c
+++ b/arch/powerpc/platforms/pseries/vio.c
@@ -36,7 +36,6 @@ static struct vio_dev vio_bus_device  = { /* fake "parent" 
device */
.name = "vio",
.type = "",
.dev.init_name = "vio",
-   .dev.bus = _bus_type,
 };
 
 #ifdef CONFIG_PPC_SMLPAR
-- 
2.20.1



Re: [PATCH] powerpc/ptrace: Do not return ENOSYS if invalid syscall

2019-09-26 Thread Thadeu Lima de Souza Cascardo
On Tue, Sep 10, 2019 at 10:01:22PM -0300, Thadeu Lima de Souza Cascardo wrote:
> If a tracer sets the syscall number to an invalid one, allow the return
> value set by the tracer to be returned the tracee.
> 
> The test for NR_syscalls is already at entry_64.S, and it's at
> do_syscall_trace_enter only to skip audit and trace.
> 
> After this, seccomp_bpf selftests complete just fine, as the failing test
> was using ptrace to change the syscall to return an error or a fake value,
> but were failing as it was always returning -ENOSYS.
> 
> Signed-off-by: Thadeu Lima de Souza Cascardo 
> ---
>  arch/powerpc/kernel/ptrace.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/kernel/ptrace.c b/arch/powerpc/kernel/ptrace.c
> index 8c92febf5f44..87315335f66a 100644
> --- a/arch/powerpc/kernel/ptrace.c
> +++ b/arch/powerpc/kernel/ptrace.c
> @@ -3316,7 +3316,7 @@ long do_syscall_trace_enter(struct pt_regs *regs)
>  
>   /* Avoid trace and audit when syscall is invalid. */
>   if (regs->gpr[0] >= NR_syscalls)
> - goto skip;
> + return regs->gpr[0];
>  
>   if (unlikely(test_thread_flag(TIF_SYSCALL_TRACEPOINT)))
>   trace_sys_enter(regs, regs->gpr[0]);

Ping? Any comments on this?

Thanks.
Cascardo.


[PATCH] powerpc/ptrace: Do not return ENOSYS if invalid syscall

2019-09-10 Thread Thadeu Lima de Souza Cascardo
If a tracer sets the syscall number to an invalid one, allow the return
value set by the tracer to be returned the tracee.

The test for NR_syscalls is already at entry_64.S, and it's at
do_syscall_trace_enter only to skip audit and trace.

After this, seccomp_bpf selftests complete just fine, as the failing test
was using ptrace to change the syscall to return an error or a fake value,
but were failing as it was always returning -ENOSYS.

Signed-off-by: Thadeu Lima de Souza Cascardo 
---
 arch/powerpc/kernel/ptrace.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/ptrace.c b/arch/powerpc/kernel/ptrace.c
index 8c92febf5f44..87315335f66a 100644
--- a/arch/powerpc/kernel/ptrace.c
+++ b/arch/powerpc/kernel/ptrace.c
@@ -3316,7 +3316,7 @@ long do_syscall_trace_enter(struct pt_regs *regs)
 
/* Avoid trace and audit when syscall is invalid. */
if (regs->gpr[0] >= NR_syscalls)
-   goto skip;
+   return regs->gpr[0];
 
if (unlikely(test_thread_flag(TIF_SYSCALL_TRACEPOINT)))
trace_sys_enter(regs, regs->gpr[0]);
-- 
2.20.1



Re: [PATCH v3] ppc64 boot: Wait for boot cpu to show up if nr_cpus limit is about to hit.

2017-11-29 Thread Thadeu Lima de Souza Cascardo
Ping. Any further reviews/comments? Or any chance of this getting
applied soon?

Thanks.
Cascardo.


Re: [PATCH v3] ppc64 boot: Wait for boot cpu to show up if nr_cpus limit is about to hit.

2017-11-17 Thread Thadeu Lima de Souza Cascardo
On Thu, Nov 16, 2017 at 07:26:43PM -0200, Guilherme G. Piccoli wrote:
> On 11/06/2017 03:34 PM, Thadeu Lima de Souza Cascardo wrote:
> > From: Mahesh Salgaonkar <mah...@linux.vnet.ibm.com>
> > 
> > The kernel boot parameter 'nr_cpus=' allows one to specify number of
> > possible cpus in the system. In the normal scenario the first cpu (cpu0)
> > that shows up is the boot cpu and hence it gets covered under nr_cpus
> > limit.
> > 
> > But this assumption will be broken in kdump scenario where kdump kenrel
> 
> Minor nit: s/kenrel/kernel
> 
> > after a crash can boot up on an non-zero boot cpu. The paca structure
> > allocation depends on value of nr_cpus and is indexed using logical cpu
> > ids. This definetly will be an issue if boot cpu id > nr_cpus
> 
> Minor nit (2): s/definetly/definitely

Hi, Guilherme.

Thanks a lot for testing and reviewing the patch. I will ignore those
small typos, unless the maintainers tell me to fix them. I hope you
don't mind.

I would like to see this merged sooner rather than later.

Thanks.
Cascardo.

> > 
> > This patch modifies allocate_pacas() and smp_setup_cpu_maps() to
> > accommodate boot cpu for the case where boot_cpuid > nr_cpu_ids.
> > 
> > This change would help to reduce the memory reservation requirement for
> > kdump on ppc64.
> > 
> > Signed-off-by: Mahesh Salgaonkar <mah...@linux.vnet.ibm.com>
> > Signed-off-by: Thadeu Lima de Souza Cascardo <casca...@canonical.com>
> 
> Tested-by: Guilherme G. Piccoli <gpicc...@linux.vnet.ibm.com>
> 
> Without this patch, got a crash deterministically when booting with
> nr_cpus=1 in P8 bare-metal. The patch fixes the issue...
> 
> Thanks,
> 
> 
> Guilherme
> 
> 
> > ---
> > 
> > v3: fixup signedness or nr_cpus to match nr_cpu_ids
> >  and fix conflict due to change from %d to %u
> > 
> > Resending this as it was not applied, and I can reproduce the issue with
> > v4.14-rc8 when booting a kdump kernel after a crash that has been given
> > nr_cpus=1 as a parameter. With this patch, I can't reproduce it anymore.
> > 
> > ---
> >  arch/powerpc/include/asm/paca.h|  3 +++
> >  arch/powerpc/include/asm/smp.h |  1 +
> >  arch/powerpc/kernel/paca.c | 23 +-
> >  arch/powerpc/kernel/prom.c | 39 
> > +-
> >  arch/powerpc/kernel/setup-common.c | 25 
> >  5 files changed, 85 insertions(+), 6 deletions(-)
> > 
> > diff --git a/arch/powerpc/include/asm/paca.h 
> > b/arch/powerpc/include/asm/paca.h
> > index 04b60af027ae..ea0dbf2bbeef 100644
> > --- a/arch/powerpc/include/asm/paca.h
> > +++ b/arch/powerpc/include/asm/paca.h
> > @@ -49,6 +49,9 @@ extern unsigned int debug_smp_processor_id(void); /* from 
> > linux/smp.h */
> >  #define get_lppaca()   (get_paca()->lppaca_ptr)
> >  #define get_slb_shadow()   (get_paca()->slb_shadow_ptr)
> > 
> > +/* Maximum number of threads per core. */
> > +#defineMAX_SMT 8
> > +
> >  struct task_struct;
> > 
> >  /*
> > diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h
> > index fac963e10d39..553cd22b2ccc 100644
> > --- a/arch/powerpc/include/asm/smp.h
> > +++ b/arch/powerpc/include/asm/smp.h
> > @@ -30,6 +30,7 @@
> >  #include 
> > 
> >  extern int boot_cpuid;
> > +extern int boot_hw_cpuid;
> >  extern int spinning_secondaries;
> > 
> >  extern void cpu_die(void);
> > diff --git a/arch/powerpc/kernel/paca.c b/arch/powerpc/kernel/paca.c
> > index 2ff2b8a19f71..9c689ee4b6a3 100644
> > --- a/arch/powerpc/kernel/paca.c
> > +++ b/arch/powerpc/kernel/paca.c
> > @@ -207,6 +207,7 @@ void __init allocate_pacas(void)
> >  {
> > u64 limit;
> > int cpu;
> > +   unsigned int nr_cpus;
> > 
> > limit = ppc64_rma_size;
> > 
> > @@ -219,20 +220,32 @@ void __init allocate_pacas(void)
> > limit = min(0x1000ULL, limit);
> >  #endif
> > 
> > -   paca_size = PAGE_ALIGN(sizeof(struct paca_struct) * nr_cpu_ids);
> > +   /*
> > +* Always align up the nr_cpu_ids to SMT threads and allocate
> > +* the paca. This will help us to prepare for a situation where
> > +* boot cpu id > nr_cpus_id. We will use the last nthreads
> > +* slots (nthreads == threads per core) to accommodate a core
> > +* that contains boot cpu thread.
> > +*
> > +* Do not change nr_cpu_ids value here. Let us do that in
> &

[PATCH v3] ppc64 boot: Wait for boot cpu to show up if nr_cpus limit is about to hit.

2017-11-06 Thread Thadeu Lima de Souza Cascardo
From: Mahesh Salgaonkar <mah...@linux.vnet.ibm.com>

The kernel boot parameter 'nr_cpus=' allows one to specify number of
possible cpus in the system. In the normal scenario the first cpu (cpu0)
that shows up is the boot cpu and hence it gets covered under nr_cpus
limit.

But this assumption will be broken in kdump scenario where kdump kenrel
after a crash can boot up on an non-zero boot cpu. The paca structure
allocation depends on value of nr_cpus and is indexed using logical cpu
ids. This definetly will be an issue if boot cpu id > nr_cpus

This patch modifies allocate_pacas() and smp_setup_cpu_maps() to
accommodate boot cpu for the case where boot_cpuid > nr_cpu_ids.

This change would help to reduce the memory reservation requirement for
kdump on ppc64.

Signed-off-by: Mahesh Salgaonkar <mah...@linux.vnet.ibm.com>
Signed-off-by: Thadeu Lima de Souza Cascardo <casca...@canonical.com>
---

v3: fixup signedness or nr_cpus to match nr_cpu_ids
 and fix conflict due to change from %d to %u

Resending this as it was not applied, and I can reproduce the issue with
v4.14-rc8 when booting a kdump kernel after a crash that has been given
nr_cpus=1 as a parameter. With this patch, I can't reproduce it anymore.

---
 arch/powerpc/include/asm/paca.h|  3 +++
 arch/powerpc/include/asm/smp.h |  1 +
 arch/powerpc/kernel/paca.c | 23 +-
 arch/powerpc/kernel/prom.c | 39 +-
 arch/powerpc/kernel/setup-common.c | 25 
 5 files changed, 85 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
index 04b60af027ae..ea0dbf2bbeef 100644
--- a/arch/powerpc/include/asm/paca.h
+++ b/arch/powerpc/include/asm/paca.h
@@ -49,6 +49,9 @@ extern unsigned int debug_smp_processor_id(void); /* from 
linux/smp.h */
 #define get_lppaca()   (get_paca()->lppaca_ptr)
 #define get_slb_shadow()   (get_paca()->slb_shadow_ptr)
 
+/* Maximum number of threads per core. */
+#defineMAX_SMT 8
+
 struct task_struct;
 
 /*
diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h
index fac963e10d39..553cd22b2ccc 100644
--- a/arch/powerpc/include/asm/smp.h
+++ b/arch/powerpc/include/asm/smp.h
@@ -30,6 +30,7 @@
 #include 
 
 extern int boot_cpuid;
+extern int boot_hw_cpuid;
 extern int spinning_secondaries;
 
 extern void cpu_die(void);
diff --git a/arch/powerpc/kernel/paca.c b/arch/powerpc/kernel/paca.c
index 2ff2b8a19f71..9c689ee4b6a3 100644
--- a/arch/powerpc/kernel/paca.c
+++ b/arch/powerpc/kernel/paca.c
@@ -207,6 +207,7 @@ void __init allocate_pacas(void)
 {
u64 limit;
int cpu;
+   unsigned int nr_cpus;
 
limit = ppc64_rma_size;
 
@@ -219,20 +220,32 @@ void __init allocate_pacas(void)
limit = min(0x1000ULL, limit);
 #endif
 
-   paca_size = PAGE_ALIGN(sizeof(struct paca_struct) * nr_cpu_ids);
+   /*
+* Always align up the nr_cpu_ids to SMT threads and allocate
+* the paca. This will help us to prepare for a situation where
+* boot cpu id > nr_cpus_id. We will use the last nthreads
+* slots (nthreads == threads per core) to accommodate a core
+* that contains boot cpu thread.
+*
+* Do not change nr_cpu_ids value here. Let us do that in
+* early_init_dt_scan_cpus() where we know exact value
+* of threads per core.
+*/
+   nr_cpus = _ALIGN_UP(nr_cpu_ids, MAX_SMT);
+   paca_size = PAGE_ALIGN(sizeof(struct paca_struct) * nr_cpus);
 
paca = __va(memblock_alloc_base(paca_size, PAGE_SIZE, limit));
memset(paca, 0, paca_size);
 
printk(KERN_DEBUG "Allocated %u bytes for %u pacas at %p\n",
-   paca_size, nr_cpu_ids, paca);
+   paca_size, nr_cpus, paca);
 
-   allocate_lppacas(nr_cpu_ids, limit);
+   allocate_lppacas(nr_cpus, limit);
 
-   allocate_slb_shadows(nr_cpu_ids, limit);
+   allocate_slb_shadows(nr_cpus, limit);
 
/* Can't use for_each_*_cpu, as they aren't functional yet */
-   for (cpu = 0; cpu < nr_cpu_ids; cpu++)
+   for (cpu = 0; cpu < nr_cpus; cpu++)
initialise_paca([cpu], cpu);
 }
 
diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
index f83056297441..93837093c5cb 100644
--- a/arch/powerpc/kernel/prom.c
+++ b/arch/powerpc/kernel/prom.c
@@ -302,6 +302,29 @@ static void __init check_cpu_feature_properties(unsigned 
long node)
}
 }
 
+/*
+ * Adjust the logical id of a boot cpu to fall under nr_cpu_ids. Map it to
+ * last core slot in the allocated paca array.
+ *
+ * e.g. on SMT=8 system, kernel booted with nr_cpus=1 and boot cpu = 33,
+ * align nr_cpu_ids to MAX_SMT value 8. Allocate paca array to hold up-to
+ * MAX_SMT=8 cpus. Since boot cpu 33 is greater than nr_cpus (8), adjust
+ * its logical id so that new id becomes less than nr_cpu_ids. Make s

[PATCH v2] powerpc: make /proc/self/stack always print the current stack

2017-03-27 Thread Thadeu Lima de Souza Cascardo
For the current task, the kernel stack would only tell the last time the
process was rescheduled, if ever. Use the current stack pointer for the
current task.

Otherwise, every once in a while, the stacktrace printed when reading
/proc/self/stack would look like the process is running in userspace,
while it's not, which some may consider as a bug.

This is also consistent with some other architectures, like x86 and arm,
at least.

Signed-off-by: Thadeu Lima de Souza Cascardo <casca...@canonical.com>
---
 arch/powerpc/kernel/stacktrace.c | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/stacktrace.c b/arch/powerpc/kernel/stacktrace.c
index 6671195..d534ed9 100644
--- a/arch/powerpc/kernel/stacktrace.c
+++ b/arch/powerpc/kernel/stacktrace.c
@@ -59,7 +59,14 @@ EXPORT_SYMBOL_GPL(save_stack_trace);
 
 void save_stack_trace_tsk(struct task_struct *tsk, struct stack_trace *trace)
 {
-   save_context_stack(trace, tsk->thread.ksp, tsk, 0);
+   unsigned long sp;
+
+   if (tsk == current)
+   sp = current_stack_pointer();
+   else
+   sp = tsk->thread.ksp;
+
+   save_context_stack(trace, sp, tsk, 0);
 }
 EXPORT_SYMBOL_GPL(save_stack_trace_tsk);
 
-- 
2.9.3



[PATCH] powerpc: fix /proc/self/stack

2017-03-17 Thread Thadeu Lima de Souza Cascardo
For the current task, the kernel stack would only tell the last time the
process was rescheduled, if ever. Use the current stack pointer for the
current task.

This is also consistent with some other architectures.

Signed-off-by: Thadeu Lima de Souza Cascardo <casca...@canonical.com>
---
 arch/powerpc/kernel/stacktrace.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/stacktrace.c b/arch/powerpc/kernel/stacktrace.c
index 6671195..2446066 100644
--- a/arch/powerpc/kernel/stacktrace.c
+++ b/arch/powerpc/kernel/stacktrace.c
@@ -59,7 +59,12 @@ EXPORT_SYMBOL_GPL(save_stack_trace);
 
 void save_stack_trace_tsk(struct task_struct *tsk, struct stack_trace *trace)
 {
-   save_context_stack(trace, tsk->thread.ksp, tsk, 0);
+   unsigned long sp = tsk->thread.ksp;
+
+   if (tsk == current)
+   sp = current_stack_pointer();
+
+   save_context_stack(trace, sp, tsk, 0);
 }
 EXPORT_SYMBOL_GPL(save_stack_trace_tsk);
 
-- 
2.9.3



Re: [PATCHv2 1/7] ppc bpf/jit: Disable classic BPF JIT on ppc64le

2016-06-22 Thread Thadeu Lima de Souza Cascardo
On Wed, Jun 22, 2016 at 09:55:01PM +0530, Naveen N. Rao wrote:
> Classic BPF JIT was never ported completely to work on little endian
> powerpc. However, it can be enabled and will crash the system when used.
> As such, disable use of BPF JIT on ppc64le.

Thanks, Naveen.

Acked-by: Thadeu Lima de Souza Cascardo <casca...@redhat.com>

> 
> Cc: sta...@vger.kernel.org
> Cc: Matt Evans <m...@ozlabs.org>
> Cc: Denis Kirjanov <k...@linux-powerpc.org>
> Cc: Michael Ellerman <m...@ellerman.id.au>
> Cc: Paul Mackerras <pau...@samba.org>
> Cc: Alexei Starovoitov <a...@fb.com>
> Cc: Daniel Borkmann <dan...@iogearbox.net>
> Cc: "David S. Miller" <da...@davemloft.net>
> Cc: Ananth N Mavinakayanahalli <ana...@in.ibm.com>
> Cc: Thadeu Lima de Souza Cascardo <casca...@redhat.com>
> Reported-by: Thadeu Lima de Souza Cascardo <casca...@redhat.com>
> Signed-off-by: Naveen N. Rao <naveen.n@linux.vnet.ibm.com>
> ---
>  arch/powerpc/Kconfig | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> index 01f7464..0a9d439 100644
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -128,7 +128,7 @@ config PPC
>   select IRQ_FORCED_THREADING
>   select HAVE_RCU_TABLE_FREE if SMP
>   select HAVE_SYSCALL_TRACEPOINTS
> - select HAVE_CBPF_JIT
> + select HAVE_CBPF_JIT if CPU_BIG_ENDIAN
>   select HAVE_ARCH_JUMP_LABEL
>   select ARCH_HAVE_NMI_SAFE_CMPXCHG
>   select ARCH_HAS_GCOV_PROFILE_ALL
> -- 
> 2.8.2
> 
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH] ppc: Fix BPF JIT for ABIv2

2016-06-21 Thread Thadeu Lima de Souza Cascardo
On Tue, Jun 21, 2016 at 09:15:48PM +1000, Michael Ellerman wrote:
> On Tue, 2016-06-21 at 14:28 +0530, Naveen N. Rao wrote:
> > On 2016/06/20 03:56PM, Thadeu Lima de Souza Cascardo wrote:
> > > On Sun, Jun 19, 2016 at 11:19:14PM +0530, Naveen N. Rao wrote:
> > > > On 2016/06/17 10:00AM, Thadeu Lima de Souza Cascardo wrote:
> > > > > 
> > > > > Hi, Michael and Naveen.
> > > > > 
> > > > > I noticed independently that there is a problem with BPF JIT and 
> > > > > ABIv2, and
> > > > > worked out the patch below before I noticed Naveen's patchset and the 
> > > > > latest
> > > > > changes in ppc tree for a better way to check for ABI versions.
> > > > > 
> > > > > However, since the issue described below affect mainline and stable 
> > > > > kernels,
> > > > > would you consider applying it before merging your two patchsets, so 
> > > > > that we can
> > > > > more easily backport the fix?
> > > > 
> > > > Hi Cascardo,
> > > > Given that this has been broken on ABIv2 since forever, I didn't bother 
> > > > fixing it. But, I can see why this would be a good thing to have for 
> > > > -stable and existing distros. However, while your patch below may fix 
> > > > the crash you're seeing on ppc64le, it is not sufficient -- you'll need 
> > > > changes in bpf_jit_asm.S as well.
> > > 
> > > Hi, Naveen.
> > > 
> > > Any tips on how to exercise possible issues there? Or what changes you 
> > > think
> > > would be sufficient?
> > 
> > The calling convention is different with ABIv2 and so we'll need changes 
> > in bpf_slow_path_common() and sk_negative_common().
> 
> How big would those changes be? Do we know?
> 
> How come no one reported this was broken previously? This is the first I've
> heard of it being broken.
> 

I just heard of it less than two weeks ago, and only could investigate it last
week, when I realized mainline was also affected.

It looks like the little-endian support for classic JIT were done before the
conversion to ABIv2. And as JIT is disabled by default, no one seems to have
exercised it.

> > However, rather than enabling classic JIT for ppc64le, are we better off 
> > just disabling it?
> > 
> > --- a/arch/powerpc/Kconfig
> > +++ b/arch/powerpc/Kconfig
> > @@ -128,7 +128,7 @@ config PPC
> > select IRQ_FORCED_THREADING
> > select HAVE_RCU_TABLE_FREE if SMP
> > select HAVE_SYSCALL_TRACEPOINTS
> > -   select HAVE_CBPF_JIT
> > +   select HAVE_CBPF_JIT if CPU_BIG_ENDIAN
> > select HAVE_ARCH_JUMP_LABEL
> > select ARCH_HAVE_NMI_SAFE_CMPXCHG
> > select ARCH_HAS_GCOV_PROFILE_ALL
> > 
> > 
> > Michael,
> > Let me know your thoughts on whether you intend to take this patch or 
> > Cascardo's patch for -stable before the eBPF patches. I can redo my 
> > patches accordingly.
> 
> This patch sounds like the best option at the moment for something we can
> backport. Unless the changes to fix it are minimal.
> 
> cheers
> 

With my patch only, I can run a minimal tcpdump tcp port 22 with success. It
correctly filter packets. But as pointed out, slow paths may not be taken.

I don't have strong opinions on what to apply to stable, just that it would be
nice to have something for the crash before applying all the nice changes by
Naveen.

Cascardo.
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH] ppc: Fix BPF JIT for ABIv2

2016-06-20 Thread Thadeu Lima de Souza Cascardo
On Sun, Jun 19, 2016 at 11:19:14PM +0530, Naveen N. Rao wrote:
> On 2016/06/17 10:00AM, Thadeu Lima de Souza Cascardo wrote:
> > On Fri, Jun 17, 2016 at 10:53:21PM +1000, Michael Ellerman wrote:
> > > On Tue, 2016-07-06 at 13:32:23 UTC, "Naveen N. Rao" wrote:
> > > > diff --git a/arch/powerpc/net/bpf_jit_comp64.c 
> > > > b/arch/powerpc/net/bpf_jit_comp64.c
> > > > new file mode 100644
> > > > index 000..954ff53
> > > > --- /dev/null
> > > > +++ b/arch/powerpc/net/bpf_jit_comp64.c
> > > > @@ -0,0 +1,956 @@
> > > ...
> > > > +
> > > > +static void bpf_jit_fill_ill_insns(void *area, unsigned int size)
> > > > +{
> > > > +   int *p = area;
> > > > +
> > > > +   /* Fill whole space with trap instructions */
> > > > +   while (p < (int *)((char *)area + size))
> > > > +   *p++ = BREAKPOINT_INSTRUCTION;
> > > > +}
> > > 
> > > This breaks the build for some configs, presumably you're missing a 
> > > header:
> > > 
> > >   arch/powerpc/net/bpf_jit_comp64.c:30:10: error: 
> > > 'BREAKPOINT_INSTRUCTION' undeclared (first use in this function)
> > > 
> > > http://kisskb.ellerman.id.au/kisskb/buildresult/12720611/
> > > 
> > > cheers
> > 
> > Hi, Michael and Naveen.
> > 
> > I noticed independently that there is a problem with BPF JIT and ABIv2, and
> > worked out the patch below before I noticed Naveen's patchset and the latest
> > changes in ppc tree for a better way to check for ABI versions.
> > 
> > However, since the issue described below affect mainline and stable kernels,
> > would you consider applying it before merging your two patchsets, so that 
> > we can
> > more easily backport the fix?
> 
> Hi Cascardo,
> Given that this has been broken on ABIv2 since forever, I didn't bother 
> fixing it. But, I can see why this would be a good thing to have for 
> -stable and existing distros. However, while your patch below may fix 
> the crash you're seeing on ppc64le, it is not sufficient -- you'll need 
> changes in bpf_jit_asm.S as well.

Hi, Naveen.

Any tips on how to exercise possible issues there? Or what changes you think
would be sufficient?

I will see what I can find by myself, but would appreciate any help.

Regards.
Cascardo.

> 
> Regards,
> Naveen
> 
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH] ppc: Fix BPF JIT for ABIv2

2016-06-17 Thread Thadeu Lima de Souza Cascardo
On Fri, Jun 17, 2016 at 10:53:21PM +1000, Michael Ellerman wrote:
> On Tue, 2016-07-06 at 13:32:23 UTC, "Naveen N. Rao" wrote:
> > diff --git a/arch/powerpc/net/bpf_jit_comp64.c 
> > b/arch/powerpc/net/bpf_jit_comp64.c
> > new file mode 100644
> > index 000..954ff53
> > --- /dev/null
> > +++ b/arch/powerpc/net/bpf_jit_comp64.c
> > @@ -0,0 +1,956 @@
> ...
> > +
> > +static void bpf_jit_fill_ill_insns(void *area, unsigned int size)
> > +{
> > +   int *p = area;
> > +
> > +   /* Fill whole space with trap instructions */
> > +   while (p < (int *)((char *)area + size))
> > +   *p++ = BREAKPOINT_INSTRUCTION;
> > +}
> 
> This breaks the build for some configs, presumably you're missing a header:
> 
>   arch/powerpc/net/bpf_jit_comp64.c:30:10: error: 'BREAKPOINT_INSTRUCTION' 
> undeclared (first use in this function)
> 
> http://kisskb.ellerman.id.au/kisskb/buildresult/12720611/
> 
> cheers

Hi, Michael and Naveen.

I noticed independently that there is a problem with BPF JIT and ABIv2, and
worked out the patch below before I noticed Naveen's patchset and the latest
changes in ppc tree for a better way to check for ABI versions.

However, since the issue described below affect mainline and stable kernels,
would you consider applying it before merging your two patchsets, so that we can
more easily backport the fix?

Thanks.
Cascardo.

---
From a984dc02b6317a1d3a3c2302385adba5227be5bd Mon Sep 17 00:00:00 2001
From: Thadeu Lima de Souza Cascardo <casca...@redhat.com>
Date: Wed, 15 Jun 2016 13:22:12 -0300
Subject: [PATCH] ppc: Fix BPF JIT for ABIv2

ABIv2 used for ppc64le does not use function descriptors. Without this patch,
whenever BPF JIT is enabled, we get a crash as below.

[root@ibm-p8-kvm-05-guest-02 ~]# echo 2 > /proc/sys/net/core/bpf_jit_enable
[root@ibm-p8-kvm-05-guest-02 ~]# tcpdump -n -i eth0 tcp port 22
device eth0 entered promiscuous mode
Pass 1: shrink = 0, seen = 0x0
Pass 2: shrink = 0, seen = 0x0
flen=1 proglen=8 pass=3 image=d5bb9018 from=tcpdump pid=11387
JIT code: : 00 00 60 38 20 00 80 4e
Pass 1: shrink = 0, seen = 0x3
Pass 2: shrink = 0, seen = 0x3
flen=20 proglen=524 pass=3 image=d5bbd018 from=tcpdump pid=11387
JIT code: : a6 02 08 7c 10 00 01 f8 70 ff c1 f9 78 ff e1 f9
JIT code: 0010: e1 fe 21 f8 7c 00 e3 80 78 00 e3 81 50 78 e7 7d
JIT code: 0020: c8 00 c3 e9 00 00 a0 38 00 c0 e0 3c c6 07 e7 78
JIT code: 0030: 08 00 e7 64 54 1b e7 60 a6 03 e8 7c 0c 00 c0 38
JIT code: 0040: 21 00 80 4e b0 01 80 41 00 00 00 60 dd 86 e0 38
JIT code: 0050: 01 00 e7 3c 40 38 04 7c 9c 00 82 40 00 00 00 60
JIT code: 0060: 00 c0 e0 3c c6 07 e7 78 08 00 e7 64 70 1b e7 60
JIT code: 0070: a6 03 e8 7c 14 00 c0 38 21 00 80 4e 78 01 80 41
JIT code: 0080: 00 00 00 60 06 00 04 28 68 01 82 40 00 00 00 60
JIT code: 0090: 00 c0 e0 3c c6 07 e7 78 08 00 e7 64 54 1b e7 60
JIT code: 00a0: a6 03 e8 7c 36 00 c0 38 21 00 80 4e 48 01 80 41
JIT code: 00b0: 00 00 00 60 16 00 04 28 2c 01 82 41 00 00 00 60
JIT code: 00c0: 00 c0 e0 3c c6 07 e7 78 08 00 e7 64 54 1b e7 60
JIT code: 00d0: a6 03 e8 7c 38 00 c0 38 21 00 80 4e 18 01 80 41
JIT code: 00e0: 00 00 00 60 16 00 04 28 fc 00 82 41 00 00 00 60
JIT code: 00f0: 00 01 00 48 00 08 04 28 f8 00 82 40 00 00 00 60
JIT code: 0100: 00 c0 e0 3c c6 07 e7 78 08 00 e7 64 70 1b e7 60
JIT code: 0110: a6 03 e8 7c 17 00 c0 38 21 00 80 4e d8 00 80 41
JIT code: 0120: 00 00 00 60 06 00 04 28 c8 00 82 40 00 00 00 60
JIT code: 0130: 00 c0 e0 3c c6 07 e7 78 08 00 e7 64 54 1b e7 60
JIT code: 0140: a6 03 e8 7c 14 00 c0 38 21 00 80 4e a8 00 80 41
JIT code: 0150: 00 00 00 60 ff 1f 87 70 98 00 82 40 00 00 00 60
JIT code: 0160: 00 c0 e0 3c c6 07 e7 78 08 00 e7 64 88 1b e7 60
JIT code: 0170: a6 03 e8 7c 0e 00 c0 38 21 00 80 4e 78 00 80 41
JIT code: 0180: 00 00 00 60 00 c0 e0 3c c6 07 e7 78 08 00 e7 64
JIT code: 0190: 4c 1b e7 60 a6 03 e8 7c 0e 00 c5 38 21 00 80 4e
JIT code: 01a0: 54 00 80 41 00 00 00 60 16 00 04 28 38 00 82 41
JIT code: 01b0: 00 00 00 60 00 c0 e0 3c c6 07 e7 78 08 00 e7 64
JIT code: 01c0: 4c 1b e7 60 a6 03 e8 7c 10 00 c5 38 21 00 80 4e
JIT code: 01d0: 24 00 80 41 00 00 00 60 16 00 04 28 14 00 82 40
JIT code: 01e0: 00 00 00 60 ff ff 60 38 01 00 63 3c 08 00 00 48
JIT code: 01f0: 00 00 60 38 20 01 21 38 10 00 01 e8 a6 03 08 7c
JIT code: 0200: 70 ff c1 e9 78 ff e1 e9 20 00 80 4e
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
Oops: Exception in kernel mode, sig: 4 [#1]
SMP NR_CPUS=32 NUMA pSeries
Modules linked in: virtio_balloon nfsd ip_tables x_tables autofs4 xfs libcrc32c 
virtio_console virtio_net virtio_pci virtio_ring virtio
CPU: 1 PID: 0 Comm: swapper/1 Not t

Re: Generic IOMMU pooled allocator

2015-03-25 Thread cascardo
On Mon, Mar 23, 2015 at 10:15:08PM -0400, David Miller wrote:
 From: Benjamin Herrenschmidt b...@kernel.crashing.org
 Date: Tue, 24 Mar 2015 13:08:10 +1100
 
  For the large pool, we don't keep a hint so we don't know it's
  wrapped, in fact we purposefully don't use a hint to limit
  fragmentation on it, but then, it should be used rarely enough that
  flushing always is, I suspect, a good option.
 
 I can't think of any use case where the largepool would be hit a lot
 at all.

Well, until recently, IOMMU_PAGE_SIZE was 4KiB on Power, so every time a
driver mapped a whole 64KiB page, it would hit the largepool.

I have been suspicious for some time that after Anton's work on the
pools, the large mappings optimization would throw away the benefit of
using the 4 pools, since some drivers would always hit the largepool.

Of course, drivers that map entire pages, when not buggy, are optimized
already to avoid calling dma_map all the time. I worked on that for
mlx4_en, and I would expect that its receive side would always hit the
largepool.

So, I decided to experiment and count the number of times that
largealloc is true versus false.

On the transmit side, or when using ICMP, I didn't notice many large
allocations with qlge or cxgb4.

However, when using large TCP send/recv (I used uperf with 64KB
writes/reads), I noticed that on the transmit side, largealloc is not
used, but on the receive side, cxgb4 almost only uses largealloc, while
qlge seems to have a 1/1 usage or largealloc/non-largealloc mappings.
When turning GRO off, that ratio is closer to 1/10, meaning there is
still some fair use of largealloc in that scenario.

I confess my experiments are not complete. I would like to test a couple
of other drivers as well, including mlx4_en and bnx2x, and test with
small packet sizes. I suspected that MTU size could make a difference,
but in the case of ICMP, with MTU 9000 and payload of 8000 bytes, I
didn't notice any significant hit of largepool with either qlge or
cxgb4.

Also, we need to keep in mind that IOMMU_PAGE_SIZE is now dynamic in the
latest code, with plans on using 64KiB in some situations, Alexey or Ben
should have more details.

But I believe that on the receive side, all drivers should map entire
pages, using some allocation strategy similar to mlx4_en, in order to
avoid DMA mapping all the time. Some believe that is bad for latency,
and prefer to call something like skb_alloc for every package received,
but I haven't seen any hard numbers, and I don't know why we couldn't
make such an allocator as good as using something like the SLAB/SLUB
allocator. Maybe there is a jitter problem, since the allocator has to
go out and get some new pages and map them, once in a while. But I don't
see why this would not be a problem with SLAB/SLUB as well. Calling
dma_map is even worse with the current implementation. It's just that
some architectures do no work at all when dma_map/unmap is called.

Hope that helps consider the best strategy for the DMA space allocation
as of now.

Regards.
Cascardo.

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH v3 1/2] PCI: One more parameter to pci_set_pcie_reset_state()

2015-03-23 Thread cascardo
 will affect
other functions.

We need to either have in written that this is acceptable and that
higher layers should take care of these possibilities, if they care
about them, or we need to fix this, by making VFIO only accept a whole
group of functions to be assigned to a guest, which will allow the
problem described by Gavin to be properly handled, by using slot reset
functions, or another way of grouping devices that may be specific to
the platform.

Does it make sense?

Cascardo.

diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
index daa68a1..cd85c18 100644
--- a/arch/powerpc/kernel/eeh.c
+++ b/arch/powerpc/kernel/eeh.c
@@ -726,11 +726,14 @@ static void *eeh_restore_dev_state(void *data, 
void *userdata)
  * pcibios_set_pcie_slot_reset - Set PCI-E reset state
  * @dev: pci device struct
  * @state: reset state to enter
+ * @probe: check if the device can take the reset
  *
  * Return value:
  *   0 if success
  */
-int pcibios_set_pcie_reset_state(struct pci_dev *dev, enum 
pcie_reset_state state)
+int pcibios_set_pcie_reset_state(struct pci_dev *dev,
+  enum pcie_reset_state state,
+  int probe)
 {
  struct eeh_dev *edev = pci_dev_to_eeh_dev(dev);
  struct eeh_pe *pe = eeh_dev_to_pe(edev);
@@ -738,9 +741,12 @@ int pcibios_set_pcie_reset_state(struct pci_dev 
*dev, enum pcie_reset_state stat
  if (!pe) {
  pr_err(%s: No PE found on PCI device %s\n,
  __func__, pci_name(dev));
- return -EINVAL;
+ return -ENOTTY;
  }
 
+ if (probe)
+ return 0;
+
  switch (state) {
  case pcie_deassert_reset:
  eeh_ops-reset(pe, EEH_RESET_DEACTIVATE);
@@ -762,8 +768,8 @@ int pcibios_set_pcie_reset_state(struct pci_dev 
*dev, enum pcie_reset_state stat
  break;
  default:
  eeh_pe_state_clear(pe, EEH_PE_CFG_BLOCKED);
- return -EINVAL;
- };
+ return -ENOTTY;
+ }
 
  return 0;
 }
diff --git a/drivers/misc/cxl/pci.c b/drivers/misc/cxl/pci.c
index 1ef0164..3a87bfc 100644
--- a/drivers/misc/cxl/pci.c
+++ b/drivers/misc/cxl/pci.c
@@ -789,7 +789,7 @@ int cxl_reset(struct cxl *adapter)
  /* pcie_warm_reset requests a fundamental pci reset which 
includes a
   * PERST assert/deassert.  PERST triggers a loading of the image
   * if user or factory is selected in sysfs */
- if ((rc = pci_set_pcie_reset_state(dev, pcie_warm_reset))) {
+ if ((rc = pci_set_pcie_reset_state(dev, pcie_warm_reset, 0))) {
  dev_err(dev-dev, cxl: pcie_warm_reset failed\n);
  return rc;
  }
diff --git a/drivers/misc/genwqe/card_base.c 
b/drivers/misc/genwqe/card_base.c
index 4cf8f82..4871f69 100644
--- a/drivers/misc/genwqe/card_base.c
+++ b/drivers/misc/genwqe/card_base.c
@@ -782,17 +782,22 @@ static int genwqe_pci_fundamental_reset(struct 
pci_dev *pci_dev)
 {
  int rc;
 
+ /* Probe if the device can take the reset */
+ rc = pci_set_pcie_reset_state(pci_dev, pcie_warm_reset, 1);
+ if (rc)
+ return rc;
+
  /*
   * lock pci config space access from userspace,
   * save state and issue PCIe fundamental reset
   */
  pci_cfg_access_lock(pci_dev);
  pci_save_state(pci_dev);
- rc = pci_set_pcie_reset_state(pci_dev, pcie_warm_reset);
+ rc = pci_set_pcie_reset_state(pci_dev, pcie_warm_reset, 0);
  if (!rc) {
  /* keep PCIe reset asserted for 250ms */
  msleep(250);
- pci_set_pcie_reset_state(pci_dev, pcie_deassert_reset);
+ pci_set_pcie_reset_state(pci_dev, pcie_deassert_reset, 
0);
  /* Wait for 2s to reload flash and train the link */
  msleep(2000);
  }
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 81f06e8..8581a5f 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -1558,28 +1558,31 @@ EXPORT_SYMBOL(pci_disable_device);
  * pcibios_set_pcie_reset_state - set reset state for device dev
  * @dev: the PCIe device reset
  * @state: Reset state to enter into
- *
+ * @probe: Check if the device can take the reset
  *
  * Sets the PCIe reset state for the device. This is the default
  * implementation. Architecture implementations can override this.
  */
 int __weak pcibios_set_pcie_reset_state(struct pci_dev *dev,
- enum pcie_reset_state state)
+ enum pcie_reset_state state,
+ int

Re: [PATCH 0/4] Support registering specific reset handler

2015-02-19 Thread cascardo
On Tue, Feb 17, 2015 at 09:36:47AM +1100, Gavin Shan wrote:
 On Mon, Feb 16, 2015 at 11:14:27AM -0200, casca...@linux.vnet.ibm.com wrote:
 On Fri, Feb 13, 2015 at 03:54:55PM +1100, Gavin Shan wrote:
  VFIO PCI infrastructure depends on pci_reset_function() to do reset on
  PCI devices so that they would be in clean state when host or guest grabs
  them. Unfortunately, the function doesn't work (or not well) on some PCI
  devices that require EEH PE reset.
  
  The patchset extends the quirk for PCI device speicific reset methods to
  allow dynamically registration. With it, we can translate reset requests
  for those special PCI devcies to EEH PE reset, which is only avaialble on
  64-bits PowerPC platforms.
  
 
 Hi, Gavin.
 
 I like your approach overall. That allows us to confine these quirks to
 the platforms where they are relevant. I would make the quirks more
 specific, though, instead of doing them for all IBM and Mellanox
 devices.
 
 
 Yeah, we need have more specific vendor/device IDs for PATCH[4/4]. Could
 you please take a look on PATCH[4/4] and then suggest the specific devices
 that requries the platform-dependent reset quirk? Especially the device IDs
 for IBM/Mellanox we need put add quirks for.
 
 I wonder if we should not have some form of domain reset, where we would
 reset all the devices on the same group, and use that on vfio. Grouping
 the devices would then be made platform-dependent, as well as the reset
 method. On powernv, we would group by IOMMU group and issue a
 fundamental reset.
 
 
 I'm assuming domain reset is PE reset, which is the specific reset handler
 tries to do on PowerNV platform. The reason why we need platform specific
 reset handler is some adapters can't support function level reset methods
 (except pci_dev_specific_reset()) in __pci_dev_reset().
 

Well, in the case of Power servers, this would be the PE reset, I am not
sure what this would be on other platforms. No other platform implements
pci_set_pcie_reset_state, which I think would be one possible to
implement such a reset.

What I am saying is that we should consider doing this on all platforms
and for all adapters, because this is not specific to Power and this is
not specific to this particular set of adapters. Otherwise, one can
simply program one adapter in the guest to write to host memory,
shutdown, and since no reset will take place, the card will simply write
to host memory when the guest is finished with and IOMMU is off.

Regards.
Cascardo.

 Thanks,
 Gavin
 
 Cascardo.
 
  Gavin Shan (4):
PCI: Rename struct pci_dev_reset_methods
PCI: Introduce list for device reset methods
PCI: Allow registering reset method
powerpc/powernv: Register PCI dev specific reset handlers
  
   arch/powerpc/platforms/powernv/pci.c |  61 +++
   drivers/pci/pci.h|   3 +-
   drivers/pci/quirks.c | 139 
  ++-
   include/linux/pci.h  |   9 +++
   4 files changed, 192 insertions(+), 20 deletions(-)
  
  -- 
  1.8.3.2
  
 
 ___
 Linuxppc-dev mailing list
 Linuxppc-dev@lists.ozlabs.org
 https://lists.ozlabs.org/listinfo/linuxppc-dev

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH 0/4] Support registering specific reset handler

2015-02-16 Thread cascardo
On Fri, Feb 13, 2015 at 03:54:55PM +1100, Gavin Shan wrote:
 VFIO PCI infrastructure depends on pci_reset_function() to do reset on
 PCI devices so that they would be in clean state when host or guest grabs
 them. Unfortunately, the function doesn't work (or not well) on some PCI
 devices that require EEH PE reset.
 
 The patchset extends the quirk for PCI device speicific reset methods to
 allow dynamically registration. With it, we can translate reset requests
 for those special PCI devcies to EEH PE reset, which is only avaialble on
 64-bits PowerPC platforms.
 

Hi, Gavin.

I like your approach overall. That allows us to confine these quirks to
the platforms where they are relevant. I would make the quirks more
specific, though, instead of doing them for all IBM and Mellanox
devices.

I wonder if we should not have some form of domain reset, where we would
reset all the devices on the same group, and use that on vfio. Grouping
the devices would then be made platform-dependent, as well as the reset
method. On powernv, we would group by IOMMU group and issue a
fundamental reset.

Cascardo.

 Gavin Shan (4):
   PCI: Rename struct pci_dev_reset_methods
   PCI: Introduce list for device reset methods
   PCI: Allow registering reset method
   powerpc/powernv: Register PCI dev specific reset handlers
 
  arch/powerpc/platforms/powernv/pci.c |  61 +++
  drivers/pci/pci.h|   3 +-
  drivers/pci/quirks.c | 139 
 ++-
  include/linux/pci.h  |   9 +++
  4 files changed, 192 insertions(+), 20 deletions(-)
 
 -- 
 1.8.3.2
 

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH resend] powernv/iommu: disable IOMMU bypass with param iommu=nobypass

2015-01-22 Thread cascardo
On Thu, Jan 22, 2015 at 12:05:31PM +1100, Michael Ellerman wrote:
 On Wed, 2015-01-21 at 13:23 -0200, Thadeu Lima de Souza Cascardo wrote:
  When IOMMU bypass is enabled, a PCI device can read and write memory
  that was not mapped by the driver without causing an EEH. That might
  cause memory corruption, for example.
  
  When we disable bypass, DMA reads and writes to addresses not mapped by
  the IOMMU will cause an EEH, allowing us to debug such issues.
  
  Signed-off-by: Thadeu Lima de Souza Cascardo casca...@linux.vnet.ibm.com
  Reviewed-by: Gavin Shan gws...@linux.vnet.ibm.com
  ---
   Documentation/kernel-parameters.txt   |2 ++
   arch/powerpc/platforms/powernv/pci-ioda.c |   24 +++-
   2 files changed, 25 insertions(+), 1 deletion(-)
 
 Is this unchanged since v2?
 

Yes, it's unchanged, just rebased.

Cascardo.

 It's marked as under review in patchwork which means I'm looking at it:
 
   http://patchwork.ozlabs.org/patch/402682/
 
 
 Please check patchwork in future.
 
 cheers
 
 
 ___
 Linuxppc-dev mailing list
 Linuxppc-dev@lists.ozlabs.org
 https://lists.ozlabs.org/listinfo/linuxppc-dev

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH resend] powernv/iommu: disable IOMMU bypass with param iommu=nobypass

2015-01-21 Thread Thadeu Lima de Souza Cascardo
When IOMMU bypass is enabled, a PCI device can read and write memory
that was not mapped by the driver without causing an EEH. That might
cause memory corruption, for example.

When we disable bypass, DMA reads and writes to addresses not mapped by
the IOMMU will cause an EEH, allowing us to debug such issues.

Signed-off-by: Thadeu Lima de Souza Cascardo casca...@linux.vnet.ibm.com
Reviewed-by: Gavin Shan gws...@linux.vnet.ibm.com
---
 Documentation/kernel-parameters.txt   |2 ++
 arch/powerpc/platforms/powernv/pci-ioda.c |   24 +++-
 2 files changed, 25 insertions(+), 1 deletion(-)

diff --git a/Documentation/kernel-parameters.txt 
b/Documentation/kernel-parameters.txt
index 4df73da..7dedfe5 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1493,6 +1493,8 @@ bytes respectively. Such letter suffixes can also be 
entirely omitted.
forcesac
soft
pt  [x86, IA-64]
+   nobypass[PPC/POWERNV]
+   Disable IOMMU bypass, using IOMMU for PCI devices.
 
 
io7=[HW] IO7 for Marvel based alpha systems
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index fac88ed..f942a19 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -75,6 +75,27 @@ static void pe_level_printk(const struct pnv_ioda_pe *pe, 
const char *level,
 #define pe_info(pe, fmt, ...)  \
pe_level_printk(pe, KERN_INFO, fmt, ##__VA_ARGS__)
 
+static bool pnv_iommu_bypass_disabled __read_mostly;
+
+static int __init iommu_setup(char *str)
+{
+   if (!str)
+   return -EINVAL;
+   while (*str) {
+   if (!strncmp(str, nobypass, 8)) {
+   pnv_iommu_bypass_disabled = true;
+   pr_info(PowerNV: IOMMU bypass window disabled.\n);
+   }
+   str += strcspn(str, ,);
+   if (*str == ',')
+   str++;
+   }
+
+   return 0;
+}
+
+early_param(iommu, iommu_setup);
+
 /*
  * stdcix is only supposed to be used in hypervisor real mode as per
  * the architecture spec
@@ -1348,7 +1369,8 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb 
*phb,
pnv_ioda_setup_bus_dma(pe, pe-pbus, true);
 
/* Also create a bypass window */
-   pnv_pci_ioda2_setup_bypass_pe(phb, pe);
+   if (!pnv_iommu_bypass_disabled)
+   pnv_pci_ioda2_setup_bypass_pe(phb, pe);
return;
 fail:
if (pe-tce32_seg = 0)
-- 
1.7.10.4

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH v2] powernv/iommu: disable IOMMU bypass with param iommu=nobypass

2014-10-23 Thread Thadeu Lima de Souza Cascardo
When IOMMU bypass is enabled, a PCI device can read and write memory
that was not mapped by the driver without causing an EEH. That might
cause memory corruption, for example.

When we disable bypass, DMA reads and writes to addresses not mapped by
the IOMMU will cause an EEH, allowing us to debug such issues.

Signed-off-by: Thadeu Lima de Souza Cascardo casca...@linux.vnet.ibm.com
Reviewed-by: Gavin Shan gws...@linux.vnet.ibm.com
---
 Documentation/kernel-parameters.txt   |2 ++
 arch/powerpc/platforms/powernv/pci-ioda.c |   24 +++-
 2 files changed, 25 insertions(+), 1 deletions(-)

diff --git a/Documentation/kernel-parameters.txt 
b/Documentation/kernel-parameters.txt
index 988160a..b03322a 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1454,6 +1454,8 @@ bytes respectively. Such letter suffixes can also be 
entirely omitted.
forcesac
soft
pt  [x86, IA-64]
+   nobypass[PPC/POWERNV]
+   Disable IOMMU bypass, using IOMMU for PCI devices.
 
 
io7=[HW] IO7 for Marvel based alpha systems
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index 468a0f2..a7d2f32 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -75,6 +75,27 @@ static void pe_level_printk(const struct pnv_ioda_pe *pe, 
const char *level,
 #define pe_info(pe, fmt, ...)  \
pe_level_printk(pe, KERN_INFO, fmt, ##__VA_ARGS__)
 
+static bool pnv_iommu_bypass_disabled __read_mostly;
+
+static int __init iommu_setup(char *str)
+{
+   if (!str)
+   return -EINVAL;
+   while (*str) {
+   if (!strncmp(str, nobypass, 8)) {
+   pnv_iommu_bypass_disabled = true;
+   pr_info(PowerNV: IOMMU bypass window disabled.\n);
+   }
+   str += strcspn(str, ,);
+   if (*str == ',')
+   str++;
+   }
+
+   return 0;
+}
+
+early_param(iommu, iommu_setup);
+
 /*
  * stdcix is only supposed to be used in hypervisor real mode as per
  * the architecture spec
@@ -1243,7 +1264,8 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb 
*phb,
pnv_ioda_setup_bus_dma(pe, pe-pbus, true);
 
/* Also create a bypass window */
-   pnv_pci_ioda2_setup_bypass_pe(phb, pe);
+   if (!pnv_iommu_bypass_disabled)
+   pnv_pci_ioda2_setup_bypass_pe(phb, pe);
return;
 fail:
if (pe-tce32_seg = 0)
-- 
1.7.1

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH] powernv/iommu: disable IOMMU bypass with param iommu=nobypass

2014-10-21 Thread Thadeu Lima de Souza Cascardo
When IOMMU bypass is enabled, a PCI device can read and write memory
that was not mapped by the driver without causing an EEH. That might
cause memory corruption, for example.

When we disable bypass, DMA reads and writes to addresses not mapped by
the IOMMU will cause an EEH, allowing us to debug such issues.

Signed-off-by: Thadeu Lima de Souza Cascardo casca...@linux.vnet.ibm.com
---
 Documentation/kernel-parameters.txt   |2 ++
 arch/powerpc/platforms/powernv/pci-ioda.c |   24 +++-
 2 files changed, 25 insertions(+), 1 deletions(-)

diff --git a/Documentation/kernel-parameters.txt 
b/Documentation/kernel-parameters.txt
index 988160a..b03322a 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1454,6 +1454,8 @@ bytes respectively. Such letter suffixes can also be 
entirely omitted.
forcesac
soft
pt  [x86, IA-64]
+   nobypass[PPC/POWERNV]
+   Disable IOMMU bypass, using IOMMU for PCI devices.
 
 
io7=[HW] IO7 for Marvel based alpha systems
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index 468a0f2..58a5a27 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -75,6 +75,27 @@ static void pe_level_printk(const struct pnv_ioda_pe *pe, 
const char *level,
 #define pe_info(pe, fmt, ...)  \
pe_level_printk(pe, KERN_INFO, fmt, ##__VA_ARGS__)
 
+static int __read_mostly disable_bypass;
+
+static int __init iommu_setup(char *str)
+{
+   if (!*str)
+   return -EINVAL;
+   while (*str) {
+   if (!strncmp(str, nobypass, 8)) {
+   disable_bypass = 1;
+   pr_info(ppc iommu: disabling bypass.\n);
+   }
+   str += strcspn(str, ,);
+   if (*str == ',')
+   str++;
+   }
+
+   return 0;
+}
+
+early_param(iommu, iommu_setup);
+
 /*
  * stdcix is only supposed to be used in hypervisor real mode as per
  * the architecture spec
@@ -1243,7 +1264,8 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb 
*phb,
pnv_ioda_setup_bus_dma(pe, pe-pbus, true);
 
/* Also create a bypass window */
-   pnv_pci_ioda2_setup_bypass_pe(phb, pe);
+   if (!disable_bypass)
+   pnv_pci_ioda2_setup_bypass_pe(phb, pe);
return;
 fail:
if (pe-tce32_seg = 0)
-- 
1.7.1

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH v2] deb-pkg: Add support for powerpc little endian

2014-09-05 Thread Thadeu Lima de Souza Cascardo
On Fri, Sep 05, 2014 at 05:55:18PM +1000, Michael Neuling wrote:
 On Fri, 2014-09-05 at 09:13 +0200, Gabriel Paubert wrote:
  On Fri, Sep 05, 2014 at 03:28:47PM +1000, Michael Neuling wrote:
   The Debian powerpc little endian architecture is called ppc64le.  This
  
  Huh? ppc64le or ppc64el?
 
 ppc64el.  Commit message is wrong.  Fixed below.
 
 Mikey
 
 

What about ppc64?

Also, I sent that already a month ago. Both linuxppc-dev and Michal
Marek were on cc.

http://marc.info/?l=linux-kernelm=140744360328562w=2

Cascardo.

 From: Michael Neuling mi...@neuling.org
 
 deb-pkg: Add support for powerpc little endian
 
 The Debian powerpc little endian architecture is called ppc64el.  This
 is the default architecture used by Ubuntu for powerpc.
 
 The below checks the kernel config to see if we are compiling little
 endian and sets the Debian arch appropriately.
 
 Signed-off-by: Michael Neuling mi...@neuling.org
 
 diff --git a/scripts/package/builddeb b/scripts/package/builddeb
 index 35d5a58..6f4a1af 100644
 --- a/scripts/package/builddeb
 +++ b/scripts/package/builddeb
 @@ -37,7 +37,7 @@ create_package() {
   s390*)
   debarch=s390$(grep -q CONFIG_64BIT=y $KCONFIG_CONFIG  echo x 
 || true) ;;
   ppc*)
 - debarch=powerpc ;;
 + debarch=$(grep -q CPU_LITTLE_ENDIAN=y $KCONFIG_CONFIG  echo 
 ppc64el || echo powerpc) ;;
   parisc*)
   debarch=hppa ;;
   mips*)
 
 
 
 ___
 Linuxppc-dev mailing list
 Linuxppc-dev@lists.ozlabs.org
 https://lists.ozlabs.org/listinfo/linuxppc-dev

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH] deb-pkg: support ppc64 and ppc64el Debian archs

2014-08-07 Thread Thadeu Lima de Souza Cascardo
Signed-off-by: Thadeu Lima de Souza Cascardo casca...@linux.vnet.ibm.com
---
 scripts/package/builddeb |2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/scripts/package/builddeb b/scripts/package/builddeb
index 35d5a58..c26c28b 100644
--- a/scripts/package/builddeb
+++ b/scripts/package/builddeb
@@ -36,6 +36,8 @@ create_package() {
debarch=sparc ;;
s390*)
debarch=s390$(grep -q CONFIG_64BIT=y $KCONFIG_CONFIG  echo x 
|| true) ;;
+   ppc64*)
+   debarch=ppc64$(grep -q CPU_LITTLE_ENDIAN=y $KCONFIG_CONFIG  
echo el || true) ;;
ppc*)
debarch=powerpc ;;
parisc*)
-- 
1.7.1

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[RFC PATCH] vfio-pci: avoid deadlock between unbind and VFIO_DEVICE_RESET

2014-03-03 Thread Thadeu Lima de Souza Cascardo
When we unbind vfio-pci from a device, while running a guest, we might
have a deadlock when such a guest reboots.

Unbind takes device_lock at device_release_driver, and waits for
release_q at vfio_del_group_dev.

release_q will only be woken up when all references to vfio_device are
gone, and that includes open file descriptors, like the ones a guest
on qemu will hold.

If you try to reboot the guest, it will call VFIO_DEVICE_RESET, which
calls pci_reset_function, which now grabs the device_lock, and we are
deadlocked.

Using device_trylock allow us to handle the case when the lock is
already taken, and avoid this situation.

Signed-off-by: Thadeu Lima de Souza Cascardo casca...@linux.vnet.ibm.com
---

Not tested yet, but I would like some comments now, like would it be
better to have a pci_try_reset_function, or do trylock on
pci_reset_function itself?

---
 drivers/vfio/pci/vfio_pci.c |   14 --
 1 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 3b76dc8..d1d2242 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -513,8 +513,18 @@ static long vfio_pci_ioctl(void *device_data,
return ret;
 
} else if (cmd == VFIO_DEVICE_RESET) {
-   return vdev-reset_works ?
-   pci_reset_function(vdev-pdev) : -EINVAL;
+   struct pci_dev *pdev = vdev-pdev;
+   int ret = -EBUSY;
+   if (!vdev-reset_works)
+   return -EINVAL;
+   if (pci_cfg_access_trylock(pdev)) {
+   if (device_trylock(pdev-dev)) {
+   ret = __pci_reset_function_locked(pdev);
+   device_unlock(pdev-dev);
+   }
+   pci_cfg_access_unlock(pdev);
+   }
+   return ret;
 
} else if (cmd == VFIO_DEVICE_GET_PCI_HOT_RESET_INFO) {
struct vfio_pci_hot_reset_info hdr;
-- 
1.7.1

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [RFC PATCH] vfio-pci: avoid deadlock between unbind and VFIO_DEVICE_RESET

2014-03-03 Thread Thadeu Lima de Souza Cascardo
On Mon, Mar 03, 2014 at 08:09:22AM -0700, Alex Williamson wrote:
 On Mon, 2014-03-03 at 11:33 -0300, Thadeu Lima de Souza Cascardo wrote:
  When we unbind vfio-pci from a device, while running a guest, we might
  have a deadlock when such a guest reboots.
  
  Unbind takes device_lock at device_release_driver, and waits for
  release_q at vfio_del_group_dev.
  
  release_q will only be woken up when all references to vfio_device are
  gone, and that includes open file descriptors, like the ones a guest
  on qemu will hold.
  
  If you try to reboot the guest, it will call VFIO_DEVICE_RESET, which
  calls pci_reset_function, which now grabs the device_lock, and we are
  deadlocked.
  
  Using device_trylock allow us to handle the case when the lock is
  already taken, and avoid this situation.
  
  Signed-off-by: Thadeu Lima de Souza Cascardo casca...@linux.vnet.ibm.com
  ---
  
  Not tested yet, but I would like some comments now, like would it be
  better to have a pci_try_reset_function, or do trylock on
  pci_reset_function itself?
 
 
 We already have it:
 
 http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=61cf16d8bd38c3dc52033ea75d5b1f8368514a17
 http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=890ed578df82f5b7b5a874f9f2fa4f117305df5f
 
 Is there something insufficient about these or are you testing on and
 older kernel?  Thanks,
 
 Alex

Sorry I missed it. On the rush to report and fix it, I looked only on my
local branch. Should we backport those two patches to long term stable
3.10? I can reproduce the issue there.

Thanks.
Cascardo.

 
 
  ---
   drivers/vfio/pci/vfio_pci.c |   14 --
   1 files changed, 12 insertions(+), 2 deletions(-)
  
  diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
  index 3b76dc8..d1d2242 100644
  --- a/drivers/vfio/pci/vfio_pci.c
  +++ b/drivers/vfio/pci/vfio_pci.c
  @@ -513,8 +513,18 @@ static long vfio_pci_ioctl(void *device_data,
  return ret;
   
  } else if (cmd == VFIO_DEVICE_RESET) {
  -   return vdev-reset_works ?
  -   pci_reset_function(vdev-pdev) : -EINVAL;
  +   struct pci_dev *pdev = vdev-pdev;
  +   int ret = -EBUSY;
  +   if (!vdev-reset_works)
  +   return -EINVAL;
  +   if (pci_cfg_access_trylock(pdev)) {
  +   if (device_trylock(pdev-dev)) {
  +   ret = __pci_reset_function_locked(pdev);
  +   device_unlock(pdev-dev);
  +   }
  +   pci_cfg_access_unlock(pdev);
  +   }
  +   return ret;
   
  } else if (cmd == VFIO_DEVICE_GET_PCI_HOT_RESET_INFO) {
  struct vfio_pci_hot_reset_info hdr;
 
 
 
 ___
 Linuxppc-dev mailing list
 Linuxppc-dev@lists.ozlabs.org
 https://lists.ozlabs.org/listinfo/linuxppc-dev

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH v2] powerpc/eeh: drop taken reference to driver on eeh_rmv_device

2014-02-05 Thread Thadeu Lima de Souza Cascardo
Commit f5c57710dd62dd06f176934a8b4b8accbf00f9f8 (powerpc/eeh: Use
partial hotplug for EEH unaware drivers) introduces eeh_rmv_device,
which may grab a reference to a driver, but not release it.

That prevents a driver from being removed after it has gone through EEH
recovery.

This patch drops the reference if it was taken.

Signed-off-by: Thadeu Lima de Souza Cascardo casca...@linux.vnet.ibm.com
Acked-by: Gavin Shan sha...@linux.vnet.ibm.com
---
 arch/powerpc/kernel/eeh_driver.c |8 ++--
 1 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c
index 7bb30dc..fdc679d 100644
--- a/arch/powerpc/kernel/eeh_driver.c
+++ b/arch/powerpc/kernel/eeh_driver.c
@@ -362,9 +362,13 @@ static void *eeh_rmv_device(void *data, void *userdata)
 */
if (!dev || (dev-hdr_type  PCI_HEADER_TYPE_BRIDGE))
return NULL;
+
driver = eeh_pcid_get(dev);
-   if (driver  driver-err_handler)
-   return NULL;
+   if (driver) {
+   eeh_pcid_put(dev);
+   if (driver-err_handler)
+   return NULL;
+   }
 
/* Remove it from PCI subsystem */
pr_debug(EEH: Removing %s without EEH sensitive driver\n,
-- 
1.7.1

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH v2] powerpc/eeh: drop taken reference to driver on eeh_rmv_device

2014-02-05 Thread Thadeu Lima de Souza Cascardo
On Wed, Feb 05, 2014 at 10:43:38AM -0800, Nishanth Aravamudan wrote:
 On 05.02.2014 [16:20:45 -0200], Thadeu Lima de Souza Cascardo wrote:
  Commit f5c57710dd62dd06f176934a8b4b8accbf00f9f8 (powerpc/eeh: Use
  partial hotplug for EEH unaware drivers) introduces eeh_rmv_device,
  which may grab a reference to a driver, but not release it.
  
  That prevents a driver from being removed after it has gone through EEH
  recovery.
  
  This patch drops the reference if it was taken.
  
  Signed-off-by: Thadeu Lima de Souza Cascardo casca...@linux.vnet.ibm.com
  Acked-by: Gavin Shan sha...@linux.vnet.ibm.com
  ---
   arch/powerpc/kernel/eeh_driver.c |8 ++--
   1 files changed, 6 insertions(+), 2 deletions(-)
  
  diff --git a/arch/powerpc/kernel/eeh_driver.c 
  b/arch/powerpc/kernel/eeh_driver.c
  index 7bb30dc..fdc679d 100644
  --- a/arch/powerpc/kernel/eeh_driver.c
  +++ b/arch/powerpc/kernel/eeh_driver.c
  @@ -362,9 +362,13 @@ static void *eeh_rmv_device(void *data, void *userdata)
   */
  if (!dev || (dev-hdr_type  PCI_HEADER_TYPE_BRIDGE))
  return NULL;
  +
 
 This appears to be unnecessary whitespace change?
 
 -Nish
 

Hi, Nish.

I originally add it there for readability, giving both more evidence to
the code below, where a driver reference is get and put, and to the code
above and its respective comment. Leaving those together could give the
impression the comment also applies to the code below.

But I have no strong feelings about that.

Ben, do you want me to send a new version?

Regards.
Cascardo.

  driver = eeh_pcid_get(dev);
  -   if (driver  driver-err_handler)
  -   return NULL;
  +   if (driver) {
  +   eeh_pcid_put(dev);
  +   if (driver-err_handler)
  +   return NULL;
  +   }
  
  /* Remove it from PCI subsystem */
  pr_debug(EEH: Removing %s without EEH sensitive driver\n,
  -- 
  1.7.1
  
  ___
  Linuxppc-dev mailing list
  Linuxppc-dev@lists.ozlabs.org
  https://lists.ozlabs.org/listinfo/linuxppc-dev
  
 
 ___
 Linuxppc-dev mailing list
 Linuxppc-dev@lists.ozlabs.org
 https://lists.ozlabs.org/listinfo/linuxppc-dev
 

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: Oops on shutdown : 3.11.0-5-powerpc-e500mc Ubuntu

2014-02-04 Thread Thadeu Lima de Souza Cascardo
On Tue, Feb 04, 2014 at 08:47:14AM -0600, John Donnelly wrote:
 Hi,
  Where is a appropriate place to file a bug for this : This is the second
 occurrence I have seen.
 

If this is a Ubuntu kernel, probably report it at Canonical's Launchpad.
If it is provided by Freescale, report it to Freescale.
If you can run a mainline kernel (best now would be 3.14-rc1) and
reproduce that, then best place would be linux-...@vger.kernel.org.

Hope that helps.
Cascardo.

 
 
 saucy)root@(none):/work/svy/tools#  * Stopping MD monitoring service mdadm
 --monitor
  [ OK ]
  * Asking all remaining processes to terminate...[
 OK ]
  * All processes ended within 1 seconds...   [
 OK ]
 [65426.692405] Oops: Exception in kernel mode, sig: 11 [#1]
 [65426.697721] SMP NR_CPUS=8 P4080 DS
 [65426.701122] Modules linked in: btrfs(F) raid6_pq(F) zlib_deflate(F)
 xor(F) ufs(F) qnx4(F) hfsplus(F) hfs(F) minix(F) ntfs(F) msdos(F) jfs(F)
 xfs(F) libcrc32c(F) reiserfs(F) nls_iso8859_1(F) nls_cp437(F) vfat(F)
 fat(F) dm_multipath(F) scsi_dh(F) usb_storage(F) ext2(F) rpcsec_gss_krb5(F)
 nfsv4(F) nfsd(F) auth_rpcgss(F) nfs_acl(F) nfs(F) lockd(F) sunrpc(F)
 fscache(F) ofpart(F) redboot(F) cmdlinepart(F) cfi_cmdset_0002(F)
 cfi_probe(F) cfi_util(F) gen_probe(F) physmap_of(F) map_funcs(F) chipreg(F)
 mtd(F) lm90(F) at24(F) caam(F) i2c_mpc(F) mpc85xx_edac(F) edac_core(F)
 uio_pdrv_genirq(F) uio(F) ata_generic(F) ahci(F) libahci(F) pata_jmicron(F)
 fsl_pq_mdio(F)
 [65426.759065] CPU: 6 PID: 19006 Comm: umount.nfs Tainted: GF
  3.11.0-5-powerpc-e500mc #8-Ubuntu
 [65426.768461] task: ea0f3900 ti: ea05c000 task.ti: ea05c000
 [65426.773858] NIP: f99a0f50 LR: f998250c CTR: c07f85c0
 [65426.778820] REGS: ea05dda0 TRAP: 0700   Tainted: GF
 (3.11.0-5-powerpc-e500mc)
 [65426.786998] MSR: 00029002 CE,EE,ME  CR: 24000422  XER: 2000
 [65426.793107]
 GPR00: f9982d20 ea05de50 ea0f3900 e7e66ee0 00029002 0001 
 
 GPR08: eadc2504   0001 c07f85c0 1002e3ac 1001
 8000
 GPR16: 100112e4 1000f3c4 1000f3e0 1000f594 1000f59c 1000f440 c0b4fb6c
 1001
 GPR24: 1001  c0089180 ea05de84 f99ab064 ea2bf500 c0b41380
 ea54ef00
 [65426.822949] NIP [f99a0f50] rpc_remove_client_dir+0x0/0x30 [sunrpc]
 [65426.829143] LR [f998250c] __rpc_clnt_remove_pipedir+0x5c/0x80 [sunrpc]
 [65426.835668] Call Trace:
 [65426.838128] [ea05de50] [f999f9f0] rpc_get_sb_net+0x70/0xa0 [sunrpc]
 (unreliable)
 [65426.845542] [ea05de60] [f9982d20] rpc_free_client+0x50/0xa0 [sunrpc]
 [65426.851910] [ea05de70] [f9982c58] rpc_shutdown_client+0x68/0xe0 [sunrpc]
 [65426.858647] [ea05dec0] [f9a3fd04] nfs_free_server+0x124/0x190 [nfs]
 [65426.864920] [ea05dee0] [c01c5f34] deactivate_locked_super+0x74/0xa0
 [65426.871193] [ea05def0] [c01e9184] SyS_umount+0x94/0x3d0
 [65426.876422] [ea05df40] [c0010604] ret_from_syscall+0x0/0x3c
 [65426.882002] --- Exception: c01 at 0xfec55dc
 [65426.882002] LR = 0xff88d54
 [65426.889225] Instruction dump:
 [65426.892191] 1b50 aa32 0706 0120 aa36 0706
 1af8 aa3a
 [65426.899964] 0706 1aa0 aa46 0704 0058 aa4a
 0704 1b74
 [65426.907949] udlfb: /dev/fb0 FB_BLANK mode 1 -- 0
 [65426.912666] udlfb: set_par mode 1024x768
 [65426.927196] ---[ end trace b25b3039b99e0391 ]---
 [65426.931816]
 umount.nfs: /work: not mounted
 [65426.960296] init: idmapd main process (2394) killed by TERM signal
 [65426.983998] init: systemd-logind main process (544) killed by TERM signal
 [65426.994070] Oops: Exception in kernel mode, sig: 11 [#2]
 [65426.999386] SMP NR_CPUS=8 P4080 DS
 [65427.002787] Modules linked in: btrfs(F) raid6_pq(F) zlib_deflate(F)
 xor(F) ufs(F) qnx4(F) hfsplus(F) hfs(F) minix(F) ntfs(F) msdos(F) jfs(F)
 xfs(F) libcrc32c(F) reiserfs(F) nls_iso8859_1(F) nls_cp437(F) vfat(F)
 fat(F) dm_multipath(F) scsi_dh(F) usb_storage(F) ext2(F) rpcsec_gss_krb5(F)
 nfsv4(F) nfsd(F) auth_rpcgss(F) nfs_acl(F) nfs(F) lockd(F) sunrpc(F)
 fscache(F) ofpart(F) redboot(F) cmdlinepart(F) cfi_cmdset_0002(F)
 cfi_probe(F) cfi_util(F) gen_probe(F) physmap_of(F) map_funcs(F) chipreg(F)
 mtd(F) lm90(F) at24(F) caam(F) i2c_mpc(F) mpc85xx_edac(F) edac_core(F)
 uio_pdrv_genirq(F) uio(F) ata_generic(F) ahci(F) libahci(F) pata_jmicron(F)
 fsl_pq_mdio(F)
 [65427.060734] CPU: 7 PID: 19014 Comm: umount Tainted: GF D
  3.11.0-5-powerpc-e500mc #8-Ubuntu
 [65427.069783] task: e5d55580 ti: ea2f task.ti: ea2f
 [65427.075180] NIP: f99a0270 LR: c01c5f34 CTR: f99a0270
 [65427.080142] REGS: ea2f1e30 TRAP: 0700   Tainted: GF D
 (3.11.0-5-powerpc-e500mc)
 [65427.088320] MSR: 00029002 CE,EE,ME  CR: 24004424  XER: 
 [65427.094430]
 GPR00: c01c5f24 ea2f1ee0 e5d55580 ea1e7800 c1d404e0 ea027a80 c01e7b78
 0001
 GPR08:  f99a0270  000d 003fe0f9 100192e0 1001
 8000
 GPR16: 100112e4 1000f3c4 1000f3e0 1000f594 1000f59c ea027aa0 c0b4fb6c
 ea2f1ef8
 GPR24:  ea027a80  ea027a80 ea027aa0

Re: [PATCH] powerpc/eeh: drop taken reference to driver on eeh_rmv_device

2014-01-31 Thread Thadeu Lima de Souza Cascardo
On Fri, Jan 31, 2014 at 08:46:11AM +0800, Gavin Shan wrote:
 On Thu, Jan 30, 2014 at 11:00:48AM -0200, Thadeu Lima de Souza Cascardo wrote:
 Commit f5c57710dd62dd06f176934a8b4b8accbf00f9f8 (powerpc/eeh: Use
 partial hotplug for EEH unaware drivers) introduces eeh_rmv_device,
 which may grab a reference to a driver, but not release it.
 
 That prevents a driver from being removed after it has gone through EEH
 recovery.
 
 This patch drops the reference in either exit path if it was taken.
 
 Signed-off-by: Thadeu Lima de Souza Cascardo casca...@linux.vnet.ibm.com
 ---
  arch/powerpc/kernel/eeh_driver.c |5 -
  1 files changed, 4 insertions(+), 1 deletions(-)
 
 diff --git a/arch/powerpc/kernel/eeh_driver.c 
 b/arch/powerpc/kernel/eeh_driver.c
 index 7bb30dc..afe7337 100644
 --- a/arch/powerpc/kernel/eeh_driver.c
 +++ b/arch/powerpc/kernel/eeh_driver.c
 @@ -364,7 +364,7 @@ static void *eeh_rmv_device(void *data, void *userdata)
  return NULL;
  driver = eeh_pcid_get(dev);
  if (driver  driver-err_handler)
 -return NULL;
 +goto out;
 
  /* Remove it from PCI subsystem */
  pr_debug(EEH: Removing %s without EEH sensitive driver\n,
 @@ -377,6 +377,9 @@ static void *eeh_rmv_device(void *data, void *userdata)
 
 For normal case (driver without EEH support), we probably release the 
 reference
 to the driver before pci_stop_and_remove_bus_device().

You are right, we need to call it before we call
pci_stop_and_remove_bus_device, otherwise dev-driver will be NULL, and
eeh_pcid_put will not do module_put. On the other hand, we could change
the call to eeh_pcid_put to accept struct pci_driver instead.

 
  pci_stop_and_remove_bus_device(dev);
  pci_unlock_rescan_remove();
 
 +out:
 +if (driver)
 +eeh_pcid_put(dev);
  return NULL;
 
 We needn't if (driver) here as eeh_pcid_put() already had the check.
 

What if try_module_get returned false on eeh_pcid_get?

How about something like the patch below?

  }
 
 
 Thanks,
 Gavin
---
diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c
index 7bb30dc..3a397fa 100644
--- a/arch/powerpc/kernel/eeh_driver.c
+++ b/arch/powerpc/kernel/eeh_driver.c
@@ -352,6 +352,7 @@ static void *eeh_rmv_device(void *data, void *userdata)
struct eeh_dev *edev = (struct eeh_dev *)data;
struct pci_dev *dev = eeh_dev_to_pci_dev(edev);
int *removed = (int *)userdata;
+   bool has_err_handler;
 
/*
 * Actually, we should remove the PCI bridges as well.
@@ -362,8 +363,12 @@ static void *eeh_rmv_device(void *data, void *userdata)
 */
if (!dev || (dev-hdr_type  PCI_HEADER_TYPE_BRIDGE))
return NULL;
+
driver = eeh_pcid_get(dev);
-   if (driver  driver-err_handler)
+   has_err_handler = driver  driver-err_handler;
+   if (driver)
+   eeh_pcid_put(dev);
+   if (has_err_handler)
return NULL;
 
/* Remove it from PCI subsystem */
---

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


[PATCH] powerpc/eeh: drop taken reference to driver on eeh_rmv_device

2014-01-30 Thread Thadeu Lima de Souza Cascardo
Commit f5c57710dd62dd06f176934a8b4b8accbf00f9f8 (powerpc/eeh: Use
partial hotplug for EEH unaware drivers) introduces eeh_rmv_device,
which may grab a reference to a driver, but not release it.

That prevents a driver from being removed after it has gone through EEH
recovery.

This patch drops the reference in either exit path if it was taken.

Signed-off-by: Thadeu Lima de Souza Cascardo casca...@linux.vnet.ibm.com
---
 arch/powerpc/kernel/eeh_driver.c |5 -
 1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c
index 7bb30dc..afe7337 100644
--- a/arch/powerpc/kernel/eeh_driver.c
+++ b/arch/powerpc/kernel/eeh_driver.c
@@ -364,7 +364,7 @@ static void *eeh_rmv_device(void *data, void *userdata)
return NULL;
driver = eeh_pcid_get(dev);
if (driver  driver-err_handler)
-   return NULL;
+   goto out;
 
/* Remove it from PCI subsystem */
pr_debug(EEH: Removing %s without EEH sensitive driver\n,
@@ -377,6 +377,9 @@ static void *eeh_rmv_device(void *data, void *userdata)
pci_stop_and_remove_bus_device(dev);
pci_unlock_rescan_remove();
 
+out:
+   if (driver)
+   eeh_pcid_put(dev);
return NULL;
 }
 
-- 
1.7.1

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


[PATCH] powernv: fix VFIO support with PHB3

2013-12-09 Thread Thadeu Lima de Souza Cascardo
I have recently found out that no iommu_groups could be found under
/sys/ on a P8. That prevents PCI passthrough from working.

During my investigation, I found out there seems to be a missing
iommu_register_group for PHB3. The following patch seems to fix the
problem. After applying it, I see iommu_groups under
/sys/kernel/iommu_groups/, and can also bind vfio-pci to an adapter,
which gives me a device at /dev/vfio/.

Signed-off-by: Thadeu Lima de Souza Cascardo casca...@linux.vnet.ibm.com
---

This is now applied on top of benh's tree, branch next.

Alexey, is this now OK for you?

Thanks.
Cascardo.

---
 arch/powerpc/platforms/powernv/pci-ioda.c |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index 614356c..f0e6871 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -720,6 +720,7 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
tbl-it_type = TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE;
}
iommu_init_table(tbl, phb-hose-node);
+   iommu_register_group(tbl, pci_domain_nr(pe-pbus), pe-pe_number);
 
if (pe-pdev)
set_iommu_table_base_and_group(pe-pdev-dev, tbl);
-- 
1.7.1

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH] powernv: fix VFIO support with PHB3

2013-12-07 Thread Thadeu Lima de Souza Cascardo
On Sat, Dec 07, 2013 at 11:58:44AM +1100, Alexey Kardashevskiy wrote:
 On 12/06/2013 11:21 PM, Thadeu Lima de Souza Cascardo wrote:
  I have recently found out that no iommu_groups could be found under
  /sys/ on a P8. That prevents PCI passthrough from working.
  
  During my investigation, I found out there seems to be a missing
  iommu_register_group for PHB3. The following patch seems to fix the
  problem. After applying it, I see iommu_groups under
  /sys/kernel/iommu_groups/, and can also bind vfio-pci to an adapter,
  which gives me a device at /dev/vfio/.
  
  Signed-off-by: Thadeu Lima de Souza Cascardo casca...@linux.vnet.ibm.com
  ---
   arch/powerpc/platforms/powernv/pci-ioda.c |1 +
   1 files changed, 1 insertions(+), 0 deletions(-)
  
  diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
  b/arch/powerpc/platforms/powernv/pci-ioda.c
  index 084cdfa..2c6d173 100644
  --- a/arch/powerpc/platforms/powernv/pci-ioda.c
  +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
  @@ -720,6 +720,7 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb 
  *phb,
  tbl-it_type = TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE;
  }
  iommu_init_table(tbl, phb-hose-node);
  +   iommu_register_group(tbl, pci_domain_nr(pe-pbus), pe-pe_number);
   
  if (pe-pdev)
  set_iommu_table_base(pe-pdev-dev, tbl);
 
 
 This does not seem absolutely right - normally set_iommu_table_base() is
 replaced with set_iommu_table_base_and_group() or you will not see some
 devices in a group and this may make VFIO unhappy.

Alexey,

Your patch [PATCH v9] PPC: POWERNV: move iommu_add_device earlier is
not upstream yet. It calls set_iommu_table_base_and_group properly.
However, without calling iommu_register_group, there is no group to add
the device to.

Compare to pnv_pci_ioda_setup_dma_pe which calls iommu_register_group
too. My patch fixes this discrepancy between IODA and IODA2.

Regards.
Cascardo.

 
 But - if every single device gets assigned to some group, then we can push
 this to Frobisher and let people test in on power8.
 
 
 
 -- 
 Alexey

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


[PATCH] powernv: fix VFIO support with PHB3

2013-12-06 Thread Thadeu Lima de Souza Cascardo
I have recently found out that no iommu_groups could be found under
/sys/ on a P8. That prevents PCI passthrough from working.

During my investigation, I found out there seems to be a missing
iommu_register_group for PHB3. The following patch seems to fix the
problem. After applying it, I see iommu_groups under
/sys/kernel/iommu_groups/, and can also bind vfio-pci to an adapter,
which gives me a device at /dev/vfio/.

Signed-off-by: Thadeu Lima de Souza Cascardo casca...@linux.vnet.ibm.com
---
 arch/powerpc/platforms/powernv/pci-ioda.c |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index 084cdfa..2c6d173 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -720,6 +720,7 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
tbl-it_type = TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE;
}
iommu_init_table(tbl, phb-hose-node);
+   iommu_register_group(tbl, pci_domain_nr(pe-pbus), pe-pe_number);
 
if (pe-pdev)
set_iommu_table_base(pe-pdev-dev, tbl);
-- 
1.7.1

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH] powerpc/iommu: use GFP_KERNEL instead of GFP_ATOMIC in iommu_init_table()

2013-10-02 Thread Thadeu Lima de Souza Cascardo
On Tue, Oct 01, 2013 at 02:04:53PM -0700, Nishanth Aravamudan wrote:
 Under heavy (DLPAR?) stress, we tripped this panic() in
 arch/powerpc/kernel/iommu.c::iommu_init_table():
 
   page = alloc_pages_node(nid, GFP_ATOMIC, get_order(sz));
   if (!page)
   panic(iommu_init_table: Can't allocate %ld bytes\n,
 sz);
 
 Before the panic() we got a page allocation failure for an order-2
 allocation. There appears to be memory free, but perhaps not in the
 ATOMIC context. I looked through all the call-sites of
 iommu_init_table() and didn't see any obvious reason to need an ATOMIC
 allocation. Most call-sites in fact have an explicit GFP_KERNEL
 allocation shortly before the call to iommu_init_table(), indicating we
 are not in an atomic context. There is some indirection for some paths,
 but I didn't see any locks indicating that GFP_KERNEL is inappropriate.
 
 With this change under the same conditions, we have not been able to
 reproduce the panic.
 
 Signed-off-by: Nishanth Aravamudan n...@us.ibm.com
 
 diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
 index 0adab06..572bb5b 100644
 --- a/arch/powerpc/kernel/iommu.c
 +++ b/arch/powerpc/kernel/iommu.c
 @@ -661,7 +661,7 @@ struct iommu_table *iommu_init_table(struct iommu_table 
 *tbl, int nid)
   /* number of bytes needed for the bitmap */
   sz = BITS_TO_LONGS(tbl-it_size) * sizeof(unsigned long);
 
 - page = alloc_pages_node(nid, GFP_ATOMIC, get_order(sz));
 + page = alloc_pages_node(nid, GFP_KERNEL, get_order(sz));
   if (!page)
   panic(iommu_init_table: Can't allocate %ld bytes\n, sz);
   tbl-it_map = page_address(page);

I didn't respond to the previous message, but also checked if there were
any history on the logs, and found this was as it is from the start. I
also found no other reasons why it needs to be atomic. Therefore,

Acked-by: Thadeu Lima de Souza Cascardo casca...@linux.vnet.ibm.com

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


[PATCH] KVM: PPC: Book3S PR: return appropriate error when allocation fails

2013-07-17 Thread Thadeu Lima de Souza Cascardo
err was overwritten by a previous function call, and checked to be 0. If
the following page allocation fails, 0 is going to be returned instead
of -ENOMEM.

Signed-off-by: Thadeu Lima de Souza Cascardo casca...@linux.vnet.ibm.com
---
 arch/powerpc/kvm/book3s_pr.c |5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_pr.c b/arch/powerpc/kvm/book3s_pr.c
index 19498a5..c6e13d9 100644
--- a/arch/powerpc/kvm/book3s_pr.c
+++ b/arch/powerpc/kvm/book3s_pr.c
@@ -1047,11 +1047,12 @@ struct kvm_vcpu *kvmppc_core_vcpu_create(struct kvm 
*kvm, unsigned int id)
if (err)
goto free_shadow_vcpu;
 
+   err = -ENOMEM;
p = __get_free_page(GFP_KERNEL|__GFP_ZERO);
-   /* the real shared page fills the last 4k of our page */
-   vcpu-arch.shared = (void*)(p + PAGE_SIZE - 4096);
if (!p)
goto uninit_vcpu;
+   /* the real shared page fills the last 4k of our page */
+   vcpu-arch.shared = (void *)(p + PAGE_SIZE - 4096);
 
 #ifdef CONFIG_PPC_BOOK3S_64
/* default to book3s_64 (970fx) */
-- 
1.7.1

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


[PATCH] eeh: add eeh_dev to the cache during boot

2013-06-27 Thread Thadeu Lima de Souza Cascardo
commit f8f7d63fd96ead101415a1302035137a866f8998 (powerpc/eeh: Trace eeh
device from I/O cache) broke EEH on pseries for devices that were
present during boot and have not been hotplugged/DLPARed.

eeh_check_failure will get the eeh_dev from the cache, and will get
NULL. eeh_addr_cache_build adds the addresses to the cache, but eeh_dev
for the giving pci_device is not set yet. Just reordering the call to
eeh_addr_cache_insert_dev works fine. The ordering is similar to the one
in eeh_add_device_late.

Signed-off-by: Thadeu Lima de Souza Cascardo casca...@linux.vnet.ibm.com
---

Note that this is broken since v3.7-rc1 and this patch applies on top of
v3.10-rc7. Just changing the patches will apply it on top of
next-20130626, where it has moved to arch/powerpc/kernel/.

I also realized when writing the log that maybe calling
eeh_add_device_late instead could be a better option. Please, share your
comments.

---
 arch/powerpc/platforms/pseries/eeh_cache.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/eeh_cache.c 
b/arch/powerpc/platforms/pseries/eeh_cache.c
index 5a4c879..5ce3ba7 100644
--- a/arch/powerpc/platforms/pseries/eeh_cache.c
+++ b/arch/powerpc/platforms/pseries/eeh_cache.c
@@ -294,8 +294,6 @@ void __init eeh_addr_cache_build(void)
spin_lock_init(pci_io_addr_cache_root.piar_lock);
 
for_each_pci_dev(dev) {
-   eeh_addr_cache_insert_dev(dev);
-
dn = pci_device_to_OF_node(dev);
if (!dn)
continue;
@@ -308,6 +306,8 @@ void __init eeh_addr_cache_build(void)
dev-dev.archdata.edev = edev;
edev-pdev = dev;
 
+   eeh_addr_cache_insert_dev(dev);
+
eeh_sysfs_add_device(dev);
}
 
-- 
1.7.1

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 2/2] ppc64: Add arch-specific pcie_get_speed_cap_mask

2013-03-14 Thread Thadeu Lima de Souza Cascardo
On Thu, Mar 14, 2013 at 02:45:47PM -0300, luca...@linux.vnet.ibm.com wrote:
 From: Lucas Kannebley Tavares luca...@linux.vnet.ibm.com
 
 Betters support for gen2 speed detections on PCI buses on ppc64
 architectures.
 
 Signed-off-by: Lucas Kannebley Tavares luca...@linux.vnet.ibm.com
 ---
  arch/powerpc/platforms/pseries/pci.c |   32 
  1 files changed, 32 insertions(+), 0 deletions(-)
 
 diff --git a/arch/powerpc/platforms/pseries/pci.c 
 b/arch/powerpc/platforms/pseries/pci.c
 index 0b580f4..58469fe 100644
 --- a/arch/powerpc/platforms/pseries/pci.c
 +++ b/arch/powerpc/platforms/pseries/pci.c
 @@ -24,6 +24,7 @@
  #include linux/kernel.h
  #include linux/pci.h
  #include linux/string.h
 +#include linux/device.h
 
  #include asm/eeh.h
  #include asm/pci-bridge.h
 @@ -108,3 +109,34 @@ static void fixup_winbond_82c105(struct pci_dev* dev)
  }
  DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_WINBOND, PCI_DEVICE_ID_WINBOND_82C105,
fixup_winbond_82c105);
 +
 +int pcibios_get_speed_cap_mask(struct pci_dev *dev, u32 *mask)
 +{
 + struct device_node *dn, *pdn;
 + const uint32_t *pcie_link_speed_stats = NULL;
 +
 + *mask = 0;
 + dn = pci_bus_to_OF_node(dev-bus);
 +
 + /* Find nearest ibm,pcie-link-speed-stats, walking up the device tree */
 + for (pdn = dn; pdn != NULL; pdn = pdn-parent) {
 + pcie_link_speed_stats = (const uint32_t *) of_get_property(pdn,
 + ibm,pcie-link-speed-stats, NULL);
 + if (pcie_link_speed_stats != NULL)
 + break;
 + }
 +
 + if (pcie_link_speed_stats == NULL) {
 + dev_info(dev-dev, no ibm,pcie-link-speed-stats property\n);
 + return -EINVAL;
 + }
 +
 + switch (pcie_link_speed_stats[0]) {
 + case 0x02:
 + *mask |= PCIE_SPEED_50;
 + case 0x01:
 + *mask |= PCIE_SPEED_25;
 + }

I recall seeing this returns 0x00 as well. Maybe you should include both
0x00 and a default case and return EINVAL.

Regards.
Cascardo.

 +
 + return 0;
 +}
 -- 
 1.7.4.4
 

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH v2] ppc/iommu: use find_first_bit to look up entries in the iommu table

2013-01-30 Thread Thadeu Lima de Souza Cascardo
On Tue, Jan 29, 2013 at 11:35:56AM +1100, Benjamin Herrenschmidt wrote:
 On Thu, 2013-01-10 at 17:33 -0200, Thadeu Lima de Souza Cascardo wrote:
  Signed-off-by: Thadeu Lima de Souza Cascardo casca...@linux.vnet.ibm.com
  ---
  v2:
  Remove the unneeded extra variable i, which caused build failure.
 
 I believe something equivalent is already in -next, can you dbl check ?
 
 Cheers,
 Ben.
 

There is, and it's using bitmap_empty, which is even more clear.

Thanks.
Cascardo.

  ---
   arch/powerpc/kernel/iommu.c |9 ++---
   1 files changed, 2 insertions(+), 7 deletions(-)
  
  diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
  index 6d48ff8..0fc44d2 100644
  --- a/arch/powerpc/kernel/iommu.c
  +++ b/arch/powerpc/kernel/iommu.c
  @@ -708,7 +708,7 @@ struct iommu_table *iommu_init_table(struct iommu_table 
  *tbl, int nid)
   
   void iommu_free_table(struct iommu_table *tbl, const char *node_name)
   {
  -   unsigned long bitmap_sz, i;
  +   unsigned long bitmap_sz;
  unsigned int order;
   
  if (!tbl || !tbl-it_map) {
  @@ -725,14 +725,9 @@ void iommu_free_table(struct iommu_table *tbl, const 
  char *node_name)
  clear_bit(0, tbl-it_map);
   
  /* verify that table contains no entries */
  -   /* it_size is in entries, and we're examining 64 at a time */
  -   for (i = 0; i  (tbl-it_size/64); i++) {
  -   if (tbl-it_map[i] != 0) {
  +   if (find_first_bit(tbl-it_map, tbl-it_size)  tbl-it_size)
  printk(KERN_WARNING %s: Unexpected TCEs for %s\n,
  __func__, node_name);
  -   break;
  -   }
  -   }
   
  /* calculate bitmap size in bytes */
  bitmap_sz = (tbl-it_size + 7) / 8;
 
 

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


[PATCH v2] ppc/iommu: use find_first_bit to look up entries in the iommu table

2013-01-10 Thread Thadeu Lima de Souza Cascardo
Signed-off-by: Thadeu Lima de Souza Cascardo casca...@linux.vnet.ibm.com
---
v2:
Remove the unneeded extra variable i, which caused build failure.
---
 arch/powerpc/kernel/iommu.c |9 ++---
 1 files changed, 2 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 6d48ff8..0fc44d2 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -708,7 +708,7 @@ struct iommu_table *iommu_init_table(struct iommu_table 
*tbl, int nid)
 
 void iommu_free_table(struct iommu_table *tbl, const char *node_name)
 {
-   unsigned long bitmap_sz, i;
+   unsigned long bitmap_sz;
unsigned int order;
 
if (!tbl || !tbl-it_map) {
@@ -725,14 +725,9 @@ void iommu_free_table(struct iommu_table *tbl, const char 
*node_name)
clear_bit(0, tbl-it_map);
 
/* verify that table contains no entries */
-   /* it_size is in entries, and we're examining 64 at a time */
-   for (i = 0; i  (tbl-it_size/64); i++) {
-   if (tbl-it_map[i] != 0) {
+   if (find_first_bit(tbl-it_map, tbl-it_size)  tbl-it_size)
printk(KERN_WARNING %s: Unexpected TCEs for %s\n,
__func__, node_name);
-   break;
-   }
-   }
 
/* calculate bitmap size in bytes */
bitmap_sz = (tbl-it_size + 7) / 8;
-- 
1.7.1

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH] ppc/iommu: prevent false TCE leak message

2012-12-28 Thread Thadeu Lima de Souza Cascardo
On Fri, Dec 28, 2012 at 01:21:35PM +0800, Gavin Shan wrote:
 On Thu, Dec 27, 2012 at 02:28:06PM -0200, Thadeu Lima de Souza Cascardo wrote:
 When a device DMA window includes the address 0, it's reserved in the
 TCE bitmap to avoid returning that address to drivers.
 
 When the device is removed, the bitmap is checked for any mappings not
 removed by the driver, indicating a possible DMA mapping leak. Since the
 reserved address is not cleared, a message is printed, warning of such a
 leak.
 
 Check for the reservation, and clear it before checking for any other
 standing mappings.
 
 Signed-off-by: Thadeu Lima de Souza Cascardo casca...@linux.vnet.ibm.com
 ---
  arch/powerpc/kernel/iommu.c |5 +
  1 files changed, 5 insertions(+), 0 deletions(-)
 
 diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
 index 8226c6c..226e9e5 100644
 --- a/arch/powerpc/kernel/iommu.c
 +++ b/arch/powerpc/kernel/iommu.c
 @@ -717,6 +717,11 @@ void iommu_free_table(struct iommu_table *tbl, const 
 char *node_name)
  return;
  }
 
 +/* In case we have reserved the first bit, we should not emit
 + * the warning below. */
 
 At least, the comment would look like:
 
   /*
* xxx
*/
 

Sure. New code should always follow coding style.  :-)

How do you suggest merging with the comment below? I think it's closer
to the code it comments about, so I'd rather keep it where it is.

 +if (tbl-it_offset == 0)
 +clear_bit(0, tbl-it_map);
 +
  /* verify that table contains no entries */
  /* it_size is in entries, and we're examining 64 at a time */
 
 The comment would be merged as well? :-)
 

I also considered replacing this code by this:

/* verify that table contains no entries */
-   /* it_size is in entries, and we're examining 64 at a time */
-   for (i = 0; i  (tbl-it_size/64); i++) {
-   if (tbl-it_map[i] != 0) {
+   if (find_first_bit(tbl-it_map, tbl-it_size)  tbl-it_size)
printk(KERN_WARNING %s: Unexpected TCEs for
%s\n,
__func__, node_name);
-   break;
-   }
-   }
 
I'll resend the unreserving patch with the fixed comment styling, but
without merging comments. And I will send this other patch for comments.

Regards.
Thadeu Cascardo.

 
  for (i = 0; i  (tbl-it_size/64); i++) {
 
 Thanks,
 Gavin

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


[PATCH 1/2] ppc/iommu: prevent false TCE leak message

2012-12-28 Thread Thadeu Lima de Souza Cascardo
When a device DMA window includes the address 0, it's reserved in the
TCE bitmap to avoid returning that address to drivers.

When the device is removed, the bitmap is checked for any mappings not
removed by the driver, indicating a possible DMA mapping leak. Since the
reserved address is not cleared, a message is printed, warning of such a
leak.

Check for the reservation, and clear it before checking for any other
standing mappings.

Signed-off-by: Thadeu Lima de Souza Cascardo casca...@linux.vnet.ibm.com
---
 arch/powerpc/kernel/iommu.c |7 +++
 1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 8226c6c..6d48ff8 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -717,6 +717,13 @@ void iommu_free_table(struct iommu_table *tbl, const char 
*node_name)
return;
}
 
+   /*
+* In case we have reserved the first bit, we should not emit
+* the warning below.
+*/
+   if (tbl-it_offset == 0)
+   clear_bit(0, tbl-it_map);
+
/* verify that table contains no entries */
/* it_size is in entries, and we're examining 64 at a time */
for (i = 0; i  (tbl-it_size/64); i++) {
-- 
1.7.1

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


[PATCH 2/2] ppc/iommu: use find_first_bit to look up entries in the iommu table

2012-12-28 Thread Thadeu Lima de Souza Cascardo
Signed-off-by: Thadeu Lima de Souza Cascardo casca...@linux.vnet.ibm.com
---
 arch/powerpc/kernel/iommu.c |7 +--
 1 files changed, 1 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 6d48ff8..91e2b99 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -725,14 +725,9 @@ void iommu_free_table(struct iommu_table *tbl, const char 
*node_name)
clear_bit(0, tbl-it_map);
 
/* verify that table contains no entries */
-   /* it_size is in entries, and we're examining 64 at a time */
-   for (i = 0; i  (tbl-it_size/64); i++) {
-   if (tbl-it_map[i] != 0) {
+   if (find_first_bit(tbl-it_map, tbl-it_size)  tbl-it_size)
printk(KERN_WARNING %s: Unexpected TCEs for %s\n,
__func__, node_name);
-   break;
-   }
-   }
 
/* calculate bitmap size in bytes */
bitmap_sz = (tbl-it_size + 7) / 8;
-- 
1.7.1

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


[PATCH 1/2] EEH/OF: checking for CONFIG_EEH is not needed

2012-12-28 Thread Thadeu Lima de Souza Cascardo
The functions used are already defined as empty inline functions for the
case where EEH is disabled.

Signed-off-by: Thadeu Lima de Souza Cascardo casca...@linux.vnet.ibm.com
---
 arch/powerpc/kernel/of_platform.c |4 
 1 files changed, 0 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kernel/of_platform.c 
b/arch/powerpc/kernel/of_platform.c
index 2049f2d..c5fc6b2 100644
--- a/arch/powerpc/kernel/of_platform.c
+++ b/arch/powerpc/kernel/of_platform.c
@@ -71,10 +71,8 @@ static int __devinit of_pci_phb_probe(struct platform_device 
*dev)
eeh_dev_phb_init_dynamic(phb);
 
/* Register devices with EEH */
-#ifdef CONFIG_EEH
if (dev-dev.of_node-child)
eeh_add_device_tree_early(dev-dev.of_node);
-#endif /* CONFIG_EEH */
 
/* Scan the bus */
pcibios_scan_phb(phb);
@@ -88,9 +86,7 @@ static int __devinit of_pci_phb_probe(struct platform_device 
*dev)
pcibios_claim_one_bus(phb-bus);
 
/* Finish EEH setup */
-#ifdef CONFIG_EEH
eeh_add_device_tree_late(phb-bus);
-#endif
 
/* Add probed PCI devices to the device model */
pci_bus_add_devices(phb-bus);
-- 
1.7.1

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


[PATCH 2/2] ppc/EEH: fix crash when adding a device in a slot with DDW

2012-12-28 Thread Thadeu Lima de Souza Cascardo
The DDW code uses a eeh_dev struct from the pci_dev. However, this is
not set until eeh_add_device_late is called.

Since pci_bus_add_devices is called before eeh_add_device_late, the PCI
devices are added to the bus, making drivers' probe hooks to be called.
These will call set_dma_mask, which will call the DDW code, which will
require the eeh_dev struct from pci_dev. This would result in a crash,
due to a NULL dereference.

Calling eeh_add_device_late after pci_bus_add_devices would make the
system BUG, because device files shouldn't be added to devices there
were not added to the system. So, a new function is needed to add such
files only after pci_bus_add_devices have been called.

Cc: sta...@vger.kernel.org
Signed-off-by: Thadeu Lima de Souza Cascardo casca...@linux.vnet.ibm.com
---
 arch/powerpc/include/asm/eeh.h   |3 +++
 arch/powerpc/kernel/of_platform.c|3 +++
 arch/powerpc/kernel/pci-common.c |7 +--
 arch/powerpc/platforms/pseries/eeh.c |   24 +++-
 4 files changed, 34 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
index b0ef738..0f816da 100644
--- a/arch/powerpc/include/asm/eeh.h
+++ b/arch/powerpc/include/asm/eeh.h
@@ -201,6 +201,7 @@ int eeh_dev_check_failure(struct eeh_dev *edev);
 void __init eeh_addr_cache_build(void);
 void eeh_add_device_tree_early(struct device_node *);
 void eeh_add_device_tree_late(struct pci_bus *);
+void eeh_add_sysfs_files(struct pci_bus *);
 void eeh_remove_bus_device(struct pci_dev *, int);
 
 /**
@@ -240,6 +241,8 @@ static inline void eeh_add_device_tree_early(struct 
device_node *dn) { }
 
 static inline void eeh_add_device_tree_late(struct pci_bus *bus) { }
 
+static inline void eeh_add_sysfs_files(struct pci_bus *bus) { }
+
 static inline void eeh_remove_bus_device(struct pci_dev *dev, int purge_pe) { }
 
 static inline void eeh_lock(void) { }
diff --git a/arch/powerpc/kernel/of_platform.c 
b/arch/powerpc/kernel/of_platform.c
index c5fc6b2..500dd32 100644
--- a/arch/powerpc/kernel/of_platform.c
+++ b/arch/powerpc/kernel/of_platform.c
@@ -91,6 +91,9 @@ static int __devinit of_pci_phb_probe(struct platform_device 
*dev)
/* Add probed PCI devices to the device model */
pci_bus_add_devices(phb-bus);
 
+   /* sysfs files should only be added after devices are added */
+   eeh_add_sysfs_files(phb-bus);
+
return 0;
 }
 
diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
index 7f94f76..4d3de7e 100644
--- a/arch/powerpc/kernel/pci-common.c
+++ b/arch/powerpc/kernel/pci-common.c
@@ -1480,11 +1480,14 @@ void pcibios_finish_adding_to_bus(struct pci_bus *bus)
pcibios_allocate_bus_resources(bus);
pcibios_claim_one_bus(bus);
 
+   /* Fixup EEH */
+   eeh_add_device_tree_late(bus);
+
/* Add new devices to global lists.  Register in proc, sysfs. */
pci_bus_add_devices(bus);
 
-   /* Fixup EEH */
-   eeh_add_device_tree_late(bus);
+   /* sysfs files should only be added after devices are added */
+   eeh_add_sysfs_files(bus);
 }
 EXPORT_SYMBOL_GPL(pcibios_finish_adding_to_bus);
 
diff --git a/arch/powerpc/platforms/pseries/eeh.c 
b/arch/powerpc/platforms/pseries/eeh.c
index 9a04322..6b73d6c 100644
--- a/arch/powerpc/platforms/pseries/eeh.c
+++ b/arch/powerpc/platforms/pseries/eeh.c
@@ -788,7 +788,6 @@ static void eeh_add_device_late(struct pci_dev *dev)
dev-dev.archdata.edev = edev;
 
eeh_addr_cache_insert_dev(dev);
-   eeh_sysfs_add_device(dev);
 }
 
 /**
@@ -815,6 +814,29 @@ void eeh_add_device_tree_late(struct pci_bus *bus)
 EXPORT_SYMBOL_GPL(eeh_add_device_tree_late);
 
 /**
+ * eeh_add_sysfs_files - Add EEH sysfs files for the indicated PCI bus
+ * @bus: PCI bus
+ *
+ * This routine must be used to add EEH sysfs files for PCI
+ * devices which are attached to the indicated PCI bus. The PCI bus
+ * is added after system boot through hotplug or dlpar.
+ */
+void eeh_add_sysfs_files(struct pci_bus *bus)
+{
+   struct pci_dev *dev;
+
+   list_for_each_entry(dev, bus-devices, bus_list) {
+   eeh_sysfs_add_device(dev);
+   if (dev-hdr_type == PCI_HEADER_TYPE_BRIDGE) {
+   struct pci_bus *subbus = dev-subordinate;
+   if (subbus)
+   eeh_add_sysfs_files(subbus);
+   }
+   }
+}
+EXPORT_SYMBOL_GPL(eeh_add_sysfs_files);
+
+/**
  * eeh_remove_device - Undo EEH setup for the indicated pci device
  * @dev: pci device to be removed
  * @purge_pe: remove the PE or not
-- 
1.7.1

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


[PATCH] ppc/iommu: prevent false TCE leak message

2012-12-27 Thread Thadeu Lima de Souza Cascardo
When a device DMA window includes the address 0, it's reserved in the
TCE bitmap to avoid returning that address to drivers.

When the device is removed, the bitmap is checked for any mappings not
removed by the driver, indicating a possible DMA mapping leak. Since the
reserved address is not cleared, a message is printed, warning of such a
leak.

Check for the reservation, and clear it before checking for any other
standing mappings.

Signed-off-by: Thadeu Lima de Souza Cascardo casca...@linux.vnet.ibm.com
---
 arch/powerpc/kernel/iommu.c |5 +
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 8226c6c..226e9e5 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -717,6 +717,11 @@ void iommu_free_table(struct iommu_table *tbl, const char 
*node_name)
return;
}
 
+   /* In case we have reserved the first bit, we should not emit
+* the warning below. */
+   if (tbl-it_offset == 0)
+   clear_bit(0, tbl-it_map);
+
/* verify that table contains no entries */
/* it_size is in entries, and we're examining 64 at a time */
for (i = 0; i  (tbl-it_size/64); i++) {
-- 
1.7.1

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


[PATCH] ppc/EEH: fix crash when adding a device in a slot with DDW

2012-12-27 Thread Thadeu Lima de Souza Cascardo
The DDW code uses a eeh_dev struct from the pci_dev. However, this is
not set until eeh_add_device_late is called.

Since pci_bus_add_devices is called before eeh_add_device_late, the PCI
devices are added to the bus, making drivers' probe hooks to be called.
These will call set_dma_mask, which will call the DDW code, which will
require the eeh_dev struct from pci_dev. This would result in a crash,
due to a NULL dereference.

Calling eeh_add_device_late after pci_bus_add_devices would make the
system BUG, because device files shouldn't be added to devices there
were not added to the system. So, a new function is needed to add such
files only after pci_bus_add_devices have been called.

Cc: sta...@vger.kernel.org
Signed-off-by: Thadeu Lima de Souza Cascardo casca...@linux.vnet.ibm.com
---
 arch/powerpc/include/asm/eeh.h   |3 +++
 arch/powerpc/kernel/pci-common.c |7 +--
 arch/powerpc/platforms/pseries/eeh.c |   24 +++-
 3 files changed, 31 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
index b0ef738..71aac19 100644
--- a/arch/powerpc/include/asm/eeh.h
+++ b/arch/powerpc/include/asm/eeh.h
@@ -201,6 +201,7 @@ int eeh_dev_check_failure(struct eeh_dev *edev);
 void __init eeh_addr_cache_build(void);
 void eeh_add_device_tree_early(struct device_node *);
 void eeh_add_device_tree_late(struct pci_bus *);
+void eeh_add_device_tree_files(struct pci_bus *);
 void eeh_remove_bus_device(struct pci_dev *, int);
 
 /**
@@ -240,6 +241,8 @@ static inline void eeh_add_device_tree_early(struct 
device_node *dn) { }
 
 static inline void eeh_add_device_tree_late(struct pci_bus *bus) { }
 
+static inline void eeh_add_device_tree_files(struct pci_bus *bus) { }
+
 static inline void eeh_remove_bus_device(struct pci_dev *dev, int purge_pe) { }
 
 static inline void eeh_lock(void) { }
diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
index 7f94f76..7b1f14c 100644
--- a/arch/powerpc/kernel/pci-common.c
+++ b/arch/powerpc/kernel/pci-common.c
@@ -1480,11 +1480,14 @@ void pcibios_finish_adding_to_bus(struct pci_bus *bus)
pcibios_allocate_bus_resources(bus);
pcibios_claim_one_bus(bus);
 
+   /* Fixup EEH */
+   eeh_add_device_tree_late(bus);
+
/* Add new devices to global lists.  Register in proc, sysfs. */
pci_bus_add_devices(bus);
 
-   /* Fixup EEH */
-   eeh_add_device_tree_late(bus);
+   /* Add EEH sysfs files */
+   eeh_add_device_tree_files(bus);
 }
 EXPORT_SYMBOL_GPL(pcibios_finish_adding_to_bus);
 
diff --git a/arch/powerpc/platforms/pseries/eeh.c 
b/arch/powerpc/platforms/pseries/eeh.c
index 9a04322..a667a34 100644
--- a/arch/powerpc/platforms/pseries/eeh.c
+++ b/arch/powerpc/platforms/pseries/eeh.c
@@ -788,7 +788,6 @@ static void eeh_add_device_late(struct pci_dev *dev)
dev-dev.archdata.edev = edev;
 
eeh_addr_cache_insert_dev(dev);
-   eeh_sysfs_add_device(dev);
 }
 
 /**
@@ -815,6 +814,29 @@ void eeh_add_device_tree_late(struct pci_bus *bus)
 EXPORT_SYMBOL_GPL(eeh_add_device_tree_late);
 
 /**
+ * eeh_add_device_tree_files - Add EEH sysfs files for the indicated PCI bus
+ * @bus: PCI bus
+ *
+ * This routine must be used to add EEH sysfs files for PCI
+ * devices which are attached to the indicated PCI bus. The PCI bus
+ * is added after system boot through hotplug or dlpar.
+ */
+void eeh_add_device_tree_files(struct pci_bus *bus)
+{
+   struct pci_dev *dev;
+
+   list_for_each_entry(dev, bus-devices, bus_list) {
+   eeh_sysfs_add_device(dev);
+   if (dev-hdr_type == PCI_HEADER_TYPE_BRIDGE) {
+   struct pci_bus *subbus = dev-subordinate;
+   if (subbus)
+   eeh_add_device_tree_files(subbus);
+   }
+   }
+}
+EXPORT_SYMBOL_GPL(eeh_add_device_tree_files);
+
+/**
  * eeh_remove_device - Undo EEH setup for the indicated pci device
  * @dev: pci device to be removed
  * @purge_pe: remove the PE or not
-- 
1.7.1

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH] mlx4_en: map entire pages to increase throughput

2012-07-16 Thread Thadeu Lima de Souza Cascardo
On Mon, Jul 16, 2012 at 10:27:57AM -0700, Rick Jones wrote:
 On 07/16/2012 10:01 AM, Thadeu Lima de Souza Cascardo wrote:
 In its receive path, mlx4_en driver maps each page chunk that it pushes
 to the hardware and unmaps it when pushing it up the stack. This limits
 throughput to about 3Gbps on a Power7 8-core machine.
 
 That seems rather extraordinarily low - Power7 is supposed to be a
 rather high performance CPU.  The last time I noticed O(3Gbit/s) on
 10G for bulk transfer was before the advent of LRO/GRO - that was in
 the x86 space though.  Is mapping really that expensive with Power7?
 

Copying linuxppc-dev and Anton here. But I can tell you that we have
lock contention when doing the mapping on the same adapter (map table
per device). Anton has sent some patches that improves that *a lot*.

However, for 1500 MTU, mlx4_en was doing two unmaps and two maps per
packet. The problem is not the CPU power needed to do the mappings, but
that we find the lock contention and end up with the CPUs more than 30%
of the time spent on spin locking.

 
 One solution is to map the entire allocated page at once. However, this
 requires that we keep track of every page fragment we give to a
 descriptor. We also need to work with the discipline that all fragments will
 be released (in the sense that it will not be reused by the driver
 anymore) in the order they are allocated to the driver.
 
 This requires that we don't reuse any fragments, every single one of
 them must be reallocated. We do that by releasing all the fragments that
 are processed and only after finished processing the descriptors, we
 start the refill.
 
 We also must somehow guarantee that we either refill all fragments in a
 descriptor or none at all, without resorting to giving up a page
 fragment that we would have already given. Otherwise, we would break the
 discipline of only releasing the fragments in the order they were
 allocated.
 
 This has passed page allocation fault injections (restricted to the
 driver by using required-start and required-end) and device hotplug
 while 16 TCP streams were able to deliver more than 9Gbps.
 
 What is the effect on packet-per-second performance?  (eg aggregate,
 burst-mode netperf TCP_RR with TCP_NODELAY set or perhaps UDP_RR)
 

I used uperf with TCP_NODELAY and 16 threads sending from another
machine 64000-sized writes for 60 seconds.

I get 5898op/s (3.02Gb/s) without the patch against 18022ops/s
(9.23Gb/s) with the patch.

Best regards.
Cascardo.


 rick jones
 --
 To unsubscribe from this list: send the line unsubscribe netdev in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH] mlx4_en: map entire pages to increase throughput

2012-07-16 Thread Thadeu Lima de Souza Cascardo
On Mon, Jul 16, 2012 at 12:42:41PM -0700, Rick Jones wrote:
 On 07/16/2012 12:06 PM, Thadeu Lima de Souza Cascardo wrote:
 On Mon, Jul 16, 2012 at 10:27:57AM -0700, Rick Jones wrote:
 
 What is the effect on packet-per-second performance?  (eg aggregate,
 burst-mode netperf TCP_RR with TCP_NODELAY set or perhaps UDP_RR)
 
 I used uperf with TCP_NODELAY and 16 threads sending from another
 machine 64000-sized writes for 60 seconds.
 
 I get 5898op/s (3.02Gb/s) without the patch against 18022ops/s
 (9.23Gb/s) with the patch.
 
 I was thinking more along the lines of an additional comparison,
 explicitly using netperf TCP_RR or something like it, not just the
 packets per second from a bulk transfer test.
 
 rick
 --
 To unsubscribe from this list: send the line unsubscribe netdev in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 

I used a uperf profile that is similar to TCP_RR. It writes, then reads
some bytes. I kept the TCP_NODELAY flag.

Without the patch, I saw the following:

packet size ops/s   Gb/s
1   337024  0.0027
90  276620  0.199
900 190455  1.37
400068863   2.20
900045638   3.29
6   94094.52

With the patch:

packet size ops/s   Gb/s
1   451738  0.0036
90  345682  0.248
900 272258  1.96
4000127055  4.07
9000106614  7.68
6   30671   14.72

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH] mlx4_en: map entire pages to increase throughput

2012-07-16 Thread Thadeu Lima de Souza Cascardo
On Mon, Jul 16, 2012 at 11:43:33PM +0300, Or Gerlitz wrote:
 On Mon, Jul 16, 2012 at 10:42 PM, Rick Jones rick.jon...@hp.com wrote:
 
  I was thinking more along the lines of an additional comparison,
  explicitly using netperf TCP_RR or something like it, not just the packets
  per second from a bulk transfer test.
 
 
 TCP_STREAM from this setup before the patch would be good to know as well
 

Hi, Or.

Does the stream test that I did with uperf using messages of 64000 bytes
fit?

TCP_NODELAY does not make a difference in this case. I get something
around 3Gbps before the patch and something around 9Gbps after the
patch.

Before the patch:

# ./uperf-1.0.3-beta/src/uperf -m tcp.xml
Starting 16 threads running profile:tcp_stream ...   0.00 seconds
Txn1  0 /1.00(s) =0  16op/s
Txn220.81GB /59.26(s) = 3.02Gb/s5914op/s
Txn3  0 /0.00(s) =0  128295op/s
---
Total   20.81GB /61.37(s) = 2.91Gb/s5712op/s

Netstat statistics for this run
---
Nic   opkts/s ipkts/s obits/s ibits/s
eth6   252459   31694   3.06Gb/s  16.74Mb/s
eth02  18   3.87Kb/s  14.28Kb/s
---

Run Statistics
Hostname   TimeData   Throughput   Operations
Errors
---
10.0.0.2 61.47s 20.81GB 2.91Gb/s   350528
0.00
master   61.37s 20.81GB 2.91Gb/s   350528
0.00
---
Difference(%) -0.16%  0.00%0.16%0.00%
0.00%


After the patch:

# ./uperf-1.0.3-beta/src/uperf -m tcp.xml
Starting 16 threads running profile:tcp_stream ...   0.00 seconds
Txn1  0 /1.00(s) =0  16op/s
Txn264.50GB /60.27(s) = 9.19Gb/s   17975op/s
Txn3  0 /0.00(s) =0
---
Total   64.50GB /62.27(s) = 8.90Gb/s   17397op/s

Netstat statistics for this run
---
Nic   opkts/s ipkts/s obits/s ibits/s
eth6   769428   96018   9.31Gb/s  50.72Mb/s
eth01  15   2.48Kb/s  13.59Kb/s
---

Run Statistics
Hostname   TimeData   Throughput   Operations
Errors
---
10.0.0.2 62.27s 64.36GB 8.88Gb/s  1081096
0.00
master   62.27s 64.50GB 8.90Gb/s  1083325
0.00
---
Difference(%) -0.00%  0.21%0.21%0.21%
0.00%


Profile tcp.xml:

?xml version=1.0?
profile name=TCP_STREAM
  group nthreads=16
transaction iterations=1
flowop type=connect options=remotehost=10.0.0.2 protocol=tcp 
tcp_nodelay/
/transaction
transaction duration=60
flowop type=write options=count=160 size=64000/
/transaction
transaction iterations=1
flowop type=disconnect /
/transaction
  /group
/profile

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


[PATCH] ppc/eeh: fix crash when error happens during device probe

2012-01-11 Thread Thadeu Lima de Souza Cascardo
EEH may happen during a PCI driver probe. If the driver is trying to
access some register in a loop, the EEH code will try to print the
driver name. But the driver pointer in struct pci_dev is not set until
probe returns successfully.

Use a function to test if the device and the driver pointer is NULL
before accessing the driver's name.

Signed-off-by: Thadeu Lima de Souza Cascardo casca...@linux.vnet.ibm.com
---
 arch/powerpc/include/asm/ppc-pci.h   |5 +
 arch/powerpc/platforms/pseries/eeh.c |4 ++--
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/ppc-pci.h 
b/arch/powerpc/include/asm/ppc-pci.h
index 43268f1..6d42297 100644
--- a/arch/powerpc/include/asm/ppc-pci.h
+++ b/arch/powerpc/include/asm/ppc-pci.h
@@ -142,6 +142,11 @@ static inline const char *eeh_pci_name(struct pci_dev 
*pdev)
return pdev ? pci_name(pdev) : null;
 } 
 
+static inline const char *eeh_driver_name(struct pci_dev *pdev)
+{
+   return (pdev  pdev-driver) ? pdev-driver-name : null;
+}
+
 #endif /* CONFIG_EEH */
 
 #else /* CONFIG_PCI */
diff --git a/arch/powerpc/platforms/pseries/eeh.c 
b/arch/powerpc/platforms/pseries/eeh.c
index 5658690..c0b40af 100644
--- a/arch/powerpc/platforms/pseries/eeh.c
+++ b/arch/powerpc/platforms/pseries/eeh.c
@@ -551,9 +551,9 @@ int eeh_dn_check_failure(struct device_node *dn, struct 
pci_dev *dev)
printk (KERN_ERR EEH: %d reads ignored for recovering 
device at 
location=%s driver=%s pci addr=%s\n,
pdn-eeh_check_count, location,
-   dev-driver-name, eeh_pci_name(dev));
+   eeh_driver_name(dev), eeh_pci_name(dev));
printk (KERN_ERR EEH: Might be infinite loop in %s 
driver\n,
-   dev-driver-name);
+   eeh_driver_name(dev));
dump_stack();
}
goto dn_unlock;
-- 
1.7.4.4

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: cpuidle: Default y for pseries

2012-01-11 Thread Thadeu Lima de Souza Cascardo
On Tue, Jan 10, 2012 at 03:05:35PM -, Benjamin Herrenschmidt wrote:
 We just replaced the pseries platform idle loops with a cpuidle backend,
 however that means that you won't get any power saving and won't return
 any unused idle time to the hypervisor unless cpuidle is enabled.
 
 Thus is should default to y when pseries is enabled. I prefer that to
 a select so we can still make it modular if we want to.
 
 Signed-off-by: Benjamin Herrenschmidt b...@kernel.crashing.org
 
 ---
 Linus, do you want to just pick that up or should I put it into powerpc.git
 and ask you to pull ? I will have 2 or 3 other fixes there later today,
 but I wanted to make sure you were ok with the approach with this
 specific one.

Hi, Ben.

Note that building with CONFIG_PSERIES_IDLE=m fails.

  CC [M]  arch/powerpc/platforms/pseries/processor_idle.o
arch/powerpc/platforms/pseries/processor_idle.c:35: error: redefinition
of ‘update_smt_snooze_delay’
/root/linux/arch/powerpc/include/asm/system.h:230: note: previous
definition of ‘update_smt_snooze_delay’ was here
arch/powerpc/platforms/pseries/processor_idle.c:175: error: redefinition
of ‘pseries_notify_cpuidle_add_cpu’
/root/linux/arch/powerpc/include/asm/system.h:231: note: previous
definition of ‘pseries_notify_cpuidle_add_cpu’ was here
make[2]: *** [arch/powerpc/platforms/pseries/processor_idle.o] Error 1
make[1]: *** [arch/powerpc/platforms/pseries] Error 2
make: *** [arch/powerpc/platforms] Error 2

asm/system.h has empty inline implementations for
update_smt_snooze_delay and pseries_notify_cpuidle_add_cpu, which are
used when CONFIG_PSERIES_IDLE is undefined. Since those two functions
are used in core power architecture functions (store_smt_snooze_delay
at kernel/sysfs.c and smp_xics_setup_cpu at platforms/pseries/smp.c),
this requires some rework in these interactions or we should simply
disable PSERIES_IDLE to be built as a module for now.

Regards.
Cascardo.

 
 diff --git a/drivers/cpuidle/Kconfig b/drivers/cpuidle/Kconfig
 index 7dbc4a8..62ca70d 100644
 --- a/drivers/cpuidle/Kconfig
 +++ b/drivers/cpuidle/Kconfig
 @@ -1,7 +1,8 @@
  
  config CPU_IDLE
   bool CPU idle PM support
 - default ACPI
 + default y if ACPI
 + default y if PPC_PSERIES
   help
 CPU idle is a generic framework for supporting software-controlled
 idle processor power management.  It includes modular cross-platform

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH] mlx4_en: fix endianness with blue frame support

2011-10-10 Thread Thadeu Lima de Souza Cascardo
The doorbell register was being unconditionally swapped. In x86, that
meant it was being swapped to BE and written to the descriptor and to
memory, depending on the case of blue frame support or writing to
doorbell register. On PPC, this meant it was being swapped to LE and
then swapped back to BE while writing to the register. But in the blue
frame case, it was being written as LE to the descriptor.

The fix is not to swap doorbell unconditionally, write it to the
register as BE and convert it to BE when writing it to the descriptor.

Signed-off-by: Thadeu Lima de Souza Cascardo casca...@linux.vnet.ibm.com
Reported-by: Richard Hendrickson richh...@us.ibm.com
Cc: Eli Cohen e...@dev.mellanox.co.il
Cc: Yevgeny Petrilin yevge...@mellanox.co.il
Cc: Benjamin Herrenschmidt b...@kernel.crashing.org
---
 drivers/net/mlx4/en_tx.c |6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/mlx4/en_tx.c b/drivers/net/mlx4/en_tx.c
index 6e03de0..f76ab6b 100644
--- a/drivers/net/mlx4/en_tx.c
+++ b/drivers/net/mlx4/en_tx.c
@@ -172,7 +172,7 @@ int mlx4_en_activate_tx_ring(struct mlx4_en_priv *priv,
memset(ring-buf, 0, ring-buf_size);
 
ring-qp_state = MLX4_QP_STATE_RST;
-   ring-doorbell_qpn = swab32(ring-qp.qpn  8);
+   ring-doorbell_qpn = ring-qp.qpn  8;
 
mlx4_en_fill_qp_context(priv, ring-size, ring-stride, 1, 0, ring-qpn,
ring-cqn, ring-context);
@@ -791,7 +791,7 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct 
net_device *dev)
skb_orphan(skb);
 
if (ring-bf_enabled  desc_size = MAX_BF  !bounce  !vlan_tag) {
-   *(u32 *) (tx_desc-ctrl.vlan_tag) |= ring-doorbell_qpn;
+   *(__be32 *) (tx_desc-ctrl.vlan_tag) |= 
cpu_to_be32(ring-doorbell_qpn);
op_own |= htonl((bf_index  0x)  8);
/* Ensure new descirptor hits memory
* before setting ownership of this descriptor to HW */
@@ -812,7 +812,7 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct 
net_device *dev)
wmb();
tx_desc-ctrl.owner_opcode = op_own;
wmb();
-   writel(ring-doorbell_qpn, ring-bf.uar-map + 
MLX4_SEND_DOORBELL);
+   iowrite32be(ring-doorbell_qpn, ring-bf.uar-map + 
MLX4_SEND_DOORBELL);
}
 
/* Poll CQ here */
-- 
1.7.4.4

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH] mlx4_en: fix endianness with blue frame support

2011-10-10 Thread Thadeu Lima de Souza Cascardo
On Mon, Oct 10, 2011 at 01:42:23PM -0300, Thadeu Lima de Souza Cascardo wrote:
 The doorbell register was being unconditionally swapped. In x86, that
 meant it was being swapped to BE and written to the descriptor and to
 memory, depending on the case of blue frame support or writing to
 doorbell register. On PPC, this meant it was being swapped to LE and
 then swapped back to BE while writing to the register. But in the blue
 frame case, it was being written as LE to the descriptor.
 
 The fix is not to swap doorbell unconditionally, write it to the
 register as BE and convert it to BE when writing it to the descriptor.
 
 Signed-off-by: Thadeu Lima de Souza Cascardo casca...@linux.vnet.ibm.com
 Reported-by: Richard Hendrickson richh...@us.ibm.com
 Cc: Eli Cohen e...@dev.mellanox.co.il
 Cc: Yevgeny Petrilin yevge...@mellanox.co.il
 Cc: Benjamin Herrenschmidt b...@kernel.crashing.org
 ---

So I tested this patch and it works for me. Thanks Ben and Eli for
finding out the problem with doorbell in the descriptor.

Regards,
Cascardo.

  drivers/net/mlx4/en_tx.c |6 +++---
  1 files changed, 3 insertions(+), 3 deletions(-)
 
 diff --git a/drivers/net/mlx4/en_tx.c b/drivers/net/mlx4/en_tx.c
 index 6e03de0..f76ab6b 100644
 --- a/drivers/net/mlx4/en_tx.c
 +++ b/drivers/net/mlx4/en_tx.c
 @@ -172,7 +172,7 @@ int mlx4_en_activate_tx_ring(struct mlx4_en_priv *priv,
   memset(ring-buf, 0, ring-buf_size);
 
   ring-qp_state = MLX4_QP_STATE_RST;
 - ring-doorbell_qpn = swab32(ring-qp.qpn  8);
 + ring-doorbell_qpn = ring-qp.qpn  8;
 
   mlx4_en_fill_qp_context(priv, ring-size, ring-stride, 1, 0, ring-qpn,
   ring-cqn, ring-context);
 @@ -791,7 +791,7 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct 
 net_device *dev)
   skb_orphan(skb);
 
   if (ring-bf_enabled  desc_size = MAX_BF  !bounce  !vlan_tag) {
 - *(u32 *) (tx_desc-ctrl.vlan_tag) |= ring-doorbell_qpn;
 + *(__be32 *) (tx_desc-ctrl.vlan_tag) |= 
 cpu_to_be32(ring-doorbell_qpn);
   op_own |= htonl((bf_index  0x)  8);
   /* Ensure new descirptor hits memory
   * before setting ownership of this descriptor to HW */
 @@ -812,7 +812,7 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct 
 net_device *dev)
   wmb();
   tx_desc-ctrl.owner_opcode = op_own;
   wmb();
 - writel(ring-doorbell_qpn, ring-bf.uar-map + 
 MLX4_SEND_DOORBELL);
 + iowrite32be(ring-doorbell_qpn, ring-bf.uar-map + 
 MLX4_SEND_DOORBELL);
   }
 
   /* Poll CQ here */
 -- 
 1.7.4.4
 

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH] mlx4_en: fix transmit of packages when blue frame isenabled

2011-10-07 Thread Thadeu Lima de Souza Cascardo
On Thu, Oct 06, 2011 at 03:10:28PM +0100, David Laight wrote:
  
   static void mlx4_bf_copy(unsigned long *dst, unsigned long *src,
 unsigned bytecnt) {
  +   int i;
  +   __le32 *psrc = (__le32 *)src;
  +
  +   /*
  +* the buffer is already in big endian. For little endian
 machines that's
  +* fine. For big endain machines we must swap since the chipset
 swaps again
  +*/
  +   for (i = 0; i  bytecnt / 4; ++i)
  +   psrc[i] = le32_to_cpu(psrc[i]);
  +
  __iowrite64_copy(dst, src, bytecnt / 8);
   }
 
 That code looks horrid...
 1) I'm not sure the caller expects the buffer to be corrupted.
 2) It contains a lot of memory cycles.
 3) It looked from the calls that this code is copying descriptors,
so the transfer length is probably 1 or 2 words - so the loop
is inefficient.
 4) ppc doesn't have a fast byteswap instruction (very new gcc might
use the byteswapping memery access for the le32_to_cpu() though),
so it would be better getting the byteswap done inside
__iowrite64_copy() - since that is probably requesting a byteswap
anyway.
 OTOH I'm not at all clear about the 64bit xfers
 
 

Plus, it doesn't work.  :-\

After doing some more investigation, I decided to add the ARP entries
manually and test for different packet sizes. Packets larger than 256
should use the non-BF path and work. Smaller packets should fail. To my
surprise, the smaller packets were fine. Tried many different sizes and
they all succeeded. Then, tried using broadcast and it worked too (after
setting the proper sysctl). Broadcast with 0, 8 and 16 ping size, all
OK. Removed the ARP entries, the ARP responses did not get to the other
side. So, I did a pretty nasty hack to not use BF for ARP packets and
after cleaning up the ARP table, everything seems to be working fine.
What follows is my patch, just for reference.

Regards,
Cascardo.

---
--- drivers/net/mlx4/en_tx.c.orig   2011-10-03 15:42:15.736104623 -0500
+++ drivers/net/mlx4/en_tx.c2011-10-07 17:12:38.023792483 -0500
@@ -627,6 +627,7 @@
int lso_header_size;
void *fragptr;
bool bounce = false;
+   bool is_arp = false;
 
if (!priv-port_up)
goto tx_drop;
@@ -698,10 +699,11 @@
priv-port_stats.tx_chksum_offload++;
}
 
+   skb_reset_mac_header(skb);
+   ethh = eth_hdr(skb);
+   is_arp = ethh  ethh-h_proto == ETH_P_ARP;
if (unlikely(priv-validate_loopback)) {
/* Copy dst mac address to wqe */
-   skb_reset_mac_header(skb);
-   ethh = eth_hdr(skb);
if (ethh  ethh-h_dest) {
mac = mlx4_en_mac_to_u64(ethh-h_dest);
mac_h = (u32) ((mac  0xULL)  16);
@@ -790,7 +792,7 @@
if (likely(!skb_shared(skb)))
skb_orphan(skb);
 
-   if (ring-bf_enabled  desc_size = MAX_BF  !bounce  !vlan_tag) {
+   if (ring-bf_enabled  desc_size = MAX_BF  !bounce  !vlan_tag  
!is_arp) {
*(u32 *) (tx_desc-ctrl.vlan_tag) |= ring-doorbell_qpn;
op_own |= htonl((bf_index  0x)  8);
/* Ensure new descirptor hits memory
---

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH] mlx4_en: fix transmit of packages when blue frame is enabled

2011-10-04 Thread Thadeu Lima de Souza Cascardo
On Tue, Oct 04, 2011 at 08:02:12AM +0200, Benjamin Herrenschmidt wrote:
 On Mon, 2011-10-03 at 17:53 -0300, Thadeu Lima de Souza Cascardo wrote:
 
  .../...
 
   Can you also send me the output of ethtool -i?
   It seems that there is a problem with write combining on Power 
   processors, we will check this issue.
   
   Yevgeny
  
  Hello, Yevgeny.
  
  You will find the output of ethtool -i below.
  
  I am copying Ben and powerpc list, in case this is an issue with Power
  processors. They can provide us some more insight into this.
 
 May I get some background please ? :-)
 
 I'm not aware of a specific issue with write combining but I'd need to
 know more about what you are doing and the code to do it to comment on
 whether it should work or not.
 
 Cheers,
 Ben.
 
 

Hello, Ben.

Sorry for that. I am testing mlx4_en driver on a POWER. Yevgeny has
added blue frame support, that does not require writing to the device
memory to indicate a new packet (the doorbell register as it is called).

Well, the ring is getting full with no interrupt or packets transmitted.
I simply added a write to the doorbell register and it works for me.
Yevgeny says this is not the right fix, claiming there is a problem with
write combining on POWER. The code uses memory barriers, so I don't know
why there is any problem.

I am posting the code here to show better what the situation is.
Yevgeny can tell more about the device and the driver.

The code below is the driver as of now, including a diff with what I
changed and had resulted OK for me. Before the blue frame support, the
only code executed was the else part. I can't tell much what the device
should be seeing and doing after the blue frame part of the code is
executed. But it does send the packet if I write to the doorbell
register.

Yevgeny, can you tell us what the device should be doing and why you
think this is a problem on POWER? Is it possible that this is simply a
problem with the firmware version?

Thanks,
Cascardo.

---
if (ring-bf_enabled  desc_size = MAX_BF  !bounce 
!vlan_tag) {
*(u32 *) (tx_desc-ctrl.vlan_tag) |=
ring-doorbell_qpn;
op_own |= htonl((bf_index  0x)  8);
/* Ensure new descirptor hits memory
* before setting ownership of this descriptor to HW */
wmb();
tx_desc-ctrl.owner_opcode = op_own;

wmb();

mlx4_bf_copy(ring-bf.reg + ring-bf.offset, (unsigned
long *) tx_desc-ctrl,
 desc_size);

wmb();

ring-bf.offset ^= ring-bf.buf_size;
} else {
/* Ensure new descirptor hits memory
* before setting ownership of this descriptor to HW */
wmb();
tx_desc-ctrl.owner_opcode = op_own;
-   wmb();
-   writel(ring-doorbell_qpn, ring-bf.uar-map +
MLX4_SEND_DOORBELL);
}

+   wmb();
+   writel(ring-doorbell_qpn, ring-bf.uar-map +
MLX4_SEND_DOORBELL);
+
---

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH] mlx4_en: fix transmit of packages when blue frame is enabled

2011-10-03 Thread Thadeu Lima de Souza Cascardo
On Mon, Oct 03, 2011 at 02:56:08PM +, Yevgeny Petrilin wrote:
  Hello, Yevgeny.
  
  We use a MT26448 (lspci -v output bellow) on a POWER7. Any other
  information, tests or debug patches you want me to try, just tell me.
  
  I expected this was really not the proper fix, but thought it would be
  better to send it than just guess. Any clues what the problem might be?
  Perhaps, we would have to disable blue frame for this particular device?
  
  Regards,
  Cascardo.
  
  ---
  lspci output
  0006:01:00.0 Ethernet controller: Mellanox Technologies MT26448 [ConnectX 
  EN 10GigE, PCIe 2.0 5GT/s] (rev b0)
  ---
  lspci -v output
  0006:01:00.0 0200: 15b3:6750 (rev b0)
  Subsystem: 1014:0416
  Flags: bus master, fast devsel, latency 0, IRQ 31
  Memory at 3da47be0 (64-bit, non-prefetchable) [size=1M]
  Memory at 3da47c00 (64-bit, prefetchable) [size=32M]
  Expansion ROM at 3da47bf0 [disabled] [size=1M]
  Capabilities: [40] Power Management version 3
  Capabilities: [48] Vital Product Data
  Capabilities: [9c] MSI-X: Enable+ Count=128 Masked-
  Capabilities: [60] Express Endpoint, MSI 00
  Capabilities: [100] Alternative Routing-ID Interpretation (ARI)
  Capabilities: [148] Device Serial Number 00-02-c9-03-00-0f-50-be
  Kernel driver in use: mlx4_core
  Kernel modules: mlx4_en, mlx4_core
  ---
 
 Cascardo,
 
 Can you also send me the output of ethtool -i?
 It seems that there is a problem with write combining on Power processors, we 
 will check this issue.
 
 Yevgeny

Hello, Yevgeny.

You will find the output of ethtool -i below.

I am copying Ben and powerpc list, in case this is an issue with Power
processors. They can provide us some more insight into this.

Thanks,
Cascardo.

---
# ethtool -i eth10
driver: mlx4_en
version: 1.5.4.1 (March 2011)
firmware-version: 2.9.4130
bus-info: 0006:01:00.0

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


[PATCH v2] powerpc: reserve iommu page 0

2011-09-21 Thread Thadeu Lima de Souza Cascardo
Some devices have a dma-window that starts at the address 0. This allows
DMA addresses to be mapped to this address and returned to drivers as a
valid DMA address. Some drivers may not behave well in this case, since
the address 0 is considered an error or not allocated.

The solution to avoid this kind of error from happening is reserve the
page addressed as 0 so it cannot be allocated for a DMA mapping.

Ben Herrenschmidt deserves the credit for this patch. He pointed out the
solution and what code would do the job.

v2:
Add a comment in the code.

Signed-off-by: Thadeu Lima de Souza Cascardo casca...@linux.vnet.ibm.com
Cc: Benjamin Herrenschmidt b...@kernel.crashing.org
Cc: Paul Mackerras pau...@samba.org
Cc: linuxppc-dev@lists.ozlabs.org
---

Resending, since it was recommended to add a comment about this
particular code.

---
 arch/powerpc/kernel/iommu.c |8 
 1 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 961bb03..8395301 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -501,6 +501,14 @@ struct iommu_table *iommu_init_table(struct iommu_table 
*tbl, int nid)
tbl-it_map = page_address(page);
memset(tbl-it_map, 0, sz);
 
+   /*
+* Reserve page 0 so it will not be used for any mappings.
+* This avoids buggy drivers that consider page 0 to be invalid
+* to crash the machine or even lose data.
+   */
+   if (tbl-it_offset == 0)
+   set_bit(0, tbl-it_map);
+   
tbl-it_hint = 0;
tbl-it_largehint = tbl-it_halfpoint;
spin_lock_init(tbl-it_lock);
-- 
1.7.4.4

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


[PATCH] powerpc: reserve iommu page 0

2011-09-20 Thread Thadeu Lima de Souza Cascardo
Some devices have a dma-window that starts at the address 0. This allows
DMA addresses to be mapped to this address and returned to drivers as a
valid DMA address. Some drivers may not behave well in this case, since
the address 0 is considered an error or not allocated.

The solution to avoid this kind of error from happening is reserve the
page addressed as 0 so it cannot be allocated for a DMA mapping.

Ben Herrenschmidt deserves the credit for this patch. He pointed out the
solution and what code would do the job.

Signed-off-by: Thadeu Lima de Souza Cascardo casca...@linux.vnet.ibm.com
Cc: Benjamin Herrenschmidt b...@kernel.crashing.org
Cc: Paul Mackerras pau...@samba.org
Cc: linuxppc-dev@lists.ozlabs.org
---
 arch/powerpc/kernel/iommu.c |3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 961bb03..53411bc 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -501,6 +501,9 @@ struct iommu_table *iommu_init_table(struct iommu_table 
*tbl, int nid)
tbl-it_map = page_address(page);
memset(tbl-it_map, 0, sz);
 
+   if (tbl-it_offset == 0)
+   set_bit(0, tbl-it_map);
+   
tbl-it_hint = 0;
tbl-it_largehint = tbl-it_halfpoint;
spin_lock_init(tbl-it_lock);
-- 
1.7.4.4

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


[PATCH] powerpc/eeh: fix /proc/ppc64/eeh creation

2011-08-26 Thread Thadeu Lima de Souza Cascardo
Since commit 188917e183cf9ad0374b571006d0fc6d48a7f447, /proc/ppc64 is a
symlink to /proc/powerpc/. That means that creating /proc/ppc64/eeh will
end up with a unaccessible file, that is not listed under /proc/powerpc/
and, then, not listed under /proc/ppc64/.

Creating /proc/powerpc/eeh fixes that problem and maintain the
compatibility intended with the ppc64 symlink.

Signed-off-by: Thadeu Lima de Souza Cascardo casca...@linux.vnet.ibm.com
---
 arch/powerpc/platforms/pseries/eeh.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/eeh.c 
b/arch/powerpc/platforms/pseries/eeh.c
index ada6e07..d42f37d 100644
--- a/arch/powerpc/platforms/pseries/eeh.c
+++ b/arch/powerpc/platforms/pseries/eeh.c
@@ -1338,7 +1338,7 @@ static const struct file_operations proc_eeh_operations = 
{
 static int __init eeh_init_proc(void)
 {
if (machine_is(pseries))
-   proc_create(ppc64/eeh, 0, NULL, proc_eeh_operations);
+   proc_create(powerpc/eeh, 0, NULL, proc_eeh_operations);
return 0;
 }
 __initcall(eeh_init_proc);
-- 
1.7.4.4

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


[PATCH] trivial: fix typo s/leve/level/ in powerpc arch

2010-01-17 Thread Thadeu Lima de Souza Cascardo
Signed-off-by: Thadeu Lima de Souza Cascardo casca...@holoscopio.com
---
 arch/powerpc/mm/tlb_low_64e.S |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/powerpc/mm/tlb_low_64e.S b/arch/powerpc/mm/tlb_low_64e.S
index f288279..8b04c54 100644
--- a/arch/powerpc/mm/tlb_low_64e.S
+++ b/arch/powerpc/mm/tlb_low_64e.S
@@ -1,5 +1,5 @@
 /*
- *  Low leve TLB miss handlers for Book3E
+ *  Low level TLB miss handlers for Book3E
  *
  *  Copyright (C) 2008-2009
  *  Ben. Herrenschmidt (b...@kernel.crashing.org), IBM Corp.
-- 
1.6.6

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev