Re: [PATCH v5 1/3] locking/rwsem: Remove arch specific rwsem files
On 03/22/2019 01:01 PM, Linus Torvalds wrote: > On Fri, Mar 22, 2019 at 7:30 AM Waiman Long wrote: >> 19 files changed, 133 insertions(+), 930 deletions(-) > Lovely. And it all looks sane to me. > > So ack. > > The only comment I have is about __down_read_trylock(), which probably > isn't critical enough to actually care about, but: > >> +static inline int __down_read_trylock(struct rw_semaphore *sem) >> +{ >> + long tmp; >> + >> + while ((tmp = atomic_long_read(>count)) >= 0) { >> + if (tmp == atomic_long_cmpxchg_acquire(>count, tmp, >> + tmp + RWSEM_ACTIVE_READ_BIAS)) { >> + return 1; >> + } >> + } >> + return 0; >> +} > So this seems to > > (a) read the line early (the whole cacheline in shared state issue) > > (b) read the line again unnecessarily in the while loop > > Now, (a) might be explained by "well, maybe we do trylock even with > existing readers", although I continue to think that the case we > should optimize for is simply the uncontended one, where we don't even > have multiple readers. > > But (b) just seems silly. > > So I wonder if it shouldn't just be > > long tmp = 0; > > do { > long new = atomic_long_cmpxchg_acquire(>count, tmp, > tmp + RWSEM_ACTIVE_READ_BIAS); > if (likely(new == tmp)) > return 1; >tmp = new; > } while (tmp >= 0); > return 0; > > which would seem simpler and solve both issues. Hmm? > > But honestly, I didn't check what our uses of down_read_trylock() look > like. We have more of them than I expected, and I _think_ the normal > case is the "nobody else holds the lock", but that's just a gut > feeling. > > Some of them _might_ be performance-critical. There's the one on > mmap_sem in the fault handling path, for example. And yes, I'd expect > the normal case to very much be "no other readers or writers" for that > one. > > NOTE! The above code snippet is absolutely untested, and might be > completely wrong. Take it as a "something like this" rather than > anything else. > >Linus As you have noticed already, this patch is just for moving code around without changing it. I optimize __down_read_trylock() in patch 3. Cheers, Longman
Re: [PATCH v5 2/3] locking/rwsem: Remove rwsem-spinlock.c & use rwsem-xadd.c for all archs
On Fri, Mar 22, 2019 at 7:30 AM Waiman Long wrote: > > For simplication, we are going to remove rwsem-spinlock.c and make all > architectures use a single implementation of rwsem - rwsem-xadd.c. Ack. Linus
Re: [PATCH v5 3/3] locking/rwsem: Optimize down_read_trylock()
On Fri, Mar 22, 2019 at 7:30 AM Waiman Long wrote: > > Modify __down_read_trylock() to optimize for an unlocked rwsem and make > it generate slightly better code. Oh, that should teach me to read all patches in the series before starting to comment on them. So ignore my comment on #1. Linus
[PATCH v5 1/3] locking/rwsem: Remove arch specific rwsem files
As the generic rwsem-xadd code is using the appropriate acquire and release versions of the atomic operations, the arch specific rwsem.h files will not be that much faster than the generic code as long as the atomic functions are properly implemented. So we can remove those arch specific rwsem.h and stop building asm/rwsem.h to reduce maintenance effort. Currently, only x86, alpha and ia64 have implemented architecture specific fast paths. I don't have access to alpha and ia64 systems for testing, but they are legacy systems that are not likely to be updated to the latest kernel anyway. By using a rwsem microbenchmark, the total locking rates on a 4-socket 56-core 112-thread x86-64 system before and after the patch were as follows (mixed means equal # of read and write locks): Before Patch After Patch # of Threads wlock rlock mixed wlock rlock mixed - - - - - - 129,201 30,143 29,45828,615 30,172 29,201 2 6,807 13,299 1,171 7,725 15,025 1,804 4 6,504 12,755 1,520 7,127 14,286 1,345 8 6,762 13,412 764 6,826 13,652 726 16 6,693 15,408 662 6,599 15,938 626 32 6,145 15,286 496 5,549 15,487 511 64 5,812 15,495 60 5,858 15,572 60 There were some run-to-run variations for the multi-thread tests. For x86-64, using the generic C code fast path seems to be a little bit faster than the assembly version with low lock contention. Looking at the assembly version of the fast paths, there are assembly to/from C code wrappers that save and restore all the callee-clobbered registers (7 registers on x86-64). The assembly generated from the generic C code doesn't need to do that. That may explain the slight performance gain here. The generic asm rwsem.h can also be merged into kernel/locking/rwsem.h with no code change as no other code other than those under kernel/locking needs to access the internal rwsem macros and functions. Signed-off-by: Waiman Long --- MAINTAINERS | 1 - arch/alpha/include/asm/rwsem.h | 211 arch/arm/include/asm/Kbuild | 1 - arch/arm64/include/asm/Kbuild | 1 - arch/hexagon/include/asm/Kbuild | 1 - arch/ia64/include/asm/rwsem.h | 172 --- arch/powerpc/include/asm/Kbuild | 1 - arch/s390/include/asm/Kbuild| 1 - arch/sh/include/asm/Kbuild | 1 - arch/sparc/include/asm/Kbuild | 1 - arch/x86/include/asm/rwsem.h| 237 arch/x86/lib/Makefile | 1 - arch/x86/lib/rwsem.S| 156 - arch/x86/um/Makefile| 1 - arch/xtensa/include/asm/Kbuild | 1 - include/asm-generic/rwsem.h | 140 --- include/linux/rwsem.h | 4 +- kernel/locking/percpu-rwsem.c | 2 + kernel/locking/rwsem.h | 130 ++ 19 files changed, 133 insertions(+), 930 deletions(-) delete mode 100644 arch/alpha/include/asm/rwsem.h delete mode 100644 arch/ia64/include/asm/rwsem.h delete mode 100644 arch/x86/include/asm/rwsem.h delete mode 100644 arch/x86/lib/rwsem.S delete mode 100644 include/asm-generic/rwsem.h diff --git a/MAINTAINERS b/MAINTAINERS index e17ebf70b548..6bfd5a94c08e 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -9089,7 +9089,6 @@ F:arch/*/include/asm/spinlock*.h F: include/linux/rwlock*.h F: include/linux/mutex*.h F: include/linux/rwsem*.h -F: arch/*/include/asm/rwsem.h F: include/linux/seqlock.h F: lib/locking*.[ch] F: kernel/locking/ diff --git a/arch/alpha/include/asm/rwsem.h b/arch/alpha/include/asm/rwsem.h deleted file mode 100644 index cf8fc8f9a2ed.. --- a/arch/alpha/include/asm/rwsem.h +++ /dev/null @@ -1,211 +0,0 @@ -/* SPDX-License-Identifier: GPL-2.0 */ -#ifndef _ALPHA_RWSEM_H -#define _ALPHA_RWSEM_H - -/* - * Written by Ivan Kokshaysky , 2001. - * Based on asm-alpha/semaphore.h and asm-i386/rwsem.h - */ - -#ifndef _LINUX_RWSEM_H -#error "please don't include asm/rwsem.h directly, use linux/rwsem.h instead" -#endif - -#ifdef __KERNEL__ - -#include - -#define RWSEM_UNLOCKED_VALUE 0xL -#define RWSEM_ACTIVE_BIAS 0x0001L -#define RWSEM_ACTIVE_MASK 0xL -#define RWSEM_WAITING_BIAS (-0x0001L) -#define RWSEM_ACTIVE_READ_BIAS RWSEM_ACTIVE_BIAS -#define RWSEM_ACTIVE_WRITE_BIAS(RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS) - -static inline int ___down_read(struct rw_semaphore *sem) -{ - long oldcount; -#ifndefCONFIG_SMP - oldcount = sem->count.counter; - sem->count.counter += RWSEM_ACTIVE_READ_BIAS; -#else - long temp; - __asm__ __volatile__( - "1: ldq_l
[PATCH v5 3/3] locking/rwsem: Optimize down_read_trylock()
Modify __down_read_trylock() to optimize for an unlocked rwsem and make it generate slightly better code. Before this patch, down_read_trylock: 0x <+0>: callq 0x5 0x0005 <+5>: jmp0x18 0x0007 <+7>: lea0x1(%rdx),%rcx 0x000b <+11>:mov%rdx,%rax 0x000e <+14>:lock cmpxchg %rcx,(%rdi) 0x0013 <+19>:cmp%rax,%rdx 0x0016 <+22>:je 0x23 0x0018 <+24>:mov(%rdi),%rdx 0x001b <+27>:test %rdx,%rdx 0x001e <+30>:jns0x7 0x0020 <+32>:xor%eax,%eax 0x0022 <+34>:retq 0x0023 <+35>:mov%gs:0x0,%rax 0x002c <+44>:or $0x3,%rax 0x0030 <+48>:mov%rax,0x20(%rdi) 0x0034 <+52>:mov$0x1,%eax 0x0039 <+57>:retq After patch, down_read_trylock: 0x <+0>: callq 0x5 0x0005 <+5>: xor%eax,%eax 0x0007 <+7>: lea0x1(%rax),%rdx 0x000b <+11>:lock cmpxchg %rdx,(%rdi) 0x0010 <+16>:jne0x29 0x0012 <+18>:mov%gs:0x0,%rax 0x001b <+27>:or $0x3,%rax 0x001f <+31>:mov%rax,0x20(%rdi) 0x0023 <+35>:mov$0x1,%eax 0x0028 <+40>:retq 0x0029 <+41>:test %rax,%rax 0x002c <+44>:jns0x7 0x002e <+46>:xor%eax,%eax 0x0030 <+48>:retq By using a rwsem microbenchmark, the down_read_trylock() rate (with a load of 10 to lengthen the lock critical section) on a x86-64 system before and after the patch were: Before PatchAfter Patch # of Threads rlock rlock - - 1 14,496 14,716 28,644 8,453 46,799 6,983 85,664 7,190 On a ARM64 system, the performance results were: Before PatchAfter Patch # of Threads rlock rlock - - 1 23,676 24,488 27,697 9,502 44,945 3,440 82,641 1,603 For the uncontended case (1 thread), the new down_read_trylock() is a little bit faster. For the contended cases, the new down_read_trylock() perform pretty well in x86-64, but performance degrades at high contention level on ARM64. Suggested-by: Linus Torvalds Signed-off-by: Waiman Long --- kernel/locking/rwsem.h | 13 - 1 file changed, 8 insertions(+), 5 deletions(-) diff --git a/kernel/locking/rwsem.h b/kernel/locking/rwsem.h index 45ee00236e03..1f5775aa6a1d 100644 --- a/kernel/locking/rwsem.h +++ b/kernel/locking/rwsem.h @@ -174,14 +174,17 @@ static inline int __down_read_killable(struct rw_semaphore *sem) static inline int __down_read_trylock(struct rw_semaphore *sem) { - long tmp; + /* +* Optimize for the case when the rwsem is not locked at all. +*/ + long tmp = RWSEM_UNLOCKED_VALUE; - while ((tmp = atomic_long_read(>count)) >= 0) { - if (tmp == atomic_long_cmpxchg_acquire(>count, tmp, - tmp + RWSEM_ACTIVE_READ_BIAS)) { + do { + if (atomic_long_try_cmpxchg_acquire(>count, , + tmp + RWSEM_ACTIVE_READ_BIAS)) { return 1; } - } + } while (tmp >= 0); return 0; } -- 2.18.1
[PATCH v5 2/3] locking/rwsem: Remove rwsem-spinlock.c & use rwsem-xadd.c for all archs
Currently, we have two different implementation of rwsem: 1) CONFIG_RWSEM_GENERIC_SPINLOCK (rwsem-spinlock.c) 2) CONFIG_RWSEM_XCHGADD_ALGORITHM (rwsem-xadd.c) As we are going to use a single generic implementation for rwsem-xadd.c and no architecture-specific code will be needed, there is no point in keeping two different implementations of rwsem. In most cases, the performance of rwsem-spinlock.c will be worse. It also doesn't get all the performance tuning and optimizations that had been implemented in rwsem-xadd.c over the years. For simplication, we are going to remove rwsem-spinlock.c and make all architectures use a single implementation of rwsem - rwsem-xadd.c. All references to RWSEM_GENERIC_SPINLOCK and RWSEM_XCHGADD_ALGORITHM in the code are removed. Suggested-by: Peter Zijlstra Signed-off-by: Waiman Long --- arch/alpha/Kconfig | 7 - arch/arc/Kconfig| 3 - arch/arm/Kconfig| 4 - arch/arm64/Kconfig | 3 - arch/c6x/Kconfig| 3 - arch/csky/Kconfig | 3 - arch/h8300/Kconfig | 3 - arch/hexagon/Kconfig| 6 - arch/ia64/Kconfig | 4 - arch/m68k/Kconfig | 7 - arch/microblaze/Kconfig | 6 - arch/mips/Kconfig | 7 - arch/nds32/Kconfig | 3 - arch/nios2/Kconfig | 3 - arch/openrisc/Kconfig | 6 - arch/parisc/Kconfig | 6 - arch/powerpc/Kconfig| 7 - arch/riscv/Kconfig | 3 - arch/s390/Kconfig | 6 - arch/sh/Kconfig | 6 - arch/sparc/Kconfig | 8 - arch/unicore32/Kconfig | 6 - arch/x86/Kconfig| 3 - arch/x86/um/Kconfig | 6 - arch/xtensa/Kconfig | 3 - include/linux/rwsem-spinlock.h | 47 - include/linux/rwsem.h | 5 - kernel/Kconfig.locks| 2 +- kernel/locking/Makefile | 4 +- kernel/locking/rwsem-spinlock.c | 339 kernel/locking/rwsem.h | 3 - 31 files changed, 2 insertions(+), 520 deletions(-) delete mode 100644 include/linux/rwsem-spinlock.h delete mode 100644 kernel/locking/rwsem-spinlock.c diff --git a/arch/alpha/Kconfig b/arch/alpha/Kconfig index 584a6e114853..27c871227eee 100644 --- a/arch/alpha/Kconfig +++ b/arch/alpha/Kconfig @@ -49,13 +49,6 @@ config MMU bool default y -config RWSEM_GENERIC_SPINLOCK - bool - -config RWSEM_XCHGADD_ALGORITHM - bool - default y - config ARCH_HAS_ILOG2_U32 bool default n diff --git a/arch/arc/Kconfig b/arch/arc/Kconfig index c781e45d1d99..23e063df5d2c 100644 --- a/arch/arc/Kconfig +++ b/arch/arc/Kconfig @@ -63,9 +63,6 @@ config SCHED_OMIT_FRAME_POINTER config GENERIC_CSUM def_bool y -config RWSEM_GENERIC_SPINLOCK - def_bool y - config ARCH_DISCONTIGMEM_ENABLE def_bool n diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig index 054ead960f98..c11c61093c6c 100644 --- a/arch/arm/Kconfig +++ b/arch/arm/Kconfig @@ -178,10 +178,6 @@ config TRACE_IRQFLAGS_SUPPORT bool default !CPU_V7M -config RWSEM_XCHGADD_ALGORITHM - bool - default y - config ARCH_HAS_ILOG2_U32 bool diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index 7e34b9eba5de..c62b9db2b5e8 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -237,9 +237,6 @@ config LOCKDEP_SUPPORT config TRACE_IRQFLAGS_SUPPORT def_bool y -config RWSEM_XCHGADD_ALGORITHM - def_bool y - config GENERIC_BUG def_bool y depends on BUG diff --git a/arch/c6x/Kconfig b/arch/c6x/Kconfig index e5cd3c5f8399..ed92b5840c0a 100644 --- a/arch/c6x/Kconfig +++ b/arch/c6x/Kconfig @@ -27,9 +27,6 @@ config MMU config FPU def_bool n -config RWSEM_GENERIC_SPINLOCK - def_bool y - config GENERIC_CALIBRATE_DELAY def_bool y diff --git a/arch/csky/Kconfig b/arch/csky/Kconfig index 725a115759c9..6555d1781132 100644 --- a/arch/csky/Kconfig +++ b/arch/csky/Kconfig @@ -92,9 +92,6 @@ config GENERIC_HWEIGHT config MMU def_bool y -config RWSEM_GENERIC_SPINLOCK - def_bool y - config STACKTRACE_SUPPORT def_bool y diff --git a/arch/h8300/Kconfig b/arch/h8300/Kconfig index c071da34e081..61c01db6c292 100644 --- a/arch/h8300/Kconfig +++ b/arch/h8300/Kconfig @@ -27,9 +27,6 @@ config H8300 config CPU_BIG_ENDIAN def_bool y -config RWSEM_GENERIC_SPINLOCK - def_bool y - config GENERIC_HWEIGHT def_bool y diff --git a/arch/hexagon/Kconfig b/arch/hexagon/Kconfig index ac441680dcc0..3e54a53208d5 100644 --- a/arch/hexagon/Kconfig +++ b/arch/hexagon/Kconfig @@ -65,12 +65,6 @@ config GENERIC_CSUM config GENERIC_IRQ_PROBE def_bool y -config RWSEM_GENERIC_SPINLOCK - def_bool n - -config RWSEM_XCHGADD_ALGORITHM - def_bool y - config GENERIC_HWEIGHT
[PATCH v5 0/3] locking/rwsem: Rwsem rearchitecture part 0
v5: - Rebase to the latest v5.1 tree and fix conflicts in arch/{xtensa,s390}/include/asm/Kbuild. v4: - Remove rwsem-spinlock.c and make all archs use rwsem-xadd.c. v3: - Optimize __down_read_trylock() for the uncontended case as suggested by Linus. v2: - Add patch 2 to optimize __down_read_trylock() as suggested by PeterZ. - Update performance test data in patch 1. The goal of this patchset is to remove the architecture specific files for rwsem-xadd to make it easer to add enhancements in the later rwsem patches. It also removes the legacy rwsem-spinlock.c file and make all the architectures use one single implementation of rwsem - rwsem-xadd.c. Waiman Long (3): locking/rwsem: Remove arch specific rwsem files locking/rwsem: Remove rwsem-spinlock.c & use rwsem-xadd.c for all archs locking/rwsem: Optimize down_read_trylock() MAINTAINERS | 1 - arch/alpha/Kconfig | 7 - arch/alpha/include/asm/rwsem.h | 211 arch/arc/Kconfig| 3 - arch/arm/Kconfig| 4 - arch/arm/include/asm/Kbuild | 1 - arch/arm64/Kconfig | 3 - arch/arm64/include/asm/Kbuild | 1 - arch/c6x/Kconfig| 3 - arch/csky/Kconfig | 3 - arch/h8300/Kconfig | 3 - arch/hexagon/Kconfig| 6 - arch/hexagon/include/asm/Kbuild | 1 - arch/ia64/Kconfig | 4 - arch/ia64/include/asm/rwsem.h | 172 arch/m68k/Kconfig | 7 - arch/microblaze/Kconfig | 6 - arch/mips/Kconfig | 7 - arch/nds32/Kconfig | 3 - arch/nios2/Kconfig | 3 - arch/openrisc/Kconfig | 6 - arch/parisc/Kconfig | 6 - arch/powerpc/Kconfig| 7 - arch/powerpc/include/asm/Kbuild | 1 - arch/riscv/Kconfig | 3 - arch/s390/Kconfig | 6 - arch/s390/include/asm/Kbuild| 1 - arch/sh/Kconfig | 6 - arch/sh/include/asm/Kbuild | 1 - arch/sparc/Kconfig | 8 - arch/sparc/include/asm/Kbuild | 1 - arch/unicore32/Kconfig | 6 - arch/x86/Kconfig| 3 - arch/x86/include/asm/rwsem.h| 237 -- arch/x86/lib/Makefile | 1 - arch/x86/lib/rwsem.S| 156 --- arch/x86/um/Kconfig | 6 - arch/x86/um/Makefile| 1 - arch/xtensa/Kconfig | 3 - arch/xtensa/include/asm/Kbuild | 1 - include/asm-generic/rwsem.h | 140 - include/linux/rwsem-spinlock.h | 47 - include/linux/rwsem.h | 9 +- kernel/Kconfig.locks| 2 +- kernel/locking/Makefile | 4 +- kernel/locking/percpu-rwsem.c | 2 + kernel/locking/rwsem-spinlock.c | 339 kernel/locking/rwsem.h | 130 48 files changed, 135 insertions(+), 1447 deletions(-) delete mode 100644 arch/alpha/include/asm/rwsem.h delete mode 100644 arch/ia64/include/asm/rwsem.h delete mode 100644 arch/x86/include/asm/rwsem.h delete mode 100644 arch/x86/lib/rwsem.S delete mode 100644 include/asm-generic/rwsem.h delete mode 100644 include/linux/rwsem-spinlock.h delete mode 100644 kernel/locking/rwsem-spinlock.c -- 2.18.1
PRIVATE...
I have a business Proposal that will be of benefit to the both of us.Kindly contact me on mrmichealwu...@yahoo.com.hk should this be of interest to you.
Re: [PATCH v2 13/13] syscall_get_arch: add "struct task_struct *" argument
On Sun, Mar 17, 2019 at 7:30 PM Dmitry V. Levin wrote: > > This argument is required to extend the generic ptrace API with > PTRACE_GET_SYSCALL_INFO request: syscall_get_arch() is going > to be called from ptrace_request() along with syscall_get_nr(), > syscall_get_arguments(), syscall_get_error(), and > syscall_get_return_value() functions with a tracee as their argument. > > The primary intent is that the triple (audit_arch, syscall_nr, arg1..arg6) > should describe what system call is being called and what its arguments > are. > > Reverts: 5e937a9ae913 ("syscall_get_arch: remove useless function arguments") > Reverts: 1002d94d3076 ("syscall.h: fix doc text for syscall_get_arch()") > Reviewed-by: Andy Lutomirski # for x86 > Reviewed-by: Palmer Dabbelt > Acked-by: Paul Moore > Acked-by: Paul Burton # MIPS parts > Acked-by: Michael Ellerman (powerpc) > Acked-by: Kees Cook # seccomp parts > Acked-by: Mark Salter # for the c6x bit > Cc: Elvira Khabirova > Cc: Eugene Syromyatnikov > Cc: Oleg Nesterov > Cc: x...@kernel.org > Cc: linux-alpha@vger.kernel.org > Cc: linux-snps-...@lists.infradead.org > Cc: linux-arm-ker...@lists.infradead.org > Cc: linux-c6x-...@linux-c6x.org > Cc: uclinux-h8-de...@lists.sourceforge.jp > Cc: linux-hexa...@vger.kernel.org > Cc: linux-i...@vger.kernel.org > Cc: linux-m...@lists.linux-m68k.org > Cc: linux-m...@vger.kernel.org > Cc: nios2-...@lists.rocketboards.org > Cc: openr...@lists.librecores.org > Cc: linux-par...@vger.kernel.org > Cc: linuxppc-...@lists.ozlabs.org > Cc: linux-ri...@lists.infradead.org > Cc: linux-s...@vger.kernel.org > Cc: linux...@vger.kernel.org > Cc: sparcli...@vger.kernel.org > Cc: linux...@lists.infradead.org > Cc: linux-xte...@linux-xtensa.org > Cc: linux-a...@vger.kernel.org > Cc: linux-au...@redhat.com > Signed-off-by: Dmitry V. Levin > --- > > Notes: > v2: unchanged > > arch/alpha/include/asm/syscall.h | 2 +- > arch/arc/include/asm/syscall.h| 2 +- > arch/arm/include/asm/syscall.h| 2 +- > arch/arm64/include/asm/syscall.h | 4 ++-- > arch/c6x/include/asm/syscall.h| 2 +- > arch/csky/include/asm/syscall.h | 2 +- > arch/h8300/include/asm/syscall.h | 2 +- > arch/hexagon/include/asm/syscall.h| 2 +- > arch/ia64/include/asm/syscall.h | 2 +- > arch/m68k/include/asm/syscall.h | 2 +- > arch/microblaze/include/asm/syscall.h | 2 +- > arch/mips/include/asm/syscall.h | 6 +++--- > arch/mips/kernel/ptrace.c | 2 +- > arch/nds32/include/asm/syscall.h | 2 +- > arch/nios2/include/asm/syscall.h | 2 +- > arch/openrisc/include/asm/syscall.h | 2 +- > arch/parisc/include/asm/syscall.h | 4 ++-- > arch/powerpc/include/asm/syscall.h| 10 -- > arch/riscv/include/asm/syscall.h | 2 +- > arch/s390/include/asm/syscall.h | 4 ++-- > arch/sh/include/asm/syscall_32.h | 2 +- > arch/sh/include/asm/syscall_64.h | 2 +- > arch/sparc/include/asm/syscall.h | 5 +++-- > arch/unicore32/include/asm/syscall.h | 2 +- > arch/x86/include/asm/syscall.h| 8 +--- > arch/x86/um/asm/syscall.h | 2 +- > arch/xtensa/include/asm/syscall.h | 2 +- > include/asm-generic/syscall.h | 5 +++-- > kernel/auditsc.c | 4 ++-- > kernel/seccomp.c | 4 ++-- > 30 files changed, 52 insertions(+), 42 deletions(-) Merged into audit/next, thanks everyone. -- paul moore www.paul-moore.com
Re: [PATCH] y2038: fix socket.h header inclusion
* Arnd Bergmann: > Ok, so not '__fds_bits'. > > Is '__kernel_fds_bits' ok? I would prefer to keep at least the > name __kernel_ namespace that we have for typedefs and the > occasional struct tag. glibc should be okay with that. We use __kernel_ in the math libraries for something completely different, but those files do not (or should not) include UAPI headers, and in any case, the set of such identifiers is really small.
Re: [PATCH] y2038: fix socket.h header inclusion
On Mon, Mar 18, 2019 at 2:12 PM Florian Weimer wrote: > > On Mon, Mar 18, 2019 at 10:25 AM Florian Weimer wrote: > >> > >> * Arnd Bergmann: > >> > >> > Should we just remove __kernel_fd_set from the exported headers and > >> > define the internal fd_set directly in include/linux/types.h? (Adding the > >> > folks from the old thread to Cc). > >> > >> The type is used in the sanitizers, but incorrectly. They assume that > >> FD_SETSIZE is always 1024. (The existence of __kernel_fd_set is > >> itself somewhat questionable because it leads to such bugs.) > >> Moving around the type could cause a build failure in the sanitizers, but > >> I'm > >> not entirely clear how the UAPI headers are included there. > > > > It looks like sanitizer_platform_limits_posix.cc includes > > linux/posix_types.h to ensure that __kernel_fd_set is the same > > size as __sanitizer___kernel_fd_set, and then it uses the > > latter afterwards. > > > > What I don't see here is what kind of operation is actually done > > on the data, I only see a cast to void. > > I think it is used to assert that the select family of system calls > writes to the 1024 bits for each of the passed pointers. Yes, that is what I expected to see in libsanitizer, I just couldn't find any code that actually does this check. > Which is not actually true—the write size is controlled by the > file descriptor count argument. Yes, of course. In fact, I see multiple possible problems that - kernel reading uninitialized data if 'FD_ZERO()' was used with a shorter size than the count argument. - kernel writing beyond the fd_set data on stack when the declaration had a shorter size than the count argument. Each one could happen either because __FD_SETSIZE is smaller than 'count', or because kernel and user space disagree on the element size (32 vs 64 bit on x32). > > If libsanitizer actually does > > anything interesting here, we should definitely fix it to use the > > correct size, especially since this is actually something that > > can trigger a buffer overflow in subtle ways when used carelessly. > > See for example [1], which we still have not addressed > > The footnote is missing. Sorry, I meant [1] https://patchwork.kernel.org/patch/10245053/ > > For this specific use (and probably others like it), renaming the > > fds_bits member to __kernel_fds_bits or something like that > > would keep user space still compiling. That would only break > > if someone was using __kernel_fd_set, and actually doing > > bit operations on it. glibc uses '__fds_bits' unless __USE_XOPEN > > is set, so maybe we should use use that name unconditionally. > > Please use something that is more obviously Linux-specific. Ok, so not '__fds_bits'. Is '__kernel_fds_bits' ok? I would prefer to keep at least the name __kernel_ namespace that we have for typedefs and the occasional struct tag. Arnd
Re: [PATCH] y2038: fix socket.h header inclusion
* Arnd Bergmann: > On Mon, Mar 18, 2019 at 10:25 AM Florian Weimer wrote: >> >> * Arnd Bergmann: >> >> > Should we just remove __kernel_fd_set from the exported headers and >> > define the internal fd_set directly in include/linux/types.h? (Adding the >> > folks from the old thread to Cc). >> >> The type is used in the sanitizers, but incorrectly. They assume that >> FD_SETSIZE is always 1024. (The existence of __kernel_fd_set is >> itself somewhat questionable because it leads to such bugs.) >> Moving around the type could cause a build failure in the sanitizers, but I'm >> not entirely clear how the UAPI headers are included there. > > It looks like sanitizer_platform_limits_posix.cc includes > linux/posix_types.h to ensure that __kernel_fd_set is the same > size as __sanitizer___kernel_fd_set, and then it uses the > latter afterwards. > > What I don't see here is what kind of operation is actually done > on the data, I only see a cast to void. I think it is used to assert that the select family of system calls writes to the 1024 bits for each of the passed pointers. Which is not actually true—the write size is controlled by the file descriptor count argument. > If libsanitizer actually does > anything interesting here, we should definitely fix it to use the > correct size, especially since this is actually something that > can trigger a buffer overflow in subtle ways when used carelessly. > See for example [1], which we still have not addressed The footnote is missing. > For this specific use (and probably others like it), renaming the > fds_bits member to __kernel_fds_bits or something like that > would keep user space still compiling. That would only break > if someone was using __kernel_fd_set, and actually doing > bit operations on it. glibc uses '__fds_bits' unless __USE_XOPEN > is set, so maybe we should use use that name unconditionally. Please use something that is more obviously Linux-specific.
Re: [PATCH] y2038: fix socket.h header inclusion
* Arnd Bergmann: > Should we just remove __kernel_fd_set from the exported headers and > define the internal fd_set directly in include/linux/types.h? (Adding the > folks from the old thread to Cc). The type is used in the sanitizers, but incorrectly. They assume that FD_SETSIZE is always 1024. (The existence of __kernel_fd_set is itself somewhat questionable because it leads to such bugs.) Moving around the type could cause a build failure in the sanitizers, but I'm not entirely clear how the UAPI headers are included there. Otherwise, I couldn't find any uses.
Re: [PATCH] y2038: fix socket.h header inclusion
On Sun, Mar 17, 2019 at 7:20 PM Deepa Dinamani wrote: > On Fri, Mar 15, 2019 at 2:20 PM Florian Weimer wrote: > > > On Thu, Mar 14, 2019 at 7:41 PM Florian Weimer wrote: > > >> > diff --git a/arch/alpha/include/uapi/asm/socket.h > > >> > b/arch/alpha/include/uapi/asm/socket.h > > >> > index 0d0fddb7e738..976e89b116e5 100644 > > >> > --- a/arch/alpha/include/uapi/asm/socket.h > > >> > +++ b/arch/alpha/include/uapi/asm/socket.h > > >> > @@ -2,8 +2,8 @@ > > >> > #ifndef _UAPI_ASM_SOCKET_H > > >> > #define _UAPI_ASM_SOCKET_H > > >> > > > >> > +#include > > >> > #include > > >> > -#include > > >> > > >> This breaks POSIX conformance in glibc because the > > >> header is not namespace clean. It contains the > > >> identifiers fds_bits and val: > > >> > > >> unsigned long fds_bits[__FD_SETSIZE / (8 * sizeof(long))]; > > >> > > >> int val[2]; > > > > > > What is problematic about the struct members here? I had thought that > > > only the struct names have to be in a namespace to be usable here, > > > but not the members. > > > > According POSIX, a user can do this: > > > > #define fds_bits 1024 > > > > before including the header file. Similarly for val. > > > > Since glibc pulls in indirectly, the result is a parse > > error, even though the programmer did nothing wrong (fds_bits is not > > an identifier used by POSIX, nor is it in the implementation > > namespace, ans is a POSIX header). Ok, I see. Thanks for the explanation! > > > We could use asm/posix_types.h instead of linux/posix_types.h, > > > would that address your concern? > > > > It should fix the fds_bits case, I think. But > > still uses val, so that part of the issue > > remains. > > Would moving kernel namespace types(__kernel prefix) to a different > header file(kernel_types.h?) and then including this from > linux/posix_types.h. > And, for socket.h just including kernel_types.h make sense? I fear we have considered linux/posix_types.h to be something that can be included anywhere for a long time, so it may be better to ensure that this is actually the case, and avoid the problem with those two structures but leave the rest untouched. I think we can move __kernel_fsid_t into include/uapi/asm-generic/statfs.h, which is the only thing that needs it anyway. We have two definitions of it today, the non-generic one being for mips32, but incidentally there was a patch the other day to remove that and use the generic one instead. With that done, we can change asm/socket.h to just use asm/posix_types.h. I would still prefer to solve the problem for linux/posix_types.h as well, but I'm not sure even how __kernel_fd_set is used today in user space, if at all. Commit 8ded2bbc1845 ("posix_types.h: Cleanup stale __NFDBITS and related definitions") removed most of the fd_set definition after a long discussion [1], and since then it has been basically impossible to use 'struct fd_set' from the kernel in a meaningful way without including the libc headers or duplicating them. Should we just remove __kernel_fd_set from the exported headers and define the internal fd_set directly in include/linux/types.h? (Adding the folks from the old thread to Cc). Arnd [1] https://lore.kernel.org/lkml/20120724181209.ga10...@zod.bos.redhat.com/t/
[PATCH v2 13/13] syscall_get_arch: add "struct task_struct *" argument
This argument is required to extend the generic ptrace API with PTRACE_GET_SYSCALL_INFO request: syscall_get_arch() is going to be called from ptrace_request() along with syscall_get_nr(), syscall_get_arguments(), syscall_get_error(), and syscall_get_return_value() functions with a tracee as their argument. The primary intent is that the triple (audit_arch, syscall_nr, arg1..arg6) should describe what system call is being called and what its arguments are. Reverts: 5e937a9ae913 ("syscall_get_arch: remove useless function arguments") Reverts: 1002d94d3076 ("syscall.h: fix doc text for syscall_get_arch()") Reviewed-by: Andy Lutomirski # for x86 Reviewed-by: Palmer Dabbelt Acked-by: Paul Moore Acked-by: Paul Burton # MIPS parts Acked-by: Michael Ellerman (powerpc) Acked-by: Kees Cook # seccomp parts Acked-by: Mark Salter # for the c6x bit Cc: Elvira Khabirova Cc: Eugene Syromyatnikov Cc: Oleg Nesterov Cc: x...@kernel.org Cc: linux-alpha@vger.kernel.org Cc: linux-snps-...@lists.infradead.org Cc: linux-arm-ker...@lists.infradead.org Cc: linux-c6x-...@linux-c6x.org Cc: uclinux-h8-de...@lists.sourceforge.jp Cc: linux-hexa...@vger.kernel.org Cc: linux-i...@vger.kernel.org Cc: linux-m...@lists.linux-m68k.org Cc: linux-m...@vger.kernel.org Cc: nios2-...@lists.rocketboards.org Cc: openr...@lists.librecores.org Cc: linux-par...@vger.kernel.org Cc: linuxppc-...@lists.ozlabs.org Cc: linux-ri...@lists.infradead.org Cc: linux-s...@vger.kernel.org Cc: linux...@vger.kernel.org Cc: sparcli...@vger.kernel.org Cc: linux...@lists.infradead.org Cc: linux-xte...@linux-xtensa.org Cc: linux-a...@vger.kernel.org Cc: linux-au...@redhat.com Signed-off-by: Dmitry V. Levin --- Notes: v2: unchanged arch/alpha/include/asm/syscall.h | 2 +- arch/arc/include/asm/syscall.h| 2 +- arch/arm/include/asm/syscall.h| 2 +- arch/arm64/include/asm/syscall.h | 4 ++-- arch/c6x/include/asm/syscall.h| 2 +- arch/csky/include/asm/syscall.h | 2 +- arch/h8300/include/asm/syscall.h | 2 +- arch/hexagon/include/asm/syscall.h| 2 +- arch/ia64/include/asm/syscall.h | 2 +- arch/m68k/include/asm/syscall.h | 2 +- arch/microblaze/include/asm/syscall.h | 2 +- arch/mips/include/asm/syscall.h | 6 +++--- arch/mips/kernel/ptrace.c | 2 +- arch/nds32/include/asm/syscall.h | 2 +- arch/nios2/include/asm/syscall.h | 2 +- arch/openrisc/include/asm/syscall.h | 2 +- arch/parisc/include/asm/syscall.h | 4 ++-- arch/powerpc/include/asm/syscall.h| 10 -- arch/riscv/include/asm/syscall.h | 2 +- arch/s390/include/asm/syscall.h | 4 ++-- arch/sh/include/asm/syscall_32.h | 2 +- arch/sh/include/asm/syscall_64.h | 2 +- arch/sparc/include/asm/syscall.h | 5 +++-- arch/unicore32/include/asm/syscall.h | 2 +- arch/x86/include/asm/syscall.h| 8 +--- arch/x86/um/asm/syscall.h | 2 +- arch/xtensa/include/asm/syscall.h | 2 +- include/asm-generic/syscall.h | 5 +++-- kernel/auditsc.c | 4 ++-- kernel/seccomp.c | 4 ++-- 30 files changed, 52 insertions(+), 42 deletions(-) diff --git a/arch/alpha/include/asm/syscall.h b/arch/alpha/include/asm/syscall.h index d73a6fcb519c..11c688c1d7ec 100644 --- a/arch/alpha/include/asm/syscall.h +++ b/arch/alpha/include/asm/syscall.h @@ -4,7 +4,7 @@ #include -static inline int syscall_get_arch(void) +static inline int syscall_get_arch(struct task_struct *task) { return AUDIT_ARCH_ALPHA; } diff --git a/arch/arc/include/asm/syscall.h b/arch/arc/include/asm/syscall.h index c7fc4c0c3bcb..caf2697ef5b7 100644 --- a/arch/arc/include/asm/syscall.h +++ b/arch/arc/include/asm/syscall.h @@ -70,7 +70,7 @@ syscall_get_arguments(struct task_struct *task, struct pt_regs *regs, } static inline int -syscall_get_arch(void) +syscall_get_arch(struct task_struct *task) { return IS_ENABLED(CONFIG_ISA_ARCOMPACT) ? (IS_ENABLED(CONFIG_CPU_BIG_ENDIAN) diff --git a/arch/arm/include/asm/syscall.h b/arch/arm/include/asm/syscall.h index 06dea6bce293..3940ceac0bdc 100644 --- a/arch/arm/include/asm/syscall.h +++ b/arch/arm/include/asm/syscall.h @@ -104,7 +104,7 @@ static inline void syscall_set_arguments(struct task_struct *task, memcpy(>ARM_r0 + i, args, n * sizeof(args[0])); } -static inline int syscall_get_arch(void) +static inline int syscall_get_arch(struct task_struct *task) { /* ARM tasks don't change audit architectures on the fly. */ return AUDIT_ARCH_ARM; diff --git a/arch/arm64/include/asm/syscall.h b/arch/arm64/include/asm/syscall.h index ad8be16a39c9..1870df03f774 100644 --- a/arch/arm64/include/asm/syscall.h +++ b/arch/arm64/include/asm/syscall.h @@ -117,9 +117,9 @@ static inline void syscall_set_arguments(struct task_struct *task, * We don't care about endianness (__AUDIT_ARCH_LE bit) here because * AArch64 has the same system
Re: [PATCH] y2038: fix socket.h header inclusion
On Fri, Mar 15, 2019 at 2:20 PM Florian Weimer wrote: > > * Arnd Bergmann: > > > On Thu, Mar 14, 2019 at 7:41 PM Florian Weimer wrote: > >> > >> * Arnd Bergmann: > >> > >> > diff --git a/arch/alpha/include/uapi/asm/socket.h > >> > b/arch/alpha/include/uapi/asm/socket.h > >> > index 0d0fddb7e738..976e89b116e5 100644 > >> > --- a/arch/alpha/include/uapi/asm/socket.h > >> > +++ b/arch/alpha/include/uapi/asm/socket.h > >> > @@ -2,8 +2,8 @@ > >> > #ifndef _UAPI_ASM_SOCKET_H > >> > #define _UAPI_ASM_SOCKET_H > >> > > >> > +#include > >> > #include > >> > -#include > >> > >> This breaks POSIX conformance in glibc because the > >> header is not namespace clean. It contains the > >> identifiers fds_bits and val: > >> > >> unsigned long fds_bits[__FD_SETSIZE / (8 * sizeof(long))]; > >> > >> int val[2]; > > > > What is problematic about the struct members here? I had thought that > > only the struct names have to be in a namespace to be usable here, > > but not the members. > > According POSIX, a user can do this: > > #define fds_bits 1024 > > before including the header file. Similarly for val. > > Since glibc pulls in indirectly, the result is a parse > error, even though the programmer did nothing wrong (fds_bits is not > an identifier used by POSIX, nor is it in the implementation > namespace, ans is a POSIX header). > > > We could use asm/posix_types.h instead of linux/posix_types.h, > > would that address your concern? > > It should fix the fds_bits case, I think. But > still uses val, so that part of the issue > remains. Would moving kernel namespace types(__kernel prefix) to a different header file(kernel_types.h?) and then including this from linux/posix_types.h. And, for socket.h just including kernel_types.h make sense? -Deepa
Re: [PATCH] y2038: fix socket.h header inclusion
* Arnd Bergmann: > On Thu, Mar 14, 2019 at 7:41 PM Florian Weimer wrote: >> >> * Arnd Bergmann: >> >> > diff --git a/arch/alpha/include/uapi/asm/socket.h >> > b/arch/alpha/include/uapi/asm/socket.h >> > index 0d0fddb7e738..976e89b116e5 100644 >> > --- a/arch/alpha/include/uapi/asm/socket.h >> > +++ b/arch/alpha/include/uapi/asm/socket.h >> > @@ -2,8 +2,8 @@ >> > #ifndef _UAPI_ASM_SOCKET_H >> > #define _UAPI_ASM_SOCKET_H >> > >> > +#include >> > #include >> > -#include >> >> This breaks POSIX conformance in glibc because the >> header is not namespace clean. It contains the >> identifiers fds_bits and val: >> >> unsigned long fds_bits[__FD_SETSIZE / (8 * sizeof(long))]; >> >> int val[2]; > > What is problematic about the struct members here? I had thought that > only the struct names have to be in a namespace to be usable here, > but not the members. According POSIX, a user can do this: #define fds_bits 1024 before including the header file. Similarly for val. Since glibc pulls in indirectly, the result is a parse error, even though the programmer did nothing wrong (fds_bits is not an identifier used by POSIX, nor is it in the implementation namespace, ans is a POSIX header). > We could use asm/posix_types.h instead of linux/posix_types.h, > would that address your concern? It should fix the fds_bits case, I think. But still uses val, so that part of the issue remains.
Re: [PATCH] y2038: fix socket.h header inclusion
On Thu, Mar 14, 2019 at 7:41 PM Florian Weimer wrote: > > * Arnd Bergmann: > > > diff --git a/arch/alpha/include/uapi/asm/socket.h > > b/arch/alpha/include/uapi/asm/socket.h > > index 0d0fddb7e738..976e89b116e5 100644 > > --- a/arch/alpha/include/uapi/asm/socket.h > > +++ b/arch/alpha/include/uapi/asm/socket.h > > @@ -2,8 +2,8 @@ > > #ifndef _UAPI_ASM_SOCKET_H > > #define _UAPI_ASM_SOCKET_H > > > > +#include > > #include > > -#include > > This breaks POSIX conformance in glibc because the > header is not namespace clean. It contains the > identifiers fds_bits and val: > > unsigned long fds_bits[__FD_SETSIZE / (8 * sizeof(long))]; > > int val[2]; What is problematic about the struct members here? I had thought that only the struct names have to be in a namespace to be usable here, but not the members. The only part that might be problematic is #undef __FD_SETSIZE #define __FD_SETSIZE1024 but we already get that from a number of other inclusions of linux/posix_types.h. Is this what you mean? > We could duplicate some of the SO_* constants for POSIX mode in glibc, > but it would be nice to avoid that. > > Is there a different way of fixing this on the kernel side that avoids > including ? We could use asm/posix_types.h instead of linux/posix_types.h, would that address your concern? Arnd
HI
-- I am Captain JAMES WILLIAMS from united states, but I am currently in Syria for peace keeping mission. I am the commanding officer of the third Battalion soldier regime. Please forgive my manners I am not good when it comes to the Internet because that is not really my field. I came across your email confidently when I was searching for a reliable person and I suddenly got your email address and I guess it's for a good purpose. Here in Syria we are not allowed to go out that makes it very bored for me so I just think I need a friend to talk to outside to keep me going. I would like to know more about you as a friend. Respectfully, James williams --
Re: [PATCH] x86: Deprecate a.out support
On Tue, Mar 12, 2019 at 9:46 AM Geert Uytterhoeven wrote: > > Yeah, the alphas on the server side, powering AltaVista, are also long > gone... As usual with these things, people can still use older Linux releases for a very long time. If they really need it (e.g. commercially), they have the reference code and can bring it up to speed again relatively easily. It is not as if they have no way to submit it back again. Cheers, Miguel
Re: [PATCH] x86: Deprecate a.out support
On Mon, Mar 11, 2019 at 10:46 PM Linus Torvalds wrote: > On Mon, Mar 11, 2019 at 2:34 PM Arnd Bergmann wrote: > > The main historic use case I've heard of was running Netscape > > Navigator on Alpha Linux, before there was an open source version. > > Doing this today to connect to the open internet is probably > > a bit pointless, but there may be other use cases. > > The _really_ main version was that I decided to make my life easier > for the initial alpha port by trying to run basic (tested) OSF/1 > binaries directly. > > Netscape may have been one of the binaries people actually ended up > using, but it's probably not a reason any more, since the internet has > moved past that anyway. Yeah, the alphas on the server side, powering AltaVista, are also long gone... Gr{oetje,eeting}s, Geert -- Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that. -- Linus Torvalds
Re: [PATCH] x86: Deprecate a.out support
On Mon, Mar 11, 2019 at 03:11:55PM -0700, Matt Turner wrote: > On Mon, Mar 11, 2019 at 2:34 PM Arnd Bergmann wrote: > > On Mon, Mar 11, 2019 at 8:47 PM Måns Rullgård wrote: > > > Linus Torvalds writes: > > > > On Mon, Mar 11, 2019 at 11:08 AM Måns Rullgård wrote: > > > > We don't have any specific support for ECOFF. > > > > > > > > I _think_. Again, it's been years and years. I agree. I personally have never run any OSF/1 executables on Linux Alpha and have no interest in doing so. > > The main historic use case I've heard of was running Netscape > > Navigator on Alpha Linux, before there was an open source version. > > Doing this today to connect to the open internet is probably > > a bit pointless, but there may be other use cases. > > The best use case I know of is to run their C compiler. Måns sent > patches in fact to make it work. > > There is a Linux version of the same compiler but I have a vague > memory that it's broken in various ways that the Tru64 version is > not. The last time I tried the Compaq C compiler for Alpha-Linux it still worked, well, that is, the compiler worked, but the library header files are broken and haven't worked with glibc for a long time. So it is only useful as a free-standing compiler. In the past it also produced better code than gcc, but gcc is now so vastly improved w.r.t. optimisation and compliance to more recent standards, that I would be surprised if there is any real use for the Compaq compiler. Cheers, Michael.
Re: [PATCH] x86: Deprecate a.out support
Linus Torvalds writes: > On Mon, Mar 11, 2019 at 2:34 PM Arnd Bergmann wrote: >> >> The main historic use case I've heard of was running Netscape >> Navigator on Alpha Linux, before there was an open source version. >> Doing this today to connect to the open internet is probably >> a bit pointless, but there may be other use cases. > > The _really_ main version was that I decided to make my life easier > for the initial alpha port by trying to run basic (tested) OSF/1 > binaries directly. > > Netscape may have been one of the binaries people actually ended up > using, but it's probably not a reason any more, since the internet has > moved past that anyway. > >> Looking at the system call table in the kernel >> (arch/alpha/kernel/syscalls/syscall.tbl), we seem to support a >> specific subset that was required for a set of applications, and >> not much more. > > Yeah, it never supported arbitrary binaries, particularly since > there's often lots of other issues too with running things like that > (ie filesystem layout etc). It worked for normal fairly well behaved > stuff, but wasn't ever a full OSF/1 emulation environment. > > I _suspect_ nobody actually runs any OSF/1 binaries any more, but it > would obviously be good to verify that. Your argument that timeval > handling was broken _may_ be an indication of that (or may just mean > very few apps care). Does it count if I fire up an Alpha and run a few OSF/1 binaries right now? :-) -- Måns Rullgård
Re: [PATCH] x86: Deprecate a.out support
On Mon, Mar 11, 2019 at 2:34 PM Arnd Bergmann wrote: > > On Mon, Mar 11, 2019 at 8:47 PM Måns Rullgård wrote: > > Linus Torvalds writes: > > > On Mon, Mar 11, 2019 at 11:08 AM Måns Rullgård wrote: > > >> > > >> The latest version I have is 5.1, and that uses ECOFF. > > > > > > ECOFF _is_ a.out as far as Linux is concerned. > > > > > > So Linux basically treats ECOFF as "regular a.out with just some > > > header extensions". > > > > > > We don't have any specific support for ECOFF. > > > > > > I _think_. Again, it's been years and years. > > > > Right, which is why killing a.out entirely would have the unfortunate > > effect of also removing the OSF/1 compatibility on Alpha. > > > > If we are to support Alpha as an architecture at all, it makes sense to > > support the things people actually use it for. > > > > Now, personally I can live without it. I just don't like to see > > features removed without due consideration. > > The main historic use case I've heard of was running Netscape > Navigator on Alpha Linux, before there was an open source version. > Doing this today to connect to the open internet is probably > a bit pointless, but there may be other use cases. The best use case I know of is to run their C compiler. Måns sent patches in fact to make it work. There is a Linux version of the same compiler but I have a vague memory that it's broken in various ways that the Tru64 version is not.
Re: [PATCH] x86: Deprecate a.out support
Arnd Bergmann writes: > On Mon, Mar 11, 2019 at 8:47 PM Måns Rullgård wrote: >> Linus Torvalds writes: >> > On Mon, Mar 11, 2019 at 11:08 AM Måns Rullgård wrote: >> >> >> >> The latest version I have is 5.1, and that uses ECOFF. >> > >> > ECOFF _is_ a.out as far as Linux is concerned. >> > >> > So Linux basically treats ECOFF as "regular a.out with just some >> > header extensions". >> > >> > We don't have any specific support for ECOFF. >> > >> > I _think_. Again, it's been years and years. >> >> Right, which is why killing a.out entirely would have the unfortunate >> effect of also removing the OSF/1 compatibility on Alpha. >> >> If we are to support Alpha as an architecture at all, it makes sense to >> support the things people actually use it for. >> >> Now, personally I can live without it. I just don't like to see >> features removed without due consideration. > > The main historic use case I've heard of was running Netscape > Navigator on Alpha Linux, before there was an open source version. > Doing this today to connect to the open internet is probably > a bit pointless, but there may be other use cases. Once upon a time, I used it to run Matlab. > Looking at the system call table in the kernel > (arch/alpha/kernel/syscalls/syscall.tbl), we seem to support a > specific subset that was required for a set of applications, and > not much more. Old system calls (osf_old_open, osf_execve, > osf_old_sigaction) are listed but not implemented, and the same > is true for most of the later calls (osf_fuser, osf_sigsendset, > osf_waitid, osf_signal, ...), just the ones in the middle are there. > This would also indicate that it never really worked as a > general-purpose emulation layer but was only there for a specific > set of applications. It works for many applications, though I did have to add a few syscalls myself (yes, I sent patches). > Another data point I have is that osf1 emulation was broken > between linux-4.13 and linux-4.16 without anyone noticing, see > 47669fb6b595 ("alpha: osf_sys.c: fix put_tv32 regression"). That's interesting, but it doesn't mean nobody is using it. I tend to run the LTS branches and switch to a new one only once it seems to have settled a bit, so when 4.16 was released, I was probably still running 4.9. I don't think I'm the only one using this strategy. -- Måns Rullgård
Re: [PATCH] x86: Deprecate a.out support
On Mon, Mar 11, 2019 at 2:34 PM Arnd Bergmann wrote: > > The main historic use case I've heard of was running Netscape > Navigator on Alpha Linux, before there was an open source version. > Doing this today to connect to the open internet is probably > a bit pointless, but there may be other use cases. The _really_ main version was that I decided to make my life easier for the initial alpha port by trying to run basic (tested) OSF/1 binaries directly. Netscape may have been one of the binaries people actually ended up using, but it's probably not a reason any more, since the internet has moved past that anyway. > Looking at the system call table in the kernel > (arch/alpha/kernel/syscalls/syscall.tbl), we seem to support a > specific subset that was required for a set of applications, and > not much more. Yeah, it never supported arbitrary binaries, particularly since there's often lots of other issues too with running things like that (ie filesystem layout etc). It worked for normal fairly well behaved stuff, but wasn't ever a full OSF/1 emulation environment. I _suspect_ nobody actually runs any OSF/1 binaries any more, but it would obviously be good to verify that. Your argument that timeval handling was broken _may_ be an indication of that (or may just mean very few apps care). I think we should try the a.out removal and see if anybody notices. Linus
Re: [PATCH] x86: Deprecate a.out support
On Mon, Mar 11, 2019 at 8:47 PM Måns Rullgård wrote: > Linus Torvalds writes: > > On Mon, Mar 11, 2019 at 11:08 AM Måns Rullgård wrote: > >> > >> The latest version I have is 5.1, and that uses ECOFF. > > > > ECOFF _is_ a.out as far as Linux is concerned. > > > > So Linux basically treats ECOFF as "regular a.out with just some > > header extensions". > > > > We don't have any specific support for ECOFF. > > > > I _think_. Again, it's been years and years. > > Right, which is why killing a.out entirely would have the unfortunate > effect of also removing the OSF/1 compatibility on Alpha. > > If we are to support Alpha as an architecture at all, it makes sense to > support the things people actually use it for. > > Now, personally I can live without it. I just don't like to see > features removed without due consideration. The main historic use case I've heard of was running Netscape Navigator on Alpha Linux, before there was an open source version. Doing this today to connect to the open internet is probably a bit pointless, but there may be other use cases. Looking at the system call table in the kernel (arch/alpha/kernel/syscalls/syscall.tbl), we seem to support a specific subset that was required for a set of applications, and not much more. Old system calls (osf_old_open, osf_execve, osf_old_sigaction) are listed but not implemented, and the same is true for most of the later calls (osf_fuser, osf_sigsendset, osf_waitid, osf_signal, ...), just the ones in the middle are there. This would also indicate that it never really worked as a general-purpose emulation layer but was only there for a specific set of applications. Another data point I have is that osf1 emulation was broken between linux-4.13 and linux-4.16 without anyone noticing, see 47669fb6b595 ("alpha: osf_sys.c: fix put_tv32 regression"). Arnd
Re: [PATCH] x86: Deprecate a.out support
On Mon, Mar 11, 2019 at 12:47 PM Måns Rullgård wrote: > > Linus Torvalds writes: > > > On Mon, Mar 11, 2019 at 11:08 AM Måns Rullgård wrote: > >> > >> The latest version I have is 5.1, and that uses ECOFF. > > > > ECOFF _is_ a.out as far as Linux is concerned. > > > > So Linux basically treats ECOFF as "regular a.out with just some > > header extensions". > > > > We don't have any specific support for ECOFF. > > > > I _think_. Again, it's been years and years. > > Right, which is why killing a.out entirely would have the unfortunate > effect of also removing the OSF/1 compatibility on Alpha. > > If we are to support Alpha as an architecture at all, it makes sense to > support the things people actually use it for. I agree. I was not aware that a.out was effectively the same as ECOFF.
Re: [PATCH] x86: Deprecate a.out support
Linus Torvalds writes: > On Mon, Mar 11, 2019 at 11:08 AM Måns Rullgård wrote: >> >> The latest version I have is 5.1, and that uses ECOFF. > > ECOFF _is_ a.out as far as Linux is concerned. > > So Linux basically treats ECOFF as "regular a.out with just some > header extensions". > > We don't have any specific support for ECOFF. > > I _think_. Again, it's been years and years. Right, which is why killing a.out entirely would have the unfortunate effect of also removing the OSF/1 compatibility on Alpha. If we are to support Alpha as an architecture at all, it makes sense to support the things people actually use it for. Now, personally I can live without it. I just don't like to see features removed without due consideration. -- Måns Rullgård
Re: [PATCH] x86: Deprecate a.out support
On Mon, Mar 11, 2019 at 11:08 AM Måns Rullgård wrote: > > The latest version I have is 5.1, and that uses ECOFF. ECOFF _is_ a.out as far as Linux is concerned. So Linux basically treats ECOFF as "regular a.out with just some header extensions". We don't have any specific support for ECOFF. I _think_. Again, it's been years and years. Linus
Re: [PATCH] x86: Deprecate a.out support
On Mon, Mar 11, 2019 at 9:26 AM Måns Rullgård wrote: > > Linus Torvalds writes: > > > On Sun, Mar 10, 2019 at 2:37 PM Matt Turner wrote: > >> > >> I'm not aware of a reason to keep a.out support on alpha. > > > > Hmm. I was looking at removing a.out support entirely, but it's > > actually fairly incestuous on alpha. > > > > For example, arch/alpha/boot/tools/objstrip.c very much has some a.out > > support in it. Maybe it can just be removed entirely. > > > > There's also an a.out.h include in arch/alpha/kernel/binfmt_loader.c. > > > > Finally, note that CONFIG_OSF4_COMPAT also no longer makes sense > > without a.out support. > > Anyone running an Alpha machine likely also has some old OSF/1 binaries > they may wish to use. It would be a shame to remove this feature, IMO. Tru64 5.1 uses ECOFF binaries, I believe. Do you know when OSF/1 / Digital UNIX / Tru64 switched from a.out to ECOFF?
Re: [PATCH] x86: Deprecate a.out support
Linus Torvalds writes: > On Mon, Mar 11, 2019 at 9:26 AM Måns Rullgård wrote: >> >> Anyone running an Alpha machine likely also has some old OSF/1 binaries >> they may wish to use. It would be a shame to remove this feature, IMO. > > If that's the case then we'd have to keep a.out alive for alpha, since > that's the OSF/1 binary format (at least the only one we support - I'm > not sure if later versions of OSF/1 ended up getting ELF). The latest version I have is 5.1, and that uses ECOFF. > Which I guess we could do, but the question is whether people really > do have OSF/1 binaries. It was really useful early on as a source of > known-good binaries to test with, but I'm not convinced it's still in > use. > > It's not like there were OSF/1 binaries that we didn't havce access to > natively (well, there _were_ special ones that didn't have open source > versions, but most of them required more system-side support than > Linux ever implemented, afaik). I don't have any specific examples, but I can well imagine people keeping an Alpha machine for no other reason than the ability to run some (old) application only available (to them) for OSF/1. Running them on Linux rather than Tru64 brings the advantage of being a modern system in other regards. For anything open source, there's little reason to keep the Alpha at all. -- Måns Rullgård
Re: [PATCH] x86: Deprecate a.out support
On Mon, Mar 11, 2019 at 9:26 AM Måns Rullgård wrote: > > Anyone running an Alpha machine likely also has some old OSF/1 binaries > they may wish to use. It would be a shame to remove this feature, IMO. If that's the case then we'd have to keep a.out alive for alpha, since that's the OSF/1 binary format (at least the only one we support - I'm not sure if later versions of OSF/1 ended up getting ELF). Which I guess we could do, but the question is whether people really do have OSF/1 binaries. It was really useful early on as a source of known-good binaries to test with, but I'm not convinced it's still in use. It's not like there were OSF/1 binaries that we didn't havce access to natively (well, there _were_ special ones that didn't have open source versions, but most of them required more system-side support than Linux ever implemented, afaik). Linus
Re: [PATCH] x86: Deprecate a.out support
Linus Torvalds writes: > On Sun, Mar 10, 2019 at 2:37 PM Matt Turner wrote: >> >> I'm not aware of a reason to keep a.out support on alpha. > > Hmm. I was looking at removing a.out support entirely, but it's > actually fairly incestuous on alpha. > > For example, arch/alpha/boot/tools/objstrip.c very much has some a.out > support in it. Maybe it can just be removed entirely. > > There's also an a.out.h include in arch/alpha/kernel/binfmt_loader.c. > > Finally, note that CONFIG_OSF4_COMPAT also no longer makes sense > without a.out support. Anyone running an Alpha machine likely also has some old OSF/1 binaries they may wish to use. It would be a shame to remove this feature, IMO. -- Måns Rullgård
[PATCH] y2038: fix socket.h header inclusion
Referencing the __kernel_long_t type caused some user space applications to stop compiling when they had not already included linux/posix_types.h, e.g. s/multicast.c -o ext/sockets/multicast.lo In file included from /builddir/build/BUILD/php-7.3.3/main/php.h:468, from /builddir/build/BUILD/php-7.3.3/ext/sockets/sockets.c:27: /builddir/build/BUILD/php-7.3.3/ext/sockets/sockets.c: In function 'zm_startup_sockets': /builddir/build/BUILD/php-7.3.3/ext/sockets/sockets.c:776:40: error: '__kernel_long_t' undeclared (first use in this function) 776 | REGISTER_LONG_CONSTANT("SO_SNDTIMEO", SO_SNDTIMEO, CONST_CS | CONST_PERSISTENT); It is safe to include that header here, since it only contains kernel internal types that do not conflict with other user space types. It's still possible that some related build failures remain, but those are likely to be for code that is not already y2038 safe. Reported-by: Laura Abbott Fixes: a9beb86ae6e5 ("sock: Add SO_RCVTIMEO_NEW and SO_SNDTIMEO_NEW") Signed-off-by: Arnd Bergmann --- arch/alpha/include/uapi/asm/socket.h | 2 +- arch/mips/include/uapi/asm/socket.h | 2 +- arch/parisc/include/uapi/asm/socket.h | 2 +- arch/sparc/include/uapi/asm/socket.h | 2 +- include/uapi/asm-generic/socket.h | 2 +- 5 files changed, 5 insertions(+), 5 deletions(-) diff --git a/arch/alpha/include/uapi/asm/socket.h b/arch/alpha/include/uapi/asm/socket.h index 0d0fddb7e738..976e89b116e5 100644 --- a/arch/alpha/include/uapi/asm/socket.h +++ b/arch/alpha/include/uapi/asm/socket.h @@ -2,8 +2,8 @@ #ifndef _UAPI_ASM_SOCKET_H #define _UAPI_ASM_SOCKET_H +#include #include -#include /* For setsockopt(2) */ /* diff --git a/arch/mips/include/uapi/asm/socket.h b/arch/mips/include/uapi/asm/socket.h index eb9f33f8a8b3..d41765cfbc6e 100644 --- a/arch/mips/include/uapi/asm/socket.h +++ b/arch/mips/include/uapi/asm/socket.h @@ -10,8 +10,8 @@ #ifndef _UAPI_ASM_SOCKET_H #define _UAPI_ASM_SOCKET_H +#include #include -#include /* * For setsockopt(2) diff --git a/arch/parisc/include/uapi/asm/socket.h b/arch/parisc/include/uapi/asm/socket.h index 16e428f03526..66c5dd245ac7 100644 --- a/arch/parisc/include/uapi/asm/socket.h +++ b/arch/parisc/include/uapi/asm/socket.h @@ -2,8 +2,8 @@ #ifndef _UAPI_ASM_SOCKET_H #define _UAPI_ASM_SOCKET_H +#include #include -#include /* For setsockopt(2) */ #define SOL_SOCKET 0x diff --git a/arch/sparc/include/uapi/asm/socket.h b/arch/sparc/include/uapi/asm/socket.h index 88fe4f978aca..9265a9eece15 100644 --- a/arch/sparc/include/uapi/asm/socket.h +++ b/arch/sparc/include/uapi/asm/socket.h @@ -2,8 +2,8 @@ #ifndef _ASM_SOCKET_H #define _ASM_SOCKET_H +#include #include -#include /* For setsockopt(2) */ #define SOL_SOCKET 0x diff --git a/include/uapi/asm-generic/socket.h b/include/uapi/asm-generic/socket.h index c8b430cb6dc4..8c1391c89171 100644 --- a/include/uapi/asm-generic/socket.h +++ b/include/uapi/asm-generic/socket.h @@ -2,8 +2,8 @@ #ifndef __ASM_GENERIC_SOCKET_H #define __ASM_GENERIC_SOCKET_H +#include #include -#include /* For setsockopt(2) */ #define SOL_SOCKET 1 -- 2.20.0
[PATCH] arch: alpha: Kconfig: pedantic formatting
Formatting of Kconfig files doesn't look so pretty, so let the Great White Handkerchief come around and clean it up. Signed-off-by: Enrico Weigelt, metux IT consult --- arch/alpha/Kconfig | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/alpha/Kconfig b/arch/alpha/Kconfig index 584a6e1..ca70946 100644 --- a/arch/alpha/Kconfig +++ b/arch/alpha/Kconfig @@ -550,7 +550,7 @@ config NR_CPUS default "4" if !ALPHA_GENERIC && !ALPHA_MARVEL help MARVEL support can handle a maximum of 32 CPUs, all the others - with working support have a maximum of 4 CPUs. + with working support have a maximum of 4 CPUs. config ARCH_DISCONTIGMEM_ENABLE bool "Discontiguous Memory Support" @@ -662,7 +662,7 @@ choice endchoice config HZ - int + int default 32 if HZ_32 default 64 if HZ_64 default 128 if HZ_128 -- 1.9.1
Re: [PATCH] x86: Deprecate a.out support
On Sun, Mar 10, 2019 at 11:46 PM Linus Torvalds wrote: > > On Sun, Mar 10, 2019 at 2:37 PM Matt Turner wrote: > > > > I'm not aware of a reason to keep a.out support on alpha. > > Hmm. I was looking at removing a.out support entirely, but it's > actually fairly incestuous on alpha. > > For example, arch/alpha/boot/tools/objstrip.c very much has some a.out > support in it. Maybe it can just be removed entirely. > > There's also an a.out.h include in arch/alpha/kernel/binfmt_loader.c. > > Finally, note that CONFIG_OSF4_COMPAT also no longer makes sense > without a.out support. > > So this attached patch does not compile on alpha, but it's been many > many years since I had an alpha to test with, so I'm stuck. > > Matt, can you fill in the details and complete this patch? I wonder if we could remove the osf time32 compat code as well, this was one of the areas that kept causing problems with the y2038 rework. (I think it's all good now, but it's never been tested as far as I can tell). For some syscalls (e.g. brk, mmap, getxuid, ...) we definitely need to keep the osf1 version, since it is the only supported ABI. I just looked up some really old source trees and found that glibc-2.1 was the first release to use 64-bit time_t the way we do it today, as implemented in [1], so all Debian and SuSE releases for alpha had it, but any ELF binaries built on Red Hat Linux 4.x and 5.x (released 1996 through 1998) or earlier would use 32-bit time_t osf1 syscalls. Red Hat 2.x and 3.x were a.out based on alpha. Arnd [1] https://repo.or.cz/glibc/history.git/commitdiff/64819b5c3a94e81e4
Re: [PATCH] x86: Deprecate a.out support
Hi Linus! On 3/11/19 7:40 AM, Linus Torvalds wrote: > So this attached patch does not compile on alpha, but it's been many > many years since I had an alpha to test with, so I'm stuck. Michael Cree (CC'ed) has several Alpha servers running which are also used for building Debian packages and testing kernels. I also have four AlphaStations (233, 433au and XP-1000) on which I could test any patch. But I'm currently in Japan until Friday next week, so I don't have any means to access the machines. Adrian -- .''`. John Paul Adrian Glaubitz : :' : Debian Developer - glaub...@debian.org `. `' Freie Universitaet Berlin - glaub...@physik.fu-berlin.de `-GPG: 62FF 8A75 84E0 2956 9546 0006 7426 3B37 F5B5 F913
Re: [PATCH] x86: Deprecate a.out support
On Sun, Mar 10, 2019 at 03:40:20PM -0700, Linus Torvalds wrote: > SYSCALL_DEFINE3(osf_readv, unsigned long, fd, > const struct iovec __user *, vector, unsigned long, count) > { > -#ifdef CONFIG_OSF4_COMPAT > - if (unlikely(personality(current->personality) == PER_OSF4)) > - if (osf_fix_iov_len(vector, count)) > - return -EFAULT; > -#endif > - > return sys_readv(fd, vector, count); > } > > SYSCALL_DEFINE3(osf_writev, unsigned long, fd, > const struct iovec __user *, vector, unsigned long, count) > { > -#ifdef CONFIG_OSF4_COMPAT > - if (unlikely(personality(current->personality) == PER_OSF4)) > - if (osf_fix_iov_len(vector, count)) > - return -EFAULT; > -#endif > return sys_writev(fd, vector, count); > } Might as well kill those two off, while we are at it - just use sys_readv/sys_writev in arch/alpha/kernel/syscalls/syscall.tbl and be done with that...
Re: [PATCH] x86: Deprecate a.out support
On Sun, Mar 10, 2019 at 2:37 PM Matt Turner wrote: > > I'm not aware of a reason to keep a.out support on alpha. Hmm. I was looking at removing a.out support entirely, but it's actually fairly incestuous on alpha. For example, arch/alpha/boot/tools/objstrip.c very much has some a.out support in it. Maybe it can just be removed entirely. There's also an a.out.h include in arch/alpha/kernel/binfmt_loader.c. Finally, note that CONFIG_OSF4_COMPAT also no longer makes sense without a.out support. So this attached patch does not compile on alpha, but it's been many many years since I had an alpha to test with, so I'm stuck. Matt, can you fill in the details and complete this patch? Linus arch/alpha/Kconfig | 1 - arch/alpha/kernel/osf_sys.c | 30 arch/m68k/Kconfig | 1 - arch/x86/Kconfig| 7 - arch/x86/ia32/Makefile | 2 - arch/x86/ia32/ia32_aout.c | 330 -- fs/Kconfig.binfmt | 33 - fs/Makefile | 1 - fs/binfmt_aout.c| 343 include/linux/a.out.h | 18 --- 10 files changed, 766 deletions(-) diff --git a/arch/alpha/Kconfig b/arch/alpha/Kconfig index 584a6e114853..9b9770b45f36 100644 --- a/arch/alpha/Kconfig +++ b/arch/alpha/Kconfig @@ -10,7 +10,6 @@ config ALPHA select FORCE_PCI if !ALPHA_JENSEN select PCI_DOMAINS if PCI select PCI_SYSCALL if PCI - select HAVE_AOUT select HAVE_IDE select HAVE_OPROFILE select HAVE_PCSPKR_PLATFORM diff --git a/arch/alpha/kernel/osf_sys.c b/arch/alpha/kernel/osf_sys.c index bf497b8b0ec6..09a0746c9681 100644 --- a/arch/alpha/kernel/osf_sys.c +++ b/arch/alpha/kernel/osf_sys.c @@ -1342,45 +1342,15 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr, return addr; } -#ifdef CONFIG_OSF4_COMPAT -/* Clear top 32 bits of iov_len in the user's buffer for - compatibility with old versions of OSF/1 where iov_len - was defined as int. */ -static int -osf_fix_iov_len(const struct iovec __user *iov, unsigned long count) -{ - unsigned long i; - - for (i = 0 ; i < count ; i++) { - int __user *iov_len_high = (int __user *)[i].iov_len + 1; - - if (put_user(0, iov_len_high)) - return -EFAULT; - } - return 0; -} -#endif - SYSCALL_DEFINE3(osf_readv, unsigned long, fd, const struct iovec __user *, vector, unsigned long, count) { -#ifdef CONFIG_OSF4_COMPAT - if (unlikely(personality(current->personality) == PER_OSF4)) - if (osf_fix_iov_len(vector, count)) - return -EFAULT; -#endif - return sys_readv(fd, vector, count); } SYSCALL_DEFINE3(osf_writev, unsigned long, fd, const struct iovec __user *, vector, unsigned long, count) { -#ifdef CONFIG_OSF4_COMPAT - if (unlikely(personality(current->personality) == PER_OSF4)) - if (osf_fix_iov_len(vector, count)) - return -EFAULT; -#endif return sys_writev(fd, vector, count); } diff --git a/arch/m68k/Kconfig b/arch/m68k/Kconfig index b54206408f91..65d263c60669 100644 --- a/arch/m68k/Kconfig +++ b/arch/m68k/Kconfig @@ -8,7 +8,6 @@ config M68K select ARCH_NO_COHERENT_DMA_MMAP if !MMU select ARCH_NO_PREEMPT if !COLDFIRE select HAVE_IDE - select HAVE_AOUT if MMU select HAVE_DEBUG_BUGVERBOSE select GENERIC_IRQ_SHOW select GENERIC_ATOMIC64 diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index c1f9b3cf437c..4a9438e4fba6 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -2836,13 +2836,6 @@ config IA32_EMULATION 64-bit kernel. You should likely turn this on, unless you're 100% sure that you don't have any 32-bit programs left. -config IA32_AOUT - tristate "IA32 a.out support" - depends on IA32_EMULATION - depends on BROKEN - ---help--- - Support old a.out binaries in the 32bit emulation. - config X86_X32 bool "x32 ABI for 64-bit mode" depends on X86_64 diff --git a/arch/x86/ia32/Makefile b/arch/x86/ia32/Makefile index cd4339bae066..b98fedaa7642 100644 --- a/arch/x86/ia32/Makefile +++ b/arch/x86/ia32/Makefile @@ -4,7 +4,5 @@ obj-$(CONFIG_IA32_EMULATION) := sys_ia32.o ia32_signal.o -obj-$(CONFIG_IA32_AOUT) += ia32_aout.o - audit-class-$(CONFIG_AUDIT) := audit.o obj-$(CONFIG_IA32_EMULATION) += $(audit-class-y) diff --git a/arch/x86/ia32/ia32_aout.c b/arch/x86/ia32/ia32_aout.c deleted file mode 100644 index 3c135084e1eb.. --- a/arch/x86/ia32/ia32_aout.c +++ /dev/null @@ -1,330 +0,0 @@ -/* - * a.out loader for x86-64 - * - * Copyright (C) 1991, 1992, 1996 Linus Torvalds - * Hacked together by Andi Kleen - */ - -#include - -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include - -#include -#include -#include -#include -#include - -#undef WARN_OLD - -static int load_aout_binary(struct linux_binprm *); -static int load_aout_library(struct file *); - -static struct linux_binfmt
Re: [PATCH] x86: Deprecate a.out support
On Tue, Mar 5, 2019 at 11:04 AM Borislav Petkov wrote: > > On Tue, Mar 05, 2019 at 07:11:38PM +0100, Borislav Petkov wrote: > > I guess you could Cc arch maintainers with the a.out-core.h removal > > patch to see if anyone screams. > > And they're like two for which we need confirmation: > > $ git ls-files | grep a.out-core.h > arch/alpha/include/asm/a.out-core.h > arch/m68k/include/asm/a.out-core.h > arch/um/include/asm/a.out-core.h > arch/x86/include/asm/a.out-core.h > > um and x86 are clear. > > Adding alpha and m68k MLs to Cc. I'm not aware of a reason to keep a.out support on alpha.
Re: [PATCH] x86: Deprecate a.out support
Hi Borislav, On Tue, Mar 5, 2019 at 8:04 PM Borislav Petkov wrote: > On Tue, Mar 05, 2019 at 07:11:38PM +0100, Borislav Petkov wrote: > > I guess you could Cc arch maintainers with the a.out-core.h removal > > patch to see if anyone screams. > > And they're like two for which we need confirmation: > > $ git ls-files | grep a.out-core.h > arch/alpha/include/asm/a.out-core.h > arch/m68k/include/asm/a.out-core.h > arch/um/include/asm/a.out-core.h > arch/x86/include/asm/a.out-core.h > > um and x86 are clear. > > Adding alpha and m68k MLs to Cc. Thanks! The oldest binaries I still have lying around (an ext2 ramdisk image, still used from time to time) are ELF, from just after the a.out to ELF transition on m68k. I think it's safe to assume no one still runs a.out binaries on m68k. Gr{oetje,eeting}s, Geert -- Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that. -- Linus Torvalds
Re: [PATCH] x86: Deprecate a.out support
On Tue, Mar 05, 2019 at 07:11:38PM +0100, Borislav Petkov wrote: > I guess you could Cc arch maintainers with the a.out-core.h removal > patch to see if anyone screams. And they're like two for which we need confirmation: $ git ls-files | grep a.out-core.h arch/alpha/include/asm/a.out-core.h arch/m68k/include/asm/a.out-core.h arch/um/include/asm/a.out-core.h arch/x86/include/asm/a.out-core.h um and x86 are clear. Adding alpha and m68k MLs to Cc. -- Regards/Gruss, Boris. Good mailing practices for 400: avoid top-posting and trim the reply.
Re: [PATCH] add delay between port write and port read
On Fri, 1 Mar 2019, Linus Torvalds wrote: > > Without that, using __raw_xyz() to copy between RAM and > > buffers in PCI memory space is broken, as you said, but the > > assumption would be broken on certain older machines that > > do a hardware endian swap by swizzling the address lines rather > > than swapping bytes on the data bus. > > Honestly, there's no way we can ever fix that. Well, (: the good news is we actually have it covered in the MIPS port already, with the interfaces I described in the other e-mail. We have had it for some 15 years now, ever since we got the ATA PIO stuff covered. So all that we need is to agree upon names so that everyone does not come up with their own ones, write them down somewhere, and let individual port maintainers sort it out as they deem necessary. This way we have a solution available, but don't have to do anything. Win-win. > Because no new CPU's will ever be designed where the byte order isn't > little-endian, and nobody will ever make those broken "we'll do random > things in hardware to worfk around the fact that we're doing crazy > things" machines. Even better then: the trasformation will be trivial, i.e. pass-through, so all the new CPU stuff will merely alias all the accessor variants to each other. Which means we can put all that as a bunch of #defines in include/asm-generic/io.h and forget about it. Maciej
Re: [PATCH] add delay between port write and port read
On Fri, Mar 1, 2019 at 1:52 PM Arnd Bergmann wrote: > > Without that, using __raw_xyz() to copy between RAM and > buffers in PCI memory space is broken, as you said, but the > assumption would be broken on certain older machines that > do a hardware endian swap by swizzling the address lines rather > than swapping bytes on the data bus. Honestly, there's no way we can ever fix that. The fact is, __raw_readl/writel simply isn't portable. If a driver uses them, the driver HAS TO KNOW what the situation is on the hardware it runs on. Because the "byte order" of the end result will basically be random on some machines. And you know what? We absolutely do not care. Not one whit. Because no new CPU's will ever be designed where the byte order isn't little-endian, and nobody will ever make those broken "we'll do random things in hardware to worfk around the fact that we're doing crazy things" machines. So it's all legacy anyway, and it's absolutely not worth over-designing or over-thinking this issue. If you have a random SoC that does insane things, that random SoC may (a) need its own drivers that know exactly how it works (b) not be able to use some drivers for IP that the SoC has, but the SoC designers screwed up and did some random crazy byte lane swizzling and that's all perfectly fine. We don't care. That SoC gets to do its own drivers. And 99% of truly generic drivers won't have any of these issues, because they don't use __raw_readl/writel to begin with. So don't make this any stranger than it is. Just accept that by design, and by definition, __raw_readl/writel is a bit non-portable, and won't work for everything. The corollary of that is that __raw_readl/writel() may also then be used _for_ that mis-designed piece of garbage SoC to implement the special driver needed for just that SoC. But as mentioned, this simply isn't a problem in practice. Sure, people can do insane things if they really want to. Maybe some "stable genius" decides to do a big-endian RISC-V core. We don't care, and we don't make everybody else have to worry about that kind of insanity. Linus
Re: [PATCH] add delay between port write and port read
On Fri, 1 Mar 2019, Arnd Bergmann wrote: > > I think the whole point of __raw_xyz() is that it's the lowest level > > model. It gives you relaxed ordering (together with the ioremap > > model), and it gives you straight-through behavior. > > > > And yes, any driver using them needs to be aware of the byte ordering, > > which may or may not be the same as regular memory, and may or may not > > be the same as other devices. > > > > So __raw_xyz() is very much for low-level drivers that know what they > > are doing. Caveat user. > > > > "If it breaks, you get to keep both pieces" > > I agree in principle, but I think we already have a lot of precedence > for __raw_xyz() being relied on having a specific behavior in > architecture independent drivers, and I think it makes sense for > architectures to provide that. > > Specifically, I think we need __raw_xyz() to do the same as xyz() > on all little-endian kernels regarding byte ordering (not barriers), and > I would expect it to provide the same ordering and addressing > as swabX(xyz()) on big-endian kernels. > > Without that, using __raw_xyz() to copy between RAM and > buffers in PCI memory space is broken, as you said, but the > assumption would be broken on certain older machines that > do a hardware endian swap by swizzling the address lines rather > than swapping bytes on the data bus. Ah, swizzling! Thanks for reminding me of that. The swizzling machines will be those that do bit (rather than byte) lane matching in hardware and consequently produce a numeric value of data that is consistent between the CPU and a device when accessed in bus-width quantities. These do indeed require address adjustment if you access a narrower quantity. > The best idea I have for working around this is to never rely > on __raw_xyz() to not do byte swapping in platform specific > drivers with CPU-endian MMIO space, but to have a platform > specific set of wrappers around the normal I/O functions, and > make __raw_xyz() just do whatever we expect them to do on > PCI devices. For the record in the MIPS port we have the following accessors: 1. `readX', `writeX', `inX', `outX': data passed in the CPU endianness, ordered WRT MMIO and (for reads) WRT memory; suitable for device CSR accesses, 2. `__mem_readX', `__mem_writeX', `__mem_inX', `__mem_outX': data passed in the memory endianness, ordered WRT MMIO and (for reads) WRT memory, suitable for data transfers between memory and a device, e.g. PIO ATA, 3. `__relaxed_readX', `__relaxed_writeX': data passed in the CPU endianness, ordered WRT MMIO only; aliased to `*_relaxed', 4. `__raw_readX', `__raw_writeX': data passed straight-through with no transformation, ordered WRT MMIO and (for reads) WRT memory. For correct operation of generic drivers we need at least #1 and #2, and we need #4 for platform drivers (which will know they're on a local bus); #3 is an optimisation only that is nice having, but drivers would do with #1 instead. Note however that depending on the wiring of a particular big-endian machine we may have to swizzle addresses for #1, #2, #3, and then byte-swap either #1 and #3, or #2. See arch/mips/include/asm/*/mangle-port.h for all the mess. It actually took quite a while back in early 2000s to get PIO ATA to work correctly on big-endian MIPS machines, because people did not understand the subtlety between the CPU and the memory endianness, and consequently that different accessors are required for the 16-bit ATA data register; you just can't get 16-bit ATA data transfers and 16-bit CSR accesses made elsewhere right with the use of a single accessor only, we need two. I haven't looked at what other ports did in this area, but we do need at least #1 and #2, strongly-ordered, kernel-wide, and then we can discuss what the semantics of `__raw_xyz' is supposed to be, but at least some ports will have to retain the semantics of #4, strongly-ordered, under whatever name. I think using a uniform one would be advantageous though, for obvious reasons. NB the old IDE driver had special local provisions for PIO, which wired #2 accessors on MIPS by including a platform specific header that provided IDE-specific names and I suppose libata carried that over or it wouldn't work. But I think that instead of having such driver-specific hacks that require port maintainers' attention on a case-by-case basis we really ought to provide #2 semantics kernel-wide. Maciej
Re: [PATCH] add delay between port write and port read
On Fri, Mar 1, 2019 at 8:19 PM Linus Torvalds wrote: > > On Fri, Mar 1, 2019 at 11:13 AM Maciej W. Rozycki > wrote: > > > > What do we do WRT straight-through vs byte-swapping properties of these > > accessors? > > I think the whole point of __raw_xyz() is that it's the lowest level > model. It gives you relaxed ordering (together with the ioremap > model), and it gives you straight-through behavior. > > And yes, any driver using them needs to be aware of the byte ordering, > which may or may not be the same as regular memory, and may or may not > be the same as other devices. > > So __raw_xyz() is very much for low-level drivers that know what they > are doing. Caveat user. > > "If it breaks, you get to keep both pieces" I agree in principle, but I think we already have a lot of precedence for __raw_xyz() being relied on having a specific behavior in architecture independent drivers, and I think it makes sense for architectures to provide that. Specifically, I think we need __raw_xyz() to do the same as xyz() on all little-endian kernels regarding byte ordering (not barriers), and I would expect it to provide the same ordering and addressing as swabX(xyz()) on big-endian kernels. Without that, using __raw_xyz() to copy between RAM and buffers in PCI memory space is broken, as you said, but the assumption would be broken on certain older machines that do a hardware endian swap by swizzling the address lines rather than swapping bytes on the data bus. The best idea I have for working around this is to never rely on __raw_xyz() to not do byte swapping in platform specific drivers with CPU-endian MMIO space, but to have a platform specific set of wrappers around the normal I/O functions, and make __raw_xyz() just do whatever we expect them to do on PCI devices. Arnd
Re: [PATCH] add delay between port write and port read
On Fri, 1 Mar 2019, Linus Torvalds wrote: > Because quite often you don't want any extra byte ordering because > you've moving things around anyway (ie you're copying from the device > to memory or similar, and switching to little-endian in between would > just mean that you have to switch back for the memory write anyway). Good point! Maciej
Re: [PATCH] add delay between port write and port read
On Fri, Mar 1, 2019 at 11:13 AM Maciej W. Rozycki wrote: > > What do we do WRT straight-through vs byte-swapping properties of these > accessors? I think the whole point of __raw_xyz() is that it's the lowest level model. It gives you relaxed ordering (together with the ioremap model), and it gives you straight-through behavior. And yes, any driver using them needs to be aware of the byte ordering, which may or may not be the same as regular memory, and may or may not be the same as other devices. So __raw_xyz() is very much for low-level drivers that know what they are doing. Caveat user. "If it breaks, you get to keep both pieces" Linus
Re: [PATCH] add delay between port write and port read
On Fri, 1 Mar 2019, Linus Torvalds wrote: > So that would seem what an architecture implementation should _aim_ > for: having various "ioremap_xyz()" for setting the > PCIe/system/whatever controller level ordering, and then using the > "__raw_xyz()" accessors for unordered CPU accesses. What do we do WRT straight-through vs byte-swapping properties of these accessors? For the record: we do need to have straight-through accessors for endianness-agnostic peripherals. This is the case for example with SOC devices that have plenty of I/O onchip that are always accessed natively regardless of the endianness of the external bus. E.g. with the Broadcom SiByte BCM1250 SOC we have a whole lot of devices onchip (including a pair of MIPS64 CPU cores), and then PCI and HT endpoints, and the SOC's endianness is configurable at reset. The endianness selected only applies to the external bus, affecting external data lanes: in the big-endian case there's a choice between matching byte lanes or bit lanes by using one of the two MMIO physical address spaces that apply either of these policies; in the little-endian case access is pass-through of course. Consequently we have to use straight-through accessors for onchip devices and byte-swapped ones for PCI/HT devices. Maciej
Re: [PATCH] add delay between port write and port read
On Fri, Mar 1, 2019 at 10:03 AM Maciej W. Rozycki wrote: > > Well, `__raw_*' accessors are never byte-swapped, not at least with the > MIPS port, making them a tad cumbersome for a driver that has no interest > in paying attention to any endianness mismatch between the CPU bus and the > device's peripheral bus. Well, the people who want ultimate performance and not worry about access ordering almost always _also_ want to handle byte ordering manually. Because quite often you don't want any extra byte ordering because you've moving things around anyway (ie you're copying from the device to memory or similar, and switching to little-endian in between would just mean that you have to switch back for the memory write anyway). Does it make things more complicated for a driver when it has to care about byte ordering and not just say "PCI is always little-endian"? Yes. But if you want simple, you shouldn't be doing the unordered accesses. Linus
Re: [PATCH] add delay between port write and port read
Hi Will, > > FAOD, I think this assumption/requirement only applies to the plain > > accessors (`inX', `readX', `ioreadX', etc.). > > It's also a requirement for the *_relaxed accessors, and there are drivers > that rely on this being the case. Well, from the reading of memory-barriers.txt, be it as it stands, or with your rewording applied, I take it the `*_relaxed' accessors do not guarantee ordering WRT locking or DMA and hence a trailing barrier in `readX_relaxed' is not necessary, and so we don't need a trailing `mb' with the Alpha port either (but we do need a leading `mb' there, as well as with `writeX_relaxed'). > > For performance reasons we may decide sometime to opt in for accessors > > that do not suffer from the requirement to be strongly ordered WRT each > > other, for the benefit to architectures that are not strongly ordered with > > MMIO and that suffer a lot from serialising accesses that do not really > > care, e.g. where you need to load a bunch of device registers or maybe > > even device RAM in any order before making a serialised final request to > > accept the values loaded. > > I'd expect accesses to device RAM to use something like ioremap_wc() if > possible. In that case, the ordering of accesses is weakened by the > underlying memory type in the page tables, but we're not yet at the point > where we've figured out the portable semantics in this case. I plan to > look at that once we've nailed normal ioremap()! Some CPU hardware and certainly many if not most MIPS implementations do not give such a finegrained control over mapping. I think all the MIPS CPUs I dealt with only gave the choice between a cached (coherent or not) and an uncached mapping, with the ordering being weak in all cases. So the only control over ordering was with explicit barrier instructions. > > That piece of hardware is however rather peculiar and not an example of > > the most common design seen nowadays and I am not sure if the extra > > maintenance burden across all the ports for any additional accessors would > > be outweighed by the benefit for the weakly ordered MMIO architectures > > (where an execution stall can indeed count in hundreds of clock cycles per > > barrier inserted) combined with the appreciation (i.e. actual use) level > > from driver writers who do not necessarily grok all that weak ordering > > business. > > If you use ioremap_wc() for device RAM and __raw_readX() for bunching up > register accesses, then I don't think we need to add anything new, do we? Well, `__raw_*' accessors are never byte-swapped, not at least with the MIPS port, making them a tad cumbersome for a driver that has no interest in paying attention to any endianness mismatch between the CPU bus and the device's peripheral bus. Granted this does not matter for this driver as it only has a chance to be used with DECstation (MIPS), DEC 3000 AXP (Alpha) and VAXstation 4000 (VAX) hardware, all of which are little-endian throughout (and not all of which we have support for upstream). Still I'd consider it a hack as obviously we do have drivers for options living on say PCI, which is little-endian, that we want to use with processors configured for the big endianness with their FSB. Maciej
Re: [PATCH] add delay between port write and port read
On Fri, Mar 1, 2019 at 9:33 AM Will Deacon wrote: > > I'd expect accesses to device RAM to use something like ioremap_wc() if > possible. In that case, the ordering of accesses is weakened by the > underlying memory type in the page tables, but we're not yet at the point > where we've figured out the portable semantics in this case. I plan to > look at that once we've nailed normal ioremap()! The case that matters most from a performance standpoint is traditionally stupid framebuffer accesses, and the fb layer has basically standardized this: #define fb_writel __raw_writel together with various fb drivers then using "wmb()" etc for ordering for the non-framebuffer effects. So that would seem what an architecture implementation should _aim_ for: having various "ioremap_xyz()" for setting the PCIe/system/whatever controller level ordering, and then using the "__raw_xyz()" accessors for unordered CPU accesses. Linus
Re: [PATCH] add delay between port write and port read
Hi Maciej, On Wed, Feb 27, 2019 at 06:49:57PM +, Maciej W. Rozycki wrote: > On Tue, 26 Feb 2019, Will Deacon wrote: > > > > If they are the same device (just different data ports), I'd > > > *definitely* expect them to be ordered. > > > > > > We have tons of code that depends on that. Almost every driver out > > > there, in fact. > > > > > > So we need the mb() on alpha to guarantee the access ordering on the > > > CPU side, and then PCI itself ends up guaranteeing that accesses to > > > the same device will remain ordered outside the CPU. > > > > > > Agreed? > > > > Yup, agreed. I'd consider all those ports to be the same endpoint, so we're > > good. > > FAOD, I think this assumption/requirement only applies to the plain > accessors (`inX', `readX', `ioreadX', etc.). It's also a requirement for the *_relaxed accessors, and there are drivers that rely on this being the case. > For performance reasons we may decide sometime to opt in for accessors > that do not suffer from the requirement to be strongly ordered WRT each > other, for the benefit to architectures that are not strongly ordered with > MMIO and that suffer a lot from serialising accesses that do not really > care, e.g. where you need to load a bunch of device registers or maybe > even device RAM in any order before making a serialised final request to > accept the values loaded. I'd expect accesses to device RAM to use something like ioremap_wc() if possible. In that case, the ordering of accesses is weakened by the underlying memory type in the page tables, but we're not yet at the point where we've figured out the portable semantics in this case. I plan to look at that once we've nailed normal ioremap()! > I made provisions for that with a driver I recently added with commit > 61414f5ec983 ("FDDI: defza: Add support for DEC FDDIcontroller 700 > TURBOchannel adapter"), where locally defined accessor macros suffixed > with `_o' and `_u' denote accesses that have to be strongly ordered and > can be weakly ordered respectively WRT each other. > > Right now they all expand to the respective `_relaxed' accessors (with a > lone `dma_rmb' inserted appropriately; yes, the device does DMA one way > only, and the other one is PIO with a lot of MMIO traffic to board RAM > that would benefit from omitting barriers), however they can be replaced > with references to truly unordered accessors if we ever have them. > > That piece of hardware is however rather peculiar and not an example of > the most common design seen nowadays and I am not sure if the extra > maintenance burden across all the ports for any additional accessors would > be outweighed by the benefit for the weakly ordered MMIO architectures > (where an execution stall can indeed count in hundreds of clock cycles per > barrier inserted) combined with the appreciation (i.e. actual use) level > from driver writers who do not necessarily grok all that weak ordering > business. If you use ioremap_wc() for device RAM and __raw_readX() for bunching up register accesses, then I don't think we need to add anything new, do we? Will
Re: [PATCH] add delay between port write and port read
On Wed, Feb 27, 2019 at 10:54 AM Maciej W. Rozycki wrote: > > For that reason though we don't have the trailing barrier in the > `readX_relaxed' accessors in the MIPS port. Makes sense. And alpha should probably follow suit. Linus
Re: [PATCH] add delay between port write and port read
On Wed, 27 Feb 2019, Linus Torvalds wrote: > > Should "writeb_relaxed" on Alpha also use the barrier? > > I think they should. Only the double-underscore (__raw_xyz()) ones > are entirely unordered, the relaxed ones are just unordered wrt > regular memory and DMA. For that reason though we don't have the trailing barrier in the `readX_relaxed' accessors in the MIPS port. Maciej
Re: [PATCH] add delay between port write and port read
On Tue, 26 Feb 2019, Will Deacon wrote: > > If they are the same device (just different data ports), I'd > > *definitely* expect them to be ordered. > > > > We have tons of code that depends on that. Almost every driver out > > there, in fact. > > > > So we need the mb() on alpha to guarantee the access ordering on the > > CPU side, and then PCI itself ends up guaranteeing that accesses to > > the same device will remain ordered outside the CPU. > > > > Agreed? > > Yup, agreed. I'd consider all those ports to be the same endpoint, so we're > good. FAOD, I think this assumption/requirement only applies to the plain accessors (`inX', `readX', `ioreadX', etc.). For performance reasons we may decide sometime to opt in for accessors that do not suffer from the requirement to be strongly ordered WRT each other, for the benefit to architectures that are not strongly ordered with MMIO and that suffer a lot from serialising accesses that do not really care, e.g. where you need to load a bunch of device registers or maybe even device RAM in any order before making a serialised final request to accept the values loaded. I made provisions for that with a driver I recently added with commit 61414f5ec983 ("FDDI: defza: Add support for DEC FDDIcontroller 700 TURBOchannel adapter"), where locally defined accessor macros suffixed with `_o' and `_u' denote accesses that have to be strongly ordered and can be weakly ordered respectively WRT each other. Right now they all expand to the respective `_relaxed' accessors (with a lone `dma_rmb' inserted appropriately; yes, the device does DMA one way only, and the other one is PIO with a lot of MMIO traffic to board RAM that would benefit from omitting barriers), however they can be replaced with references to truly unordered accessors if we ever have them. That piece of hardware is however rather peculiar and not an example of the most common design seen nowadays and I am not sure if the extra maintenance burden across all the ports for any additional accessors would be outweighed by the benefit for the weakly ordered MMIO architectures (where an execution stall can indeed count in hundreds of clock cycles per barrier inserted) combined with the appreciation (i.e. actual use) level from driver writers who do not necessarily grok all that weak ordering business. Maciej
Re: [PATCH] add delay between port write and port read
On Wed, 27 Feb 2019, Linus Torvalds wrote: > > I suppose you might need the mb() before *and* after the I/O access in the > > read case. The idea with readX()/ioreadX() is that you should be able to > > do something like: > > Yeah, that sounds reasonable. > > You might relax the barrier after the readX() to just a rmb(), which > might make the performance impact slightly less noticeable and might > be sufficient in practice. But I guess once you do IO, it's not like > the CPU barrier will be the limiting case. FWIW in the MIPS port we've had it as `rmb' since Sinan's commit a1cc7034e33d ("MIPS: io: Add barrier after register read in readX()"), but then we expand this macro to a hardware SYNC instruction anyway (rather than SYNC_RMB or at most SYNC_MB), which is stronger these days even as it has been at one point redefined in the architecture as a completion rather than an ordering barrier. NB MIPS also has SYNC_ACQUIRE and SYNC_RELEASE too, which are assymetric load <= load/store and load/store <= store ordering barriers respectively, obviously meant for locking. > So maybe just a full mb() and see if anybody notices. Better to have > working code than random failures. For Alpha it doesn't matter anyway as it doesn't have a separate read barrier. Although I'd prefer to have the expected semantics recorded even if the underlying implementation is the same, as otherwise it gets even more confusing to people than it already is. Maciej
Re: [PATCH] add delay between port write and port read
On Wed, 27 Feb 2019, Sinan Kaya wrote: > What we missed is the fact that alpha reorders accesses across two > register accesses. This is guaranteed in other architectures. Not for MIPS either. Maciej
Re: [PATCH] add delay between port write and port read
On Wed, Feb 27, 2019 at 9:26 AM Will Deacon wrote: > > I suppose you might need the mb() before *and* after the I/O access in the > read case. The idea with readX()/ioreadX() is that you should be able to > do something like: Yeah, that sounds reasonable. You might relax the barrier after the readX() to just a rmb(), which might make the performance impact slightly less noticeable and might be sufficient in practice. But I guess once you do IO, it's not like the CPU barrier will be the limiting case. That said, excessively weakly ordered CPU's are horrible for a reason, and the barrier model was broken and made them even more so. And alpha is a posted boy for "don't let hw people design a memory odering just to make it easy for them". So maybe just a full mb() and see if anybody notices. Better to have working code than random failures. Linus
Re: [PATCH] add delay between port write and port read
On Wed, Feb 27, 2019 at 9:05 AM Mikulas Patocka wrote: > > Should "writeb_relaxed" on Alpha also use the barrier? I think they should. Only the double-underscore (__raw_xyz()) ones are entirely unordered, the relaxed ones are just unordered wrt regular memory and DMA. That said, I doubt anything much cares, and I'm not sure alpha has ever done that. Linus
Re: [PATCH] add delay between port write and port read
On Wed, Feb 27, 2019 at 12:12:51PM -0500, Mikulas Patocka wrote: > > > On Tue, 26 Feb 2019, Linus Torvalds wrote: > > > Does anybody see any worries with the "just move the mb() earlier in > > the ioread functions" model? > > > > Linus > > It used to be like that and it worked. > > Then, commits cd0e00c106722eca40b38ebf11cf134c01901086 and > 92d7223a74235054f2aa7227d207d9c57f84dca0 came. > > These commits claim that they changed the code to be consistent with the > specification (now we have barrier after ioread and before iowrite). But > they broke serial port and RTC on my Alpha machine. I suppose you might need the mb() before *and* after the I/O access in the read case. The idea with readX()/ioreadX() is that you should be able to do something like: status = ioread8(STATUS_REG); if (status & RX_DMA_COMPLETE) memcpy(mem_buf, dma_buf, size); and rely on the DMA'd data being read out of the buffer. Will
Re: [PATCH] add delay between port write and port read
On 2/27/2019 12:12 PM, Mikulas Patocka wrote: It used to be like that and it worked. Then, commits cd0e00c106722eca40b38ebf11cf134c01901086 and 92d7223a74235054f2aa7227d207d9c57f84dca0 came. These commits claim that they changed the code to be consistent with the specification (now we have barrier after ioread and before iowrite). But they broke serial port and RTC on my Alpha machine. A barrier before write is needed to ensure that memory operations done are visible to the hardware before you send a write command. Barrier after read is needed to ensure that if you read a memory location after register read, memory contents are coherent. What we missed is the fact that alpha reorders accesses across two register accesses. This is guaranteed in other architectures. To satisfy this, you need something like barrier register write barrier register read barrier
Re: [PATCH] add delay between port write and port read
On Tue, 26 Feb 2019, Linus Torvalds wrote: > Does anybody see any worries with the "just move the mb() earlier in > the ioread functions" model? > > Linus It used to be like that and it worked. Then, commits cd0e00c106722eca40b38ebf11cf134c01901086 and 92d7223a74235054f2aa7227d207d9c57f84dca0 came. These commits claim that they changed the code to be consistent with the specification (now we have barrier after ioread and before iowrite). But they broke serial port and RTC on my Alpha machine. Mikulas
[PATCH v3 01/34] alpha: mm: Add p?d_large() definitions
walk_page_range() is going to be allowed to walk page tables other than those of user space. For this it needs to know when it has reached a 'leaf' entry in the page tables. This information will be provided by the p?d_large() functions/macros. For alpha, we don't support huge pages, so add stubs returning 0. CC: linux-alpha@vger.kernel.org CC: Richard Henderson CC: Ivan Kokshaysky CC: Matt Turner Signed-off-by: Steven Price --- arch/alpha/include/asm/pgtable.h | 2 ++ 1 file changed, 2 insertions(+) diff --git a/arch/alpha/include/asm/pgtable.h b/arch/alpha/include/asm/pgtable.h index 89c2032f9960..e5726d3a3200 100644 --- a/arch/alpha/include/asm/pgtable.h +++ b/arch/alpha/include/asm/pgtable.h @@ -254,11 +254,13 @@ extern inline void pte_clear(struct mm_struct *mm, unsigned long addr, pte_t *pt extern inline int pmd_none(pmd_t pmd) { return !pmd_val(pmd); } extern inline int pmd_bad(pmd_t pmd) { return (pmd_val(pmd) & ~_PFN_MASK) != _PAGE_TABLE; } extern inline int pmd_present(pmd_t pmd) { return pmd_val(pmd) & _PAGE_VALID; } +extern inline int pmd_large(pmd_t pmd) { return 0; } extern inline void pmd_clear(pmd_t * pmdp) { pmd_val(*pmdp) = 0; } extern inline int pgd_none(pgd_t pgd) { return !pgd_val(pgd); } extern inline int pgd_bad(pgd_t pgd) { return (pgd_val(pgd) & ~_PFN_MASK) != _PAGE_TABLE; } extern inline int pgd_present(pgd_t pgd) { return pgd_val(pgd) & _PAGE_VALID; } +extern inline int pgd_large(pgd_t pgd) { return 0; } extern inline void pgd_clear(pgd_t * pgdp) { pgd_val(*pgdp) = 0; } /* -- 2.20.1
Re: [PATCH] add delay between port write and port read
On Tue, 26 Feb 2019, Linus Torvalds wrote: > On Tue, Feb 26, 2019 at 10:38 AM Will Deacon wrote: > > > > That makes sense to me for this Alpha-specific case, but in general I > > don't think we require that I/O accesses to different endpoints are > > ordered with respect to each other, which was implied by one of Maciej's > > examples. e.g. > > > > writeb(0x42, _mmio_reg); > > readb(_mmio_reg); > > If they are the same device (just different data ports), I'd > *definitely* expect them to be ordered. > > We have tons of code that depends on that. Almost every driver out > there, in fact. > > So we need the mb() on alpha to guarantee the access ordering on the > CPU side, and then PCI itself ends up guaranteeing that accesses to > the same device will remain ordered outside the CPU. > > Agreed? > > Linus Should "writeb_relaxed" on Alpha also use the barrier? The understanding is that readX_relaxed and writeX_relaxed are ordered with other accesses to the same PCI device. If Alpha doesn't gaurantee that, they should use barriers too. Mikulas
Re: [PATCH] add delay between port write and port read
On Tue, Feb 26, 2019 at 10:43:24AM -0800, Linus Torvalds wrote: > On Tue, Feb 26, 2019 at 10:38 AM Will Deacon wrote: > > > > That makes sense to me for this Alpha-specific case, but in general I > > don't think we require that I/O accesses to different endpoints are > > ordered with respect to each other, which was implied by one of Maciej's > > examples. e.g. > > > > writeb(0x42, _mmio_reg); > > readb(_mmio_reg); > > If they are the same device (just different data ports), I'd > *definitely* expect them to be ordered. > > We have tons of code that depends on that. Almost every driver out > there, in fact. > > So we need the mb() on alpha to guarantee the access ordering on the > CPU side, and then PCI itself ends up guaranteeing that accesses to > the same device will remain ordered outside the CPU. > > Agreed? Yup, agreed. I'd consider all those ports to be the same endpoint, so we're good. Will
Re: [PATCH] add delay between port write and port read
On Tue, Feb 26, 2019 at 10:38 AM Will Deacon wrote: > > That makes sense to me for this Alpha-specific case, but in general I > don't think we require that I/O accesses to different endpoints are > ordered with respect to each other, which was implied by one of Maciej's > examples. e.g. > > writeb(0x42, _mmio_reg); > readb(_mmio_reg); If they are the same device (just different data ports), I'd *definitely* expect them to be ordered. We have tons of code that depends on that. Almost every driver out there, in fact. So we need the mb() on alpha to guarantee the access ordering on the CPU side, and then PCI itself ends up guaranteeing that accesses to the same device will remain ordered outside the CPU. Agreed? Linus
Re: [PATCH] add delay between port write and port read
On Tue, Feb 26, 2019 at 10:12:03AM -0800, Linus Torvalds wrote: > On Tue, Feb 26, 2019 at 9:52 AM Maciej W. Rozycki > wrote: > > > > The thing is taking for example `ioreadX' and `iowriteX' we have `mb' > > before a write and `mb' after a read. So if we do say (addresses are > > arbitrary for illustration only): > > > > iowrite8(0, 1); > > ioread8(2); > > > > then these effectively expand to chipset-specific: > > > > mb(); > > foo_iowrite8(0, 1); > > foo_ioread8(2); > > mb(); > > Yeah, looking at the current alpha io.c file, I'd suggest that all IO > operations just do the "mb()" _before_ doing the op. > > That way all IO ops are at least ordered wrt each other, any any IO > ops are ordered wrt any memory writes before it (in particular the > "people wrote to memory, and then did a IO write to start DMA" case). > > Also, since "spin_unlock()" on alpha has a memory barrier before > releasing the lock, it means that IO ends up being ordered wrt locks > too. > > Does it mean that a "ioread()" can then be delayed until after later > non-IO memory operations? Yes. But that seems harmless > > That said, I didn't spend a _ton_ of time thinking about this and > looking at the code, but it does seem like the simplest model really > is: "just always do a mb() before any IO ops". I think the "order IO > wrt each other, and wrt any _preceding_ memory traffic (and all > locking) is the important part, and that would seem to guarantee it. > > Obviously there can then be other re-ordering at the IO layer level > (ie posted writes and IO being re-ordered across different devices > etc), but that's universal and not alpha-specific. > > Does anybody see any worries with the "just move the mb() earlier in > the ioread functions" model? That makes sense to me for this Alpha-specific case, but in general I don't think we require that I/O accesses to different endpoints are ordered with respect to each other, which was implied by one of Maciej's examples. e.g. writeb(0x42, _mmio_reg); readb(_mmio_reg); are not ordered with respect to each other (and I think if you wanted this ordering in general you're going to need more than just fences). I've been trying to write this up in memory-barriers.txt, so if anybody has comments or concerns I'm keen to hear them, especially if there's a feeling that we should be providing stronger guarantees for port I/O: http://lkml.kernel.org/r/20190222175525.1198-1-will.dea...@arm.com Cheers, Will
Re: [PATCH] add delay between port write and port read
On Tue, Feb 26, 2019 at 9:52 AM Maciej W. Rozycki wrote: > > The thing is taking for example `ioreadX' and `iowriteX' we have `mb' > before a write and `mb' after a read. So if we do say (addresses are > arbitrary for illustration only): > > iowrite8(0, 1); > ioread8(2); > > then these effectively expand to chipset-specific: > > mb(); > foo_iowrite8(0, 1); > foo_ioread8(2); > mb(); Yeah, looking at the current alpha io.c file, I'd suggest that all IO operations just do the "mb()" _before_ doing the op. That way all IO ops are at least ordered wrt each other, any any IO ops are ordered wrt any memory writes before it (in particular the "people wrote to memory, and then did a IO write to start DMA" case). Also, since "spin_unlock()" on alpha has a memory barrier before releasing the lock, it means that IO ends up being ordered wrt locks too. Does it mean that a "ioread()" can then be delayed until after later non-IO memory operations? Yes. But that seems harmless That said, I didn't spend a _ton_ of time thinking about this and looking at the code, but it does seem like the simplest model really is: "just always do a mb() before any IO ops". I think the "order IO wrt each other, and wrt any _preceding_ memory traffic (and all locking) is the important part, and that would seem to guarantee it. Obviously there can then be other re-ordering at the IO layer level (ie posted writes and IO being re-ordered across different devices etc), but that's universal and not alpha-specific. Does anybody see any worries with the "just move the mb() earlier in the ioread functions" model? Linus
Re: [PATCH] add delay between port write and port read
Linus, I have added you to this discussion in hope you can clear any uncertainty there might be about MMIO ordering properties of the Alpha; a question for you is towards the end of this e-mail. On Thu, 21 Feb 2019, Mikulas Patocka wrote: > Do you think that we should fix this by identifying hardware that needs > the delays and adding the delays there? First of all we need to correctly identify what is really going on there. And I've had a look at our arch/alpha/kernel/io.c and arch/alpha/include/asm/io.h as they stand now and AFAICT what they have is completely broken and if it works anywhere, then by pure luck only, e.g. because external bus accesses complete quickly enough for code execution not to progress to a subsequent external bus access. The thing is taking for example `ioreadX' and `iowriteX' we have `mb' before a write and `mb' after a read. So if we do say (addresses are arbitrary for illustration only): iowrite8(0, 1); ioread8(2); then these effectively expand to chipset-specific: mb(); foo_iowrite8(0, 1); foo_ioread8(2); mb(); Which means `foo_iowrite8' and `foo_ioread8' are unordered WRT each other as there is no barrier between them. Worse yet for: iowrite8(0, 1); ioread8(1); the `ioread8' operation may not even appear externally, as the CPU may peek the value in the CPU's write buffer while the write is still pending. A similar consideration can be done for `readX' and `writeX', as well as a mixture between the two kinds of these accesses. We do need to add a preceding `mb' to all the `*readX' accessors to satisfy the strong ordering requirement WRT any preceding `*writeX' access. We need that leading `mb' in `*_relaxed' accessors as well. And we need to alias `mmiowb' to `wmb' (unless it's removed as I've seen with Will's recent patches). NB, the 82378ZB SIO adds a minimum of 5 ISA clocks between back-to-back accesses from PCI to ISA, which AFAIU should be enough for the PC87332 Super I/O. More can be configured if required, but I doubt that is needed, because it wasn't for 20+ years. > In my opinion, adding mb() to the port accessing functions is safer - it > is 6 line patch. The amount of code does not matter as long as it is wrong. We need to get this right. > Reading all the hardware manuals is time consuming and hardly anyone will > do it for 25 years old hardware. I do this regularly and I'm fine doing it this time too, also for the value of the education gained. I'll see if I can experiment with my hardware too, but regrettably that'll have to wait a couple of months at the very least (due to Brexit :( -- I have left my Alpha in the UK and the very last weekend I literally headed off to stay away from all of the upcoming mess). Meanwhile here is what I have tracked down in the architecture manual[1]: "Access to local I/O space does not cause any implicit read/write ordering; explicit barrier instructions must be used to ensure any desired ordering. Barrier instructions must be used: * After updating a memory-resident data structure and before writing a local I/O space location to notify the device of the updates. * Between multiple consecutive direct accesses to local I/O space, e.g. device control registers, if those accesses are expected to be ordered at the device. Again, note that implementations may cache not only memory-resident data structures, but also local I/O space locations." -- which in my opinion makes it clear that we need these barriers that were removed with commit 92d7223a7423 ("alpha: io: reorder barriers to guarantee writeX() and iowriteX() ordering #2"), in addition to ones added there (the `*_relaxed' accessors don't need the latter ones though). NB for some reason the whole chapter on I/O has been reduced in later revisions of the architecure manual[2][3] (as well as the shortened version of the book called the architecture handbook) to a single-page overview lacking any details. I think that does not make the architecture ordered any more strongly WRT MMIO though; instead it makes it even less specified. Linus, can you confirm or contradict my conclusions taken from the manual in the context of their later removal? You did the Alpha port back in mid-1990s (BTW, Jon 'maddog' spoke very fondly about that effort at FOSDEM earlier this month) and certainly had to know these details to complete it and you worked with people directly involved with the design of the Alpha at the time it was being developed, and given the somewhat terse formal specification you could therefore be the only person around I know of who can give a definite answer. I hope you still remember the bits. Questions, thoughts? References: [1] Richard L. Sites, ed., "Alpha Architecture Reference Manual", Digital Press, 1992, Chapter 8, "Input/Output (I)", Section 8.2.1 "Read/Write Ordering",
Re: [PATCH] add delay between port write and port read
On Wed, Feb 20, 2019 at 05:52:18PM +0100, Arnd Bergmann wrote: > On Wed, Feb 20, 2019 at 4:38 PM Maciej W. Rozycki > wrote: > > On Wed, 20 Feb 2019, Mikulas Patocka wrote: > > > > Well, actually `iowriteX' is generic and for MMIO you have `writeX'. > > > > See lib/iomap.c, the comment at the top and e.g. `iowrite8' there for an > > > > actual proof. Alpha may have an oddball implementation, but there you > > > > go. > > > > Drivers will assume they can do `iowriteX' to any kind of I/O resource, > > > > and ordering must be respected as per Documentation/memory-barriers.txt. > > > > > > So, do you think that the barrier whould be added to iowriteX and slow > > > down every MMIO access? > > > > We need it either for `outX' and `iowriteX' calls operating on port I/O > > resources, or for neither of them, both at a time, to ensure the required > > consistency between the two interfaces. If that badly affects MMIO (and > > is not required there; please remind me what the exact justification to > > use `mb' here is, as it's not entirely clear to me from the commit > > message; `mb' is a barrier and not means for a delay), then we need to > > find a away for `iowriteX' to tell port I/O and MMIO accesses apart and > > only apply the barrier for the former kind. > > Will Deacon is in the process of sanitizing our documentation for this, > and this part is still under discussion. This thread seems to be confusing barriers with delays in places, but fwiw, I agree with everything Maciej has said. I'm about to post a new version of my memory-barriers.txt updates, so I'll cc people on that patch and we can discuss the ordering side of things over there. Will
ATM VISA CARD
ATTENTION: WE RECEIVE YOUR CONTENT OF YOUR EMAIL FORM THIS DHL MASTER CARD OFFICES FUND OF $USD.8 MILLION AFTER THE BOARD OF DIRECTORS MEETING,WE DECIDED TO ISSUE YOU AN ATM VISA CARD VALUED AT 8 MILLION DOLLAR. THIS IS TO BRING TO YOUR NOTICE THAT YOUR VALUED SUM OF 8 MILLION DOLLAR HAS BEING TODAY CREDITED INTO ATM VISA CARD AND WE HAVE HANDLE IT TO OUR REMITANCE DEPARTMENT TO SEND IT TO YOU TODAY IN YOUR FAVOR.WITH THE VISA CARD YOU CAN BE ABLE TO MAKE WITHDRAW OF $10,000.00 UNITED STATE DOLLARS DAILIES AS PROGRAMMED UNTIL YOU WITHDRAW YOUR TOTAL SUM IN ANY ATM MACHINE. WE ARE SENDING YOU THE SCAN COPY OF YOUR ATM VISA CARD WHICH WE HAVE REGISTERED IN OUR SYSTEM FOR PAYMENT RECORD AS SOON AS WE RECIEVE YOUR INFORMATION AND YOUR HOME ADDRESS YOUR ATM VISA CARD WE BE SEND TO YOU THROUGH DHL COURIER OFFICE WE HAVE RECEIVED A SIGNAL FROM THE SWISS WORLD BANK TO INFECT YOUR TRANSFER TO YOU WITHIN ONE WEEK, WE HAVE JUST FINISH OUR ANNUAL MEETING WITH THE CENTRAL BANK OF BURKINA FASO (BCEAO) AND THE END OF THE MEETING WE HAVE CONCLUDED TO MAKE YOUR PAYMENT TO YOU THROUGH OUR ATM VISA CARD. AND YOUR VALUE SUM HAS BE CREDITED INTO YOUR ATM VISA ACCOUNT. WHICH YOU WILL USE TO WITHDRAW YOUR FUND IN ANY PART OF THE WORLD, WE HAVE ISSUE AND CREDIT YOUR ATM CARD IN YOUR NAME TODAY .THE CARD WILL BE INSURE AND SEND TO YOU THROUGH DHL COURIER SERVICE .VIEW YOUR ATTACHED FILE FOR THE COPY OF YOUR ATM CARD.THIS IS THE PAYMENT DEPARTMENT EMAIL ID FOR YOU TO CONTACT US (visadepartmen...@gmail.com) YOU SHOULD FORWARD TO THIS FINANCIAL INSTITUTION YOUR RESIDENTIAL ADDRESS AND CALL THE BANK DIRECTOR WITH THIS NUMBER :+226-79 15 93 02. CONGRATULATION NOTICE: IF YOU FIND ANYTHING THAT YOU DON’T UNDERSTAND PLEASE KINDLY INFORM THIS FINANCIAL INSTITUTION FOR URGENT ASSISTANCE.YOU ARE ADVISE TO SEND YOUR INFORMATION DOWN IMMEDATILY WITH YOUR PHONE NUMBER WHEN THEY ARRIVEL IN YOUR COUNTRY THEY WE CALL YOU. EMAIL;(visadepartmen...@gmail.com ) DIRECTOR NAME MUSA KAMAL
Re: [PATCH v4 3/3] locking/rwsem: Optimize down_read_trylock()
On 02/21/2019 09:14 AM, Will Deacon wrote: > On Wed, Feb 13, 2019 at 05:00:17PM -0500, Waiman Long wrote: >> Modify __down_read_trylock() to optimize for an unlocked rwsem and make >> it generate slightly better code. >> >> Before this patch, down_read_trylock: >> >>0x <+0>: callq 0x5 >>0x0005 <+5>: jmp0x18 >>0x0007 <+7>: lea0x1(%rdx),%rcx >>0x000b <+11>:mov%rdx,%rax >>0x000e <+14>:lock cmpxchg %rcx,(%rdi) >>0x0013 <+19>:cmp%rax,%rdx >>0x0016 <+22>:je 0x23 >>0x0018 <+24>:mov(%rdi),%rdx >>0x001b <+27>:test %rdx,%rdx >>0x001e <+30>:jns0x7 >>0x0020 <+32>:xor%eax,%eax >>0x0022 <+34>:retq >>0x0023 <+35>:mov%gs:0x0,%rax >>0x002c <+44>:or $0x3,%rax >>0x0030 <+48>:mov%rax,0x20(%rdi) >>0x0034 <+52>:mov$0x1,%eax >>0x0039 <+57>:retq >> >> After patch, down_read_trylock: >> >>0x <+0>: callq 0x5 >>0x0005 <+5>: xor%eax,%eax >>0x0007 <+7>: lea0x1(%rax),%rdx >>0x000b <+11>: lock cmpxchg %rdx,(%rdi) >>0x0010 <+16>: jne0x29 >>0x0012 <+18>: mov%gs:0x0,%rax >>0x001b <+27>: or $0x3,%rax >>0x001f <+31>: mov%rax,0x20(%rdi) >>0x0023 <+35>: mov$0x1,%eax >>0x0028 <+40>: retq >>0x0029 <+41>: test %rax,%rax >>0x002c <+44>: jns0x7 >>0x002e <+46>: xor%eax,%eax >>0x0030 <+48>: retq >> >> By using a rwsem microbenchmark, the down_read_trylock() rate (with a >> load of 10 to lengthen the lock critical section) on a x86-64 system >> before and after the patch were: >> >> Before PatchAfter Patch >># of Threads rlock rlock >> - - >> 1 14,496 14,716 >> 28,644 8,453 >> 46,799 6,983 >> 85,664 7,190 >> >> On a ARM64 system, the performance results were: >> >> Before PatchAfter Patch >># of Threads rlock rlock >> - - >> 1 23,676 24,488 >> 27,697 9,502 >> 44,945 3,440 >> 82,641 1,603 >> >> For the uncontended case (1 thread), the new down_read_trylock() is a >> little bit faster. For the contended cases, the new down_read_trylock() >> perform pretty well in x86-64, but performance degrades at high >> contention level on ARM64. >> >> Suggested-by: Linus Torvalds >> Signed-off-by: Waiman Long >> --- >> kernel/locking/rwsem.h | 13 - >> 1 file changed, 8 insertions(+), 5 deletions(-) >> >> diff --git a/kernel/locking/rwsem.h b/kernel/locking/rwsem.h >> index 45ee002..1f5775a 100644 >> --- a/kernel/locking/rwsem.h >> +++ b/kernel/locking/rwsem.h >> @@ -174,14 +174,17 @@ static inline int __down_read_killable(struct >> rw_semaphore *sem) >> >> static inline int __down_read_trylock(struct rw_semaphore *sem) >> { >> -long tmp; >> +/* >> + * Optimize for the case when the rwsem is not locked at all. >> + */ >> +long tmp = RWSEM_UNLOCKED_VALUE; >> >> -while ((tmp = atomic_long_read(>count)) >= 0) { >> -if (tmp == atomic_long_cmpxchg_acquire(>count, tmp, >> - tmp + RWSEM_ACTIVE_READ_BIAS)) { >> +do { >> +if (atomic_long_try_cmpxchg_acquire(>count, , >> +tmp + RWSEM_ACTIVE_READ_BIAS)) { >> return 1; >> } >> -} >> +} while (tmp >= 0); > Nit: but I guess that should be RWSEM_UNLOCKED_VALUE instead of 0. > > Will Using RWSEM_UNLOCKED_VALUE may be better. Anyway, it is not a big deal as I am going to change this again in a later patch. Thanks, Longman RWSEM_UNLOCKED_VALUE
Re: [PATCH] add delay between port write and port read
Do you think that we should fix this by identifying hardware that needs the delays and adding the delays there? In my opinion, adding mb() to the port accessing functions is safer - it is 6 line patch. Reading all the hardware manuals is time consuming and hardly anyone will do it for 25 years old hardware. Mikulas On Wed, 20 Feb 2019, Maciej W. Rozycki wrote: > On Wed, 20 Feb 2019, Mikulas Patocka wrote: > > > > Will Deacon is in the process of sanitizing our documentation for this, > > > and this part is still under discussion. > > > > > > I personally don't see a problem with having different barrier > > > semantics between outb() and iowrite8(), the reasoning being that > > > any caller of iowriteX() would already have to be careful about > > > posted writes when iowriteX is backed by ioremapped memory > > > space rather than port I/O. > > > > > > I think the more important question we have to figure out here > > > however is why exactly the barrier is needed for alpha, as I still > > > don't fully understand the issue: > > > > Port delays were common on MS-DOS with the ISA bus. On the ISA bus, the > > device cannot reject a transaction, so the programmer has to wait until > > the device is ready before talking to it. > > > > My hypothesis is that the engineers who built this Alpha Avanti machine > > simply bought some crappy serial line and real-time clock logic from some > > ISA bus vendor. And so it requires delays. The PCI stuff on the same > > machine doesn't require any extra delays. > > What you call crap was in the day, i.e. back in 1994, a state-of-the art > ISA super I/O chip (the NS87332) and a standard RTC chip (the BQ3287). > There was no LPC in existence yet when the Avanti line was built, the > south bridge used was the i82378 SIO, additionally wiring two actual ISA > slots. > > Datasheets fo all these pieces should be available; I'm fairly sure I > have at least some of them myself. Also full engineering documentation is > available for the Avanti. So it can all be verified quite easily. > > If any of these devices need actual ISA access delays, then `outX_p' and > `inX_p' accessors should be used by the respective drivers, and we need to > implement these accordingly; that would be no different to a standard x86 > PC with say an ISA serial port (I still have such hardware BTW). > > BTW, I have recently got an Avanti system myself, an AS/250 I believe, > although for an unknown reason placed in an AS/300 enclosure. The main > board of the two systems is identical and the only difference is the > maximum amount of RAM supported anyway. I have't wired it properly yet > though, so I can't do any verification ATM, but I will try sometime later. > > > > a) maybe the barrier here is only needed to provide the > > >non-posted PCI write behavior required by outb(), in that > > >case I would suggest adding the barriers to outX() but > > >perhaps not iowriteX() > > > > Except for serial line and real-time-clock, the machine works OK. So > > there's no need to slow down regural PCI accesses that use iowriteX. > > If actual delays are required rather than strong ordering only, then the > respective drivers need to be fixed, however platform code still has to > enforce the ordering semantics we put into the respective internal > interfaces. > > > > b) or the barriers are in fact needed for /any/ I/O operation > > >on alpha, to ensure that a store is ordered with regard to > > >a following load. If this is the case, we need the barriers > > >in all three families: outX(), writeX() and iowriteX(). > > > > My understanding is that multiple accesses to the same PCI space are > > ordered. > > I believe they are not; there's not such thing as "a PCI space" from the > CPU's point of view, its just another MMIO location, and it's the CPU that > is weakly ordered, not the chipset. > > I'll come back with the relevant citations from the architecture manual, > ISTR I quoted them before already. > > Maciej >
Re: [PATCH v4 3/3] locking/rwsem: Optimize down_read_trylock()
On Wed, Feb 13, 2019 at 05:00:17PM -0500, Waiman Long wrote: > Modify __down_read_trylock() to optimize for an unlocked rwsem and make > it generate slightly better code. > > Before this patch, down_read_trylock: > >0x <+0>: callq 0x5 >0x0005 <+5>: jmp0x18 >0x0007 <+7>: lea0x1(%rdx),%rcx >0x000b <+11>:mov%rdx,%rax >0x000e <+14>:lock cmpxchg %rcx,(%rdi) >0x0013 <+19>:cmp%rax,%rdx >0x0016 <+22>:je 0x23 >0x0018 <+24>:mov(%rdi),%rdx >0x001b <+27>:test %rdx,%rdx >0x001e <+30>:jns0x7 >0x0020 <+32>:xor%eax,%eax >0x0022 <+34>:retq >0x0023 <+35>:mov%gs:0x0,%rax >0x002c <+44>:or $0x3,%rax >0x0030 <+48>:mov%rax,0x20(%rdi) >0x0034 <+52>:mov$0x1,%eax >0x0039 <+57>:retq > > After patch, down_read_trylock: > >0x <+0>: callq 0x5 >0x0005 <+5>: xor%eax,%eax >0x0007 <+7>: lea0x1(%rax),%rdx >0x000b <+11>: lock cmpxchg %rdx,(%rdi) >0x0010 <+16>: jne0x29 >0x0012 <+18>: mov%gs:0x0,%rax >0x001b <+27>: or $0x3,%rax >0x001f <+31>: mov%rax,0x20(%rdi) >0x0023 <+35>: mov$0x1,%eax >0x0028 <+40>: retq >0x0029 <+41>: test %rax,%rax >0x002c <+44>: jns0x7 >0x002e <+46>: xor%eax,%eax >0x0030 <+48>: retq > > By using a rwsem microbenchmark, the down_read_trylock() rate (with a > load of 10 to lengthen the lock critical section) on a x86-64 system > before and after the patch were: > > Before PatchAfter Patch ># of Threads rlock rlock > - - > 1 14,496 14,716 > 28,644 8,453 > 46,799 6,983 > 85,664 7,190 > > On a ARM64 system, the performance results were: > > Before PatchAfter Patch ># of Threads rlock rlock > - - > 1 23,676 24,488 > 27,697 9,502 > 44,945 3,440 > 82,641 1,603 > > For the uncontended case (1 thread), the new down_read_trylock() is a > little bit faster. For the contended cases, the new down_read_trylock() > perform pretty well in x86-64, but performance degrades at high > contention level on ARM64. > > Suggested-by: Linus Torvalds > Signed-off-by: Waiman Long > --- > kernel/locking/rwsem.h | 13 - > 1 file changed, 8 insertions(+), 5 deletions(-) > > diff --git a/kernel/locking/rwsem.h b/kernel/locking/rwsem.h > index 45ee002..1f5775a 100644 > --- a/kernel/locking/rwsem.h > +++ b/kernel/locking/rwsem.h > @@ -174,14 +174,17 @@ static inline int __down_read_killable(struct > rw_semaphore *sem) > > static inline int __down_read_trylock(struct rw_semaphore *sem) > { > - long tmp; > + /* > + * Optimize for the case when the rwsem is not locked at all. > + */ > + long tmp = RWSEM_UNLOCKED_VALUE; > > - while ((tmp = atomic_long_read(>count)) >= 0) { > - if (tmp == atomic_long_cmpxchg_acquire(>count, tmp, > -tmp + RWSEM_ACTIVE_READ_BIAS)) { > + do { > + if (atomic_long_try_cmpxchg_acquire(>count, , > + tmp + RWSEM_ACTIVE_READ_BIAS)) { > return 1; > } > - } > + } while (tmp >= 0); Nit: but I guess that should be RWSEM_UNLOCKED_VALUE instead of 0. Will
Re: ext4 corruption on alpha with 4.20.0-09062-gd8372ba8ce28
On Thu 21-02-19 01:23:50, Meelis Roos wrote: > > > First, I found out that both the problematic alphas had memory compaction > > > and > > > page migration and bounce buffers turned on, and working alphas had them > > > off. > > > > > > Next, turing off these options makes the problematic alphas work. > > > > OK, thanks for testing! Can you narrow down whether the problem is due to > > CONFIG_BOUNCE or CONFIG_MIGRATION + CONFIG_COMPACTION? These are two > > completely different things so knowing where to look will help. Thanks! > > Tested both. > > Just CONFIG_MIGRATION + CONFIG_COMPACTION breaks the alpha. > Just CONFIG_BOUNCE has no effect in 5 tries. OK, so page migration is problematic. Thanks for confirmation! Honza -- Jan Kara SUSE Labs, CR
Re: ext4 corruption on alpha with 4.20.0-09062-gd8372ba8ce28
First, I found out that both the problematic alphas had memory compaction and page migration and bounce buffers turned on, and working alphas had them off. Next, turing off these options makes the problematic alphas work. OK, thanks for testing! Can you narrow down whether the problem is due to CONFIG_BOUNCE or CONFIG_MIGRATION + CONFIG_COMPACTION? These are two completely different things so knowing where to look will help. Thanks! Tested both. Just CONFIG_MIGRATION + CONFIG_COMPACTION breaks the alpha. Just CONFIG_BOUNCE has no effect in 5 tries. -- Meelis Roos
Re: [PATCH] add delay between port write and port read
On Wed, 20 Feb 2019, Mikulas Patocka wrote: > > Will Deacon is in the process of sanitizing our documentation for this, > > and this part is still under discussion. > > > > I personally don't see a problem with having different barrier > > semantics between outb() and iowrite8(), the reasoning being that > > any caller of iowriteX() would already have to be careful about > > posted writes when iowriteX is backed by ioremapped memory > > space rather than port I/O. > > > > I think the more important question we have to figure out here > > however is why exactly the barrier is needed for alpha, as I still > > don't fully understand the issue: > > Port delays were common on MS-DOS with the ISA bus. On the ISA bus, the > device cannot reject a transaction, so the programmer has to wait until > the device is ready before talking to it. > > My hypothesis is that the engineers who built this Alpha Avanti machine > simply bought some crappy serial line and real-time clock logic from some > ISA bus vendor. And so it requires delays. The PCI stuff on the same > machine doesn't require any extra delays. What you call crap was in the day, i.e. back in 1994, a state-of-the art ISA super I/O chip (the NS87332) and a standard RTC chip (the BQ3287). There was no LPC in existence yet when the Avanti line was built, the south bridge used was the i82378 SIO, additionally wiring two actual ISA slots. Datasheets fo all these pieces should be available; I'm fairly sure I have at least some of them myself. Also full engineering documentation is available for the Avanti. So it can all be verified quite easily. If any of these devices need actual ISA access delays, then `outX_p' and `inX_p' accessors should be used by the respective drivers, and we need to implement these accordingly; that would be no different to a standard x86 PC with say an ISA serial port (I still have such hardware BTW). BTW, I have recently got an Avanti system myself, an AS/250 I believe, although for an unknown reason placed in an AS/300 enclosure. The main board of the two systems is identical and the only difference is the maximum amount of RAM supported anyway. I have't wired it properly yet though, so I can't do any verification ATM, but I will try sometime later. > > a) maybe the barrier here is only needed to provide the > >non-posted PCI write behavior required by outb(), in that > >case I would suggest adding the barriers to outX() but > >perhaps not iowriteX() > > Except for serial line and real-time-clock, the machine works OK. So > there's no need to slow down regural PCI accesses that use iowriteX. If actual delays are required rather than strong ordering only, then the respective drivers need to be fixed, however platform code still has to enforce the ordering semantics we put into the respective internal interfaces. > > b) or the barriers are in fact needed for /any/ I/O operation > >on alpha, to ensure that a store is ordered with regard to > >a following load. If this is the case, we need the barriers > >in all three families: outX(), writeX() and iowriteX(). > > My understanding is that multiple accesses to the same PCI space are > ordered. I believe they are not; there's not such thing as "a PCI space" from the CPU's point of view, its just another MMIO location, and it's the CPU that is weakly ordered, not the chipset. I'll come back with the relevant citations from the architecture manual, ISTR I quoted them before already. Maciej
Re: [PATCH] add delay between port write and port read
On Wed, 20 Feb 2019, Arnd Bergmann wrote: > On Wed, Feb 20, 2019 at 4:38 PM Maciej W. Rozycki > wrote: > > On Wed, 20 Feb 2019, Mikulas Patocka wrote: > > > > Well, actually `iowriteX' is generic and for MMIO you have `writeX'. > > > > See lib/iomap.c, the comment at the top and e.g. `iowrite8' there for an > > > > actual proof. Alpha may have an oddball implementation, but there you > > > > go. > > > > Drivers will assume they can do `iowriteX' to any kind of I/O resource, > > > > and ordering must be respected as per Documentation/memory-barriers.txt. > > > > > > So, do you think that the barrier whould be added to iowriteX and slow > > > down every MMIO access? > > > > We need it either for `outX' and `iowriteX' calls operating on port I/O > > resources, or for neither of them, both at a time, to ensure the required > > consistency between the two interfaces. If that badly affects MMIO (and > > is not required there; please remind me what the exact justification to > > use `mb' here is, as it's not entirely clear to me from the commit > > message; `mb' is a barrier and not means for a delay), then we need to > > find a away for `iowriteX' to tell port I/O and MMIO accesses apart and > > only apply the barrier for the former kind. > > Will Deacon is in the process of sanitizing our documentation for this, > and this part is still under discussion. > > I personally don't see a problem with having different barrier > semantics between outb() and iowrite8(), the reasoning being that > any caller of iowriteX() would already have to be careful about > posted writes when iowriteX is backed by ioremapped memory > space rather than port I/O. > > I think the more important question we have to figure out here > however is why exactly the barrier is needed for alpha, as I still > don't fully understand the issue: Port delays were common on MS-DOS with the ISA bus. On the ISA bus, the device cannot reject a transaction, so the programmer has to wait until the device is ready before talking to it. My hypothesis is that the engineers who built this Alpha Avanti machine simply bought some crappy serial line and real-time clock logic from some ISA bus vendor. And so it requires delays. The PCI stuff on the same machine doesn't require any extra delays. > a) maybe the barrier here is only needed to provide the >non-posted PCI write behavior required by outb(), in that >case I would suggest adding the barriers to outX() but >perhaps not iowriteX() Except for serial line and real-time-clock, the machine works OK. So there's no need to slow down regural PCI accesses that use iowriteX. > b) or the barriers are in fact needed for /any/ I/O operation >on alpha, to ensure that a store is ordered with regard to >a following load. If this is the case, we need the barriers >in all three families: outX(), writeX() and iowriteX(). > > Arnd My understanding is that multiple accesses to the same PCI space are ordered. Mikulas
Re: [PATCH] add delay between port write and port read
On Wed, 20 Feb 2019, Arnd Bergmann wrote: > I personally don't see a problem with having different barrier > semantics between outb() and iowrite8(), the reasoning being that > any caller of iowriteX() would already have to be careful about > posted writes when iowriteX is backed by ioremapped memory > space rather than port I/O. The problem is architectures (such as certainly MIPS) may imply posted writes for accesses that ultimately end up as port I/O (most architectures obviously have no real port I/O, and it all boils down to accessing an MMIO address range used to issue I/O access bus cycles to ISA or PCI), due to write buffering within the CPU itself. This has to be dealt with by explicits barrier instructions inserted in port-specific I/O accessors so that generic driver code has a consistent view regardless of the peculiarities of the underlying CPU architecture. > b) or the barriers are in fact needed for /any/ I/O operation >on alpha, to ensure that a store is ordered with regard to >a following load. If this is the case, we need the barriers >in all three families: outX(), writeX() and iowriteX(). I do believe Alpha is weakly ordered, at least architecturally, when it comes to any I/O (uncached) operations; I'll double-check with the architecture manual, as I recall the wording there was peculiar. NB I did a bunch of fixes for the MIPS port somewhat recently: 8b656253a7a4 MIPS: Provide actually relaxed MMIO accessors 3d474dacae72 MIPS: Enforce strong ordering for MMIO accessors a711d43cbbaa MIPS: Correct `mmiowb' barrier for `wbflush' platforms 4ae0452bddca MIPS: Define MMIO ordering barriers as it was similarly broken in this regard. Maciej
Re: [PATCH] add delay between port write and port read
On Tue, 19 Feb 2019, Maciej W. Rozycki wrote: > On Tue, 19 Feb 2019, Mikulas Patocka wrote: > > > > As I say, shouldn't the barrier be in `iowriteX' instead? I don't > > > suppose these are allowed to be weakly ordered. > > > > > > Maciej > > > > iowriteX is for memory-mapped I/O, out[bwl] is for I/O ports. > > Well, actually `iowriteX' is generic and for MMIO you have `writeX'. > See lib/iomap.c, the comment at the top and e.g. `iowrite8' there for an > actual proof. Alpha may have an oddball implementation, but there you go. > Drivers will assume they can do `iowriteX' to any kind of I/O resource, > and ordering must be respected as per Documentation/memory-barriers.txt. > > Maciej So, do you think that the barrier whould be added to iowriteX and slow down every MMIO access? Mikulas
Re: [PATCH v4 1/3] locking/rwsem: Remove arch specific rwsem files
On Wed, Feb 13, 2019 at 05:00:15PM -0500, Waiman Long wrote: > As the generic rwsem-xadd code is using the appropriate acquire and > release versions of the atomic operations, the arch specific rwsem.h > files will not be that much faster than the generic code as long as the > atomic functions are properly implemented. So we can remove those arch > specific rwsem.h and stop building asm/rwsem.h to reduce maintenance > effort. > > Currently, only x86, alpha and ia64 have implemented architecture > specific fast paths. I don't have access to alpha and ia64 systems for > testing, but they are legacy systems that are not likely to be updated > to the latest kernel anyway. > > By using a rwsem microbenchmark, the total locking rates on a 4-socket > 56-core 112-thread x86-64 system before and after the patch were as > follows (mixed means equal # of read and write locks): > > Before Patch After Patch ># of Threads wlock rlock mixed wlock rlock mixed > - - - - - - > 129,201 30,143 29,45828,615 30,172 29,201 > 2 6,807 13,299 1,171 7,725 15,025 1,804 > 4 6,504 12,755 1,520 7,127 14,286 1,345 > 8 6,762 13,412 764 6,826 13,652 726 >16 6,693 15,408 662 6,599 15,938 626 >32 6,145 15,286 496 5,549 15,487 511 >64 5,812 15,495 60 5,858 15,572 60 > > There were some run-to-run variations for the multi-thread tests. For > x86-64, using the generic C code fast path seems to be a little bit > faster than the assembly version with low lock contention. Looking at > the assembly version of the fast paths, there are assembly to/from C > code wrappers that save and restore all the callee-clobbered registers > (7 registers on x86-64). The assembly generated from the generic C > code doesn't need to do that. That may explain the slight performance > gain here. > > The generic asm rwsem.h can also be merged into kernel/locking/rwsem.h > with no code change as no other code other than those under > kernel/locking needs to access the internal rwsem macros and functions. > > Signed-off-by: Waiman Long > --- > MAINTAINERS | 1 - > arch/alpha/include/asm/rwsem.h | 211 --- > arch/arm/include/asm/Kbuild | 1 - > arch/arm64/include/asm/Kbuild | 1 - > arch/hexagon/include/asm/Kbuild | 1 - > arch/ia64/include/asm/rwsem.h | 172 - > arch/powerpc/include/asm/Kbuild | 1 - > arch/s390/include/asm/Kbuild| 1 - > arch/sh/include/asm/Kbuild | 1 - > arch/sparc/include/asm/Kbuild | 1 - > arch/x86/include/asm/rwsem.h| 237 > > arch/x86/lib/Makefile | 1 - > arch/x86/lib/rwsem.S| 156 -- > arch/x86/um/Makefile| 1 - > arch/xtensa/include/asm/Kbuild | 1 - > include/asm-generic/rwsem.h | 140 > include/linux/rwsem.h | 4 +- > kernel/locking/percpu-rwsem.c | 2 + > kernel/locking/rwsem.h | 130 ++ > 19 files changed, 133 insertions(+), 930 deletions(-) > delete mode 100644 arch/alpha/include/asm/rwsem.h > delete mode 100644 arch/ia64/include/asm/rwsem.h > delete mode 100644 arch/x86/include/asm/rwsem.h > delete mode 100644 arch/x86/lib/rwsem.S > delete mode 100644 include/asm-generic/rwsem.h Looks like a nice cleanup, thanks: Acked-by: Will Deacon Will
Re: [PATCH v4 2/3] locking/rwsem: Remove rwsem-spinlock.c & use rwsem-xadd.c for all archs
On Wed, Feb 13, 2019 at 05:00:16PM -0500, Waiman Long wrote: > Currently, we have two different implementation of rwsem: > 1) CONFIG_RWSEM_GENERIC_SPINLOCK (rwsem-spinlock.c) > 2) CONFIG_RWSEM_XCHGADD_ALGORITHM (rwsem-xadd.c) > > As we are going to use a single generic implementation for rwsem-xadd.c > and no architecture-specific code will be needed, there is no point > in keeping two different implementations of rwsem. In most cases, the > performance of rwsem-spinlock.c will be worse. It also doesn't get all > the performance tuning and optimizations that had been implemented in > rwsem-xadd.c over the years. > > For simplication, we are going to remove rwsem-spinlock.c and make all > architectures use a single implementation of rwsem - rwsem-xadd.c. > > All references to RWSEM_GENERIC_SPINLOCK and RWSEM_XCHGADD_ALGORITHM > in the code are removed. > > Suggested-by: Peter Zijlstra > Signed-off-by: Waiman Long > --- > arch/alpha/Kconfig | 7 - > arch/arc/Kconfig| 3 - > arch/arm/Kconfig| 4 - > arch/arm64/Kconfig | 3 - > arch/c6x/Kconfig| 3 - > arch/csky/Kconfig | 3 - > arch/h8300/Kconfig | 3 - > arch/hexagon/Kconfig| 6 - > arch/ia64/Kconfig | 4 - > arch/m68k/Kconfig | 7 - > arch/microblaze/Kconfig | 6 - > arch/mips/Kconfig | 7 - > arch/nds32/Kconfig | 3 - > arch/nios2/Kconfig | 3 - > arch/openrisc/Kconfig | 6 - > arch/parisc/Kconfig | 6 - > arch/powerpc/Kconfig| 7 - > arch/riscv/Kconfig | 3 - > arch/s390/Kconfig | 6 - > arch/sh/Kconfig | 6 - > arch/sparc/Kconfig | 8 - > arch/unicore32/Kconfig | 6 - > arch/x86/Kconfig| 3 - > arch/x86/um/Kconfig | 6 - > arch/xtensa/Kconfig | 3 - > include/linux/rwsem-spinlock.h | 47 -- > include/linux/rwsem.h | 5 - > kernel/Kconfig.locks| 2 +- > kernel/locking/Makefile | 4 +- > kernel/locking/rwsem-spinlock.c | 339 > > kernel/locking/rwsem.h | 3 - > 31 files changed, 2 insertions(+), 520 deletions(-) > delete mode 100644 include/linux/rwsem-spinlock.h > delete mode 100644 kernel/locking/rwsem-spinlock.c Another nice cleanup, and it looks like the optimistic spinning in the xadd implementation is predicated on ARCH_SUPPORTS_ATOMIC_RMW, so we're all good there too. Acked-by: Will Deacon Will
Re: ext4 corruption on alpha with 4.20.0-09062-gd8372ba8ce28
On Wed 20-02-19 08:31:05, Meelis Roos wrote: > > Could > > https://lore.kernel.org/linux-mm/20190219123212.29838-1-lar...@axis.com/T/#u > > be relevant? > > Tried it, still broken. OK, I didn't put too much hope into this patch as you see filesystem metadata corruption so icache/dcache coherency issues seemed unlikely. Still good that you've tried so that we are sure. > I wrote: > > > But my kernel config had memory compaction (that turned on page migration) > > and > > bounce buffers. I do not remember why I found them necessary but I will try > > without them. > > First, I found out that both the problematic alphas had memory compaction and > page migration and bounce buffers turned on, and working alphas had them off. > > Next, turing off these options makes the problematic alphas work. OK, thanks for testing! Can you narrow down whether the problem is due to CONFIG_BOUNCE or CONFIG_MIGRATION + CONFIG_COMPACTION? These are two completely different things so knowing where to look will help. Thanks! Honza -- Jan Kara SUSE Labs, CR
Re: ext4 corruption on alpha with 4.20.0-09062-gd8372ba8ce28
Could https://lore.kernel.org/linux-mm/20190219123212.29838-1-lar...@axis.com/T/#u be relevant? Tried it, still broken. I wrote: But my kernel config had memory compaction (that turned on page migration) and bounce buffers. I do not remember why I found them necessary but I will try without them. First, I found out that both the problematic alphas had memory compaction and page migration and bounce buffers turned on, and working alphas had them off. Next, turing off these options makes the problematic alphas work. -- Meelis Roos
Re: [PATCH] add delay between port write and port read
On Tue, 19 Feb 2019, Mikulas Patocka wrote: > > As I say, shouldn't the barrier be in `iowriteX' instead? I don't > > suppose these are allowed to be weakly ordered. > > > > Maciej > > iowriteX is for memory-mapped I/O, out[bwl] is for I/O ports. Well, actually `iowriteX' is generic and for MMIO you have `writeX'. See lib/iomap.c, the comment at the top and e.g. `iowrite8' there for an actual proof. Alpha may have an oddball implementation, but there you go. Drivers will assume they can do `iowriteX' to any kind of I/O resource, and ordering must be respected as per Documentation/memory-barriers.txt. Maciej
Re: [PATCH] add delay between port write and port read
On Tue, 19 Feb 2019, Arnd Bergmann wrote: > On Tue, Feb 19, 2019 at 2:44 PM Mikulas Patocka wrote: > > On Tue, 19 Feb 2019, Mikulas Patocka wrote: > > > > > The patches cd0e00c106722eca40b38ebf11cf134c01901086 and > > > 92d7223a74235054f2aa7227d207d9c57f84dca0 fix a theoretical issue where the > > > code didn't follow the specification. Unfortunatelly, they also reduce > > > timing when port write is followed by a port read. > > > > > > These reduced timing cause hang on boot on the Avanti platform when > > > probing serial ports. This patch adds memory barrier after the outb, outw, > > > outl functions, so that there is delay between port write and subsequent > > > port read - just like before. > > > > > > Fixes: cd0e00c10672 ("alpha: io: reorder barriers to guarantee writeX() > > > and iowriteX() ordering") > > > Cc: sta...@vger.kernel.org# v4.17+ > > > > you can also add: > > > > Tested-by: Mikulas Patocka > > Acked-by: Arnd Bergmann > > but I notice you are missing Signed-off-by. I forgot about it. You made that patch, so there should be your signed-off too. Signed-off-by: Mikulas Patocka > We clearly need this patch, but I assumed the alpha maintainers would pick > it up, not me. I merged the original changes since they were > cross-architecture, > but I don't normally take patches for a particular architecture through the > asm-generic tree (or the soc tree for that matter). > > Arnd >
Re: [PATCH] add delay between port write and port read
On Tue, 19 Feb 2019, Maciej W. Rozycki wrote: > On Tue, 19 Feb 2019, Arnd Bergmann wrote: > > > We clearly need this patch, but I assumed the alpha maintainers would pick > > it up, not me. I merged the original changes since they were > > cross-architecture, > > but I don't normally take patches for a particular architecture through the > > asm-generic tree (or the soc tree for that matter). > > As I say, shouldn't the barrier be in `iowriteX' instead? I don't > suppose these are allowed to be weakly ordered. > > Maciej iowriteX is for memory-mapped I/O, out[bwl] is for I/O ports. If someone finds a problem with memory-mapped device, he can add the barier there. Mikulas
Re: [PATCH] add delay between port write and port read
On Tue, 19 Feb 2019, Arnd Bergmann wrote: > We clearly need this patch, but I assumed the alpha maintainers would pick > it up, not me. I merged the original changes since they were > cross-architecture, > but I don't normally take patches for a particular architecture through the > asm-generic tree (or the soc tree for that matter). As I say, shouldn't the barrier be in `iowriteX' instead? I don't suppose these are allowed to be weakly ordered. Maciej
Re: ext4 corruption on alpha with 4.20.0-09062-gd8372ba8ce28
On Tue, Feb 19, 2019 at 02:20:26PM +0100, Jan Kara wrote: > Thanks for information. Yeah, that makes somewhat more sense. Can you ever > see the failure if you disable CONFIG_TRANSPARENT_HUGEPAGE? Because your > findings still seem to indicate that there' some problem with page > migration and Alpha (added MM list to CC). Could https://lore.kernel.org/linux-mm/20190219123212.29838-1-lar...@axis.com/T/#u be relevant?
Re: [PATCH] add delay between port write and port read
On Tue, Feb 19, 2019 at 2:44 PM Mikulas Patocka wrote: > On Tue, 19 Feb 2019, Mikulas Patocka wrote: > > > The patches cd0e00c106722eca40b38ebf11cf134c01901086 and > > 92d7223a74235054f2aa7227d207d9c57f84dca0 fix a theoretical issue where the > > code didn't follow the specification. Unfortunatelly, they also reduce > > timing when port write is followed by a port read. > > > > These reduced timing cause hang on boot on the Avanti platform when > > probing serial ports. This patch adds memory barrier after the outb, outw, > > outl functions, so that there is delay between port write and subsequent > > port read - just like before. > > > > Fixes: cd0e00c10672 ("alpha: io: reorder barriers to guarantee writeX() and > > iowriteX() ordering") > > Cc: sta...@vger.kernel.org# v4.17+ > > you can also add: > > Tested-by: Mikulas Patocka Acked-by: Arnd Bergmann but I notice you are missing Signed-off-by. We clearly need this patch, but I assumed the alpha maintainers would pick it up, not me. I merged the original changes since they were cross-architecture, but I don't normally take patches for a particular architecture through the asm-generic tree (or the soc tree for that matter). Arnd
[PATCH] add delay between port write and port read
The patches cd0e00c106722eca40b38ebf11cf134c01901086 and 92d7223a74235054f2aa7227d207d9c57f84dca0 fix a theoretical issue where the code didn't follow the specification. Unfortunatelly, they also reduce timing when port write is followed by a port read. These reduced timing cause hang on boot on the Avanti platform when probing serial ports. This patch adds memory barrier after the outb, outw, outl functions, so that there is delay between port write and subsequent port read - just like before. Fixes: cd0e00c10672 ("alpha: io: reorder barriers to guarantee writeX() and iowriteX() ordering") Cc: sta...@vger.kernel.org # v4.17+ --- arch/alpha/kernel/io.c |6 ++ 1 file changed, 6 insertions(+) Index: linux-stable/arch/alpha/kernel/io.c === --- linux-stable.orig/arch/alpha/kernel/io.c2019-02-19 14:06:03.0 +0100 +++ linux-stable/arch/alpha/kernel/io.c 2019-02-19 14:12:29.0 +0100 @@ -78,16 +78,19 @@ u32 inl(unsigned long port) void outb(u8 b, unsigned long port) { iowrite8(b, ioport_map(port, 1)); + mb(); } void outw(u16 b, unsigned long port) { iowrite16(b, ioport_map(port, 2)); + mb(); } void outl(u32 b, unsigned long port) { iowrite32(b, ioport_map(port, 4)); + mb(); } EXPORT_SYMBOL(inb); @@ -336,6 +339,7 @@ void iowrite8_rep(void __iomem *port, co void outsb(unsigned long port, const void *src, unsigned long count) { iowrite8_rep(ioport_map(port, 1), src, count); + mb(); } EXPORT_SYMBOL(iowrite8_rep); @@ -376,6 +380,7 @@ void iowrite16_rep(void __iomem *port, c void outsw(unsigned long port, const void *src, unsigned long count) { iowrite16_rep(ioport_map(port, 2), src, count); + mb(); } EXPORT_SYMBOL(iowrite16_rep); @@ -408,6 +413,7 @@ void iowrite32_rep(void __iomem *port, c void outsl(unsigned long port, const void *src, unsigned long count) { iowrite32_rep(ioport_map(port, 4), src, count); + mb(); } EXPORT_SYMBOL(iowrite32_rep);
Re: Alpha Avanti broken by 9ce8654323d69273b4977f76f11c9e2d345ab130
On Tue, 21 Aug 2018, Arnd Bergmann wrote: > On Mon, Aug 20, 2018 at 11:42 PM Mikulas Patocka wrote: > > On Mon, 20 Aug 2018, Arnd Bergmann wrote: > > > > > On Mon, Aug 20, 2018 at 4:17 PM Mikulas Patocka > > > wrote: > > > > On Sun, 19 Aug 2018, ok...@codeaurora.org wrote: > > > > > > > > > +my new email > > > > > > > > > > On 2018-08-18 19:03, Arnd Bergmann wrote: > > > > > > On Sat, Aug 18, 2018 at 12:05 AM Maciej W. Rozycki > > > > > > > > > > > > > > I think we need to identify the driver that is failing. > > > > > > > > It also may be some timing issue. > > > > > > > > I observed that not every kernel with the patch > > > > 92d7223a74235054f2aa7227d207d9c57f84dca0 fails, some of them get stuck > > > > only at boot, some get stuck only at shutdown, some not at all. Although > > > > all the kernels with this patch reverted work. > > > > > > > > So the patch may have uncovered some timing problem somewhere. > > > > > > > > x86 has the function io_delay that injects delays between I/O accesses > > > > for > > > > hardware that needs it - does alpha have something like this? > > > > > > The I/O delay would be very low on my list of possible root causes > > > for this, hardly any hardware at all relies on it, and all uses I see > > > are related to outb(), which you've already shown not to be the problem > > > with my test patch. > > > > > The lockup happens somewhere in the function autoconfig in > > drivers/tty/serial/8250/8250_port.c, but I don't know where exactly > > because serial console doesn't work while the port is being probed. > > > > When I use console on a graphics card, the lockup doesn't happen. > > Ok, this does strongly suggest that it is the outb() operation that I > suspected after all, I just sent you a wrong patch to test, failing > to realize that alpha has two implementations of outb, and that the > extern one is the one that gets used in a defconfig build. > > Could you try again with this patch added in? (Sorry for the whitespace > damage, you'll have to apply it by hand). Presumably a wmb() > is sufficient here, but I'm trying to play safe here by restoring the > barrier that was part of outb() before it broke. > >Arnd Hi Will you commit this patch? The avanti platform is still broken in the kernel 5.0 and I tested that this patch fixes it. Mikulas > diff --git a/arch/alpha/kernel/io.c b/arch/alpha/kernel/io.c > index c025a3e5e357..604237fa821f 100644 > --- a/arch/alpha/kernel/io.c > +++ b/arch/alpha/kernel/io.c > @@ -78,16 +78,19 @@ u32 inl(unsigned long port) > void outb(u8 b, unsigned long port) > { > iowrite8(b, ioport_map(port, 1)); > + mb(); > } > > void outw(u16 b, unsigned long port) > { > iowrite16(b, ioport_map(port, 2)); > + mb(); > } > > void outl(u32 b, unsigned long port) > { > iowrite32(b, ioport_map(port, 4)); > + mb(); > } > > EXPORT_SYMBOL(inb); > @@ -336,6 +339,7 @@ void iowrite8_rep(void __iomem *port, const void > *xsrc, unsigned long count) > void outsb(unsigned long port, const void *src, unsigned long count) > { > iowrite8_rep(ioport_map(port, 1), src, count); > + mb(); > } > > EXPORT_SYMBOL(iowrite8_rep); > @@ -376,6 +380,7 @@ void iowrite16_rep(void __iomem *port, const void > *src, unsigned long count) > void outsw(unsigned long port, const void *src, unsigned long count) > { > iowrite16_rep(ioport_map(port, 2), src, count); > + mb(); > } > > EXPORT_SYMBOL(iowrite16_rep); > @@ -408,6 +413,7 @@ void iowrite32_rep(void __iomem *port, const void > *src, unsigned long count) > void outsl(unsigned long port, const void *src, unsigned long count) > { > iowrite32_rep(ioport_map(port, 4), src, count); > + mb(); > } > > EXPORT_SYMBOL(iowrite32_rep); >
Re: ext4 corruption on alpha with 4.20.0-09062-gd8372ba8ce28
On Tue 19-02-19 14:17:09, Meelis Roos wrote: > > > > > The result of the bisection is > > > > > [88dbcbb3a4847f5e6dfeae952d3105497700c128] blkdev: avoid migration > > > > > stalls for blkdev pages > > > > > > > > > > Is that result relevant for the problem or should I continue > > > > > bisecting between 4.20.0 and the so far first bad commit? > > > > > > > > Can you try reverting the commit and see if it makes the problem go > > > > away? > > > > > > Tried reverting it on top of 5.0.0-rc6-00153-g5ded5871030e and it seems > > > to make the kernel work - emerge --sync succeeded. > There is more to it. > > After running 5.0.0-rc6-00153-g5ded5871030e-dirty (with the revert of > that patch) successfully for Gentoo update, I upgraded the kernel to > 5.0.0-rc7-00011-gb5372fe5dc84-dirty (todays git + revert of this patch) > and it broke on rsync again: > > RepoStorageException: command exited with status -6: rsync -a --link-dest > /usr/portage --exclude=/distfiles --exclude=/local --exclude=/lost+found > --exclude=/packages --exclude /.tmp-unverified-download-quarantine > /usr/portage/ /usr/portage/.tmp-unverified-download-quarantine/ > > Nothing in dmesg. > > This means the real root reason is somewhere deeper and reverting this > commit just made it less likely to happen. Thanks for information. Yeah, that makes somewhat more sense. Can you ever see the failure if you disable CONFIG_TRANSPARENT_HUGEPAGE? Because your findings still seem to indicate that there' some problem with page migration and Alpha (added MM list to CC). Honza -- Jan Kara SUSE Labs, CR
Re: ext4 corruption on alpha with 4.20.0-09062-gd8372ba8ce28
The result of the bisection is [88dbcbb3a4847f5e6dfeae952d3105497700c128] blkdev: avoid migration stalls for blkdev pages Is that result relevant for the problem or should I continue bisecting between 4.20.0 and the so far first bad commit? Can you try reverting the commit and see if it makes the problem go away? Tried reverting it on top of 5.0.0-rc6-00153-g5ded5871030e and it seems to make the kernel work - emerge --sync succeeded. There is more to it. After running 5.0.0-rc6-00153-g5ded5871030e-dirty (with the revert of that patch) successfully for Gentoo update, I upgraded the kernel to 5.0.0-rc7-00011-gb5372fe5dc84-dirty (todays git + revert of this patch) and it broke on rsync again: RepoStorageException: command exited with status -6: rsync -a --link-dest /usr/portage --exclude=/distfiles --exclude=/local --exclude=/lost+found --exclude=/packages --exclude /.tmp-unverified-download-quarantine /usr/portage/ /usr/portage/.tmp-unverified-download-quarantine/ Nothing in dmesg. This means the real root reason is somewhere deeper and reverting this commit just made it less likely to happen. -- Meelis Roos
Re: [PATCH v4 0/3] locking/rwsem: Rwsem rearchitecture part 0
On Fri, Feb 15, 2019 at 01:58:34PM -0500, Waiman Long wrote: > On 02/15/2019 01:40 PM, Will Deacon wrote: > > On Thu, Feb 14, 2019 at 11:37:15AM +0100, Peter Zijlstra wrote: > >> On Wed, Feb 13, 2019 at 05:00:14PM -0500, Waiman Long wrote: > >>> v4: > >>> - Remove rwsem-spinlock.c and make all archs use rwsem-xadd.c. > >>> > >>> v3: > >>> - Optimize __down_read_trylock() for the uncontended case as suggested > >>>by Linus. > >>> > >>> v2: > >>> - Add patch 2 to optimize __down_read_trylock() as suggested by PeterZ. > >>> - Update performance test data in patch 1. > >>> > >>> The goal of this patchset is to remove the architecture specific files > >>> for rwsem-xadd to make it easer to add enhancements in the later rwsem > >>> patches. It also removes the legacy rwsem-spinlock.c file and make all > >>> the architectures use one single implementation of rwsem - rwsem-xadd.c. > >>> > >>> Waiman Long (3): > >>> locking/rwsem: Remove arch specific rwsem files > >>> locking/rwsem: Remove rwsem-spinlock.c & use rwsem-xadd.c for all > >>> archs > >>> locking/rwsem: Optimize down_read_trylock() > >> Acked-by: Peter Zijlstra (Intel) > >> > >> with the caveat that I'm happy to exchange patch 3 back to my earlier > >> suggestion in case Will expesses concerns wrt the ARM64 performance of > >> Linus' suggestion. > > Right, the current proposal doesn't work well for us, unfortunately. Which > > was your earlier suggestion? > > > > Will > > In my posting yesterday, I showed that most of the trylocks done were > actually uncontended. Assuming that pattern hold for the most of the > workloads, it will not that bad after all. That's fair enough; if you're going to sit in a tight trylock() loop like the benchmark does, then you're much better off just calling lock() if you care at all about scalability. Will
[PATCH 4.19 70/85] alpha: fix page fault handling for r16-r18 targets
4.19-stable review patch. If anyone has any objections, please let me know. -- From: Sergei Trofimovich commit 491af60ffb848b59e82f7c9145833222e0bf27a5 upstream. Fix page fault handling code to fixup r16-r18 registers. Before the patch code had off-by-two registers bug. This bug caused overwriting of ps,pc,gp registers instead of fixing intended r16,r17,r18 (see `struct pt_regs`). More details: Initially Dmitry noticed a kernel bug as a failure on strace test suite. Test passes unmapped userspace pointer to io_submit: ```c #include #include #include #include int main(void) { unsigned long ctx = 0; if (syscall(__NR_io_setup, 1, )) err(1, "io_setup"); const size_t page_size = sysconf(_SC_PAGESIZE); const size_t size = page_size * 2; void *ptr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); if (MAP_FAILED == ptr) err(1, "mmap(%zu)", size); if (munmap(ptr, size)) err(1, "munmap"); syscall(__NR_io_submit, ctx, 1, ptr + page_size); syscall(__NR_io_destroy, ctx); return 0; } ``` Running this test causes kernel to crash when handling page fault: ``` Unable to handle kernel paging request at virtual address 9468 CPU 3 aio(26027): Oops 0 pc = [] ra = [] ps = Not tainted pc is at sys_io_submit+0x108/0x200 ra is at sys_io_submit+0x6c/0x200 v0 = fc00c58e6300 t0 = fff2 t1 = 0225e000 t2 = fc01f159fef8 t3 = fc0001009640 t4 = fce0f6e0 t5 = 020001002e9e t6 = 4c41564e49452031 t7 = fc01f159c000 s0 = 0002 s1 = 0225e000 s2 = s3 = s4 = s5 = fff2 s6 = fc00c58e6300 a0 = fc00c58e6300 a1 = a2 = 0225e000 a3 = 021ac260 a4 = 021ac1e8 a5 = 0001 t8 = 0008 t9 = 00011f8bce30 t10= 021ac440 t11= pv = fc6fd320 at = gp = sp = 265fd174 Disabling lock debugging due to kernel taint Trace: [] entSys+0xa4/0xc0 ``` Here `gp` has invalid value. `gp is s overwritten by a fixup for the following page fault handler in `io_submit` syscall handler: ``` __se_sys_io_submit ... ldq a1,0(t1) bne t0,4280 <__se_sys_io_submit+0x180> ``` After a page fault `t0` should contain -EFALUT and `a1` is 0. Instead `gp` was overwritten in place of `a1`. This happens due to a off-by-two bug in `dpf_reg()` for `r16-r18` (aka `a0-a2`). I think the bug went unnoticed for a long time as `gp` is one of scratch registers. Any kernel function call would re-calculate `gp`. Dmitry tracked down the bug origin back to 2.1.32 kernel version where trap_a{0,1,2} fields were inserted into struct pt_regs. And even before that `dpf_reg()` contained off-by-one error. Cc: Richard Henderson Cc: Ivan Kokshaysky Cc: linux-alpha@vger.kernel.org Cc: linux-ker...@vger.kernel.org Reported-and-reviewed-by: "Dmitry V. Levin" Cc: sta...@vger.kernel.org # v2.1.32+ Bug: https://bugs.gentoo.org/672040 Signed-off-by: Sergei Trofimovich Signed-off-by: Matt Turner Signed-off-by: Greg Kroah-Hartman --- arch/alpha/mm/fault.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) --- a/arch/alpha/mm/fault.c +++ b/arch/alpha/mm/fault.c @@ -78,7 +78,7 @@ __load_new_mm_context(struct mm_struct * /* Macro for exception fixup code to access integer registers. */ #define dpf_reg(r) \ (((unsigned long *)regs)[(r) <= 8 ? (r) : (r) <= 15 ? (r)-16 : \ -(r) <= 18 ? (r)+8 : (r)-10]) +(r) <= 18 ? (r)+10 : (r)-10]) asmlinkage void do_page_fault(unsigned long address, unsigned long mmcsr,
[PATCH 4.14 50/62] alpha: fix page fault handling for r16-r18 targets
4.14-stable review patch. If anyone has any objections, please let me know. -- From: Sergei Trofimovich commit 491af60ffb848b59e82f7c9145833222e0bf27a5 upstream. Fix page fault handling code to fixup r16-r18 registers. Before the patch code had off-by-two registers bug. This bug caused overwriting of ps,pc,gp registers instead of fixing intended r16,r17,r18 (see `struct pt_regs`). More details: Initially Dmitry noticed a kernel bug as a failure on strace test suite. Test passes unmapped userspace pointer to io_submit: ```c #include #include #include #include int main(void) { unsigned long ctx = 0; if (syscall(__NR_io_setup, 1, )) err(1, "io_setup"); const size_t page_size = sysconf(_SC_PAGESIZE); const size_t size = page_size * 2; void *ptr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); if (MAP_FAILED == ptr) err(1, "mmap(%zu)", size); if (munmap(ptr, size)) err(1, "munmap"); syscall(__NR_io_submit, ctx, 1, ptr + page_size); syscall(__NR_io_destroy, ctx); return 0; } ``` Running this test causes kernel to crash when handling page fault: ``` Unable to handle kernel paging request at virtual address 9468 CPU 3 aio(26027): Oops 0 pc = [] ra = [] ps = Not tainted pc is at sys_io_submit+0x108/0x200 ra is at sys_io_submit+0x6c/0x200 v0 = fc00c58e6300 t0 = fff2 t1 = 0225e000 t2 = fc01f159fef8 t3 = fc0001009640 t4 = fce0f6e0 t5 = 020001002e9e t6 = 4c41564e49452031 t7 = fc01f159c000 s0 = 0002 s1 = 0225e000 s2 = s3 = s4 = s5 = fff2 s6 = fc00c58e6300 a0 = fc00c58e6300 a1 = a2 = 0225e000 a3 = 021ac260 a4 = 021ac1e8 a5 = 0001 t8 = 0008 t9 = 00011f8bce30 t10= 021ac440 t11= pv = fc6fd320 at = gp = sp = 265fd174 Disabling lock debugging due to kernel taint Trace: [] entSys+0xa4/0xc0 ``` Here `gp` has invalid value. `gp is s overwritten by a fixup for the following page fault handler in `io_submit` syscall handler: ``` __se_sys_io_submit ... ldq a1,0(t1) bne t0,4280 <__se_sys_io_submit+0x180> ``` After a page fault `t0` should contain -EFALUT and `a1` is 0. Instead `gp` was overwritten in place of `a1`. This happens due to a off-by-two bug in `dpf_reg()` for `r16-r18` (aka `a0-a2`). I think the bug went unnoticed for a long time as `gp` is one of scratch registers. Any kernel function call would re-calculate `gp`. Dmitry tracked down the bug origin back to 2.1.32 kernel version where trap_a{0,1,2} fields were inserted into struct pt_regs. And even before that `dpf_reg()` contained off-by-one error. Cc: Richard Henderson Cc: Ivan Kokshaysky Cc: linux-alpha@vger.kernel.org Cc: linux-ker...@vger.kernel.org Reported-and-reviewed-by: "Dmitry V. Levin" Cc: sta...@vger.kernel.org # v2.1.32+ Bug: https://bugs.gentoo.org/672040 Signed-off-by: Sergei Trofimovich Signed-off-by: Matt Turner Signed-off-by: Greg Kroah-Hartman --- arch/alpha/mm/fault.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) --- a/arch/alpha/mm/fault.c +++ b/arch/alpha/mm/fault.c @@ -78,7 +78,7 @@ __load_new_mm_context(struct mm_struct * /* Macro for exception fixup code to access integer registers. */ #define dpf_reg(r) \ (((unsigned long *)regs)[(r) <= 8 ? (r) : (r) <= 15 ? (r)-16 : \ -(r) <= 18 ? (r)+8 : (r)-10]) +(r) <= 18 ? (r)+10 : (r)-10]) asmlinkage void do_page_fault(unsigned long address, unsigned long mmcsr,
[PATCH 4.9 44/58] alpha: fix page fault handling for r16-r18 targets
4.9-stable review patch. If anyone has any objections, please let me know. -- From: Sergei Trofimovich commit 491af60ffb848b59e82f7c9145833222e0bf27a5 upstream. Fix page fault handling code to fixup r16-r18 registers. Before the patch code had off-by-two registers bug. This bug caused overwriting of ps,pc,gp registers instead of fixing intended r16,r17,r18 (see `struct pt_regs`). More details: Initially Dmitry noticed a kernel bug as a failure on strace test suite. Test passes unmapped userspace pointer to io_submit: ```c #include #include #include #include int main(void) { unsigned long ctx = 0; if (syscall(__NR_io_setup, 1, )) err(1, "io_setup"); const size_t page_size = sysconf(_SC_PAGESIZE); const size_t size = page_size * 2; void *ptr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); if (MAP_FAILED == ptr) err(1, "mmap(%zu)", size); if (munmap(ptr, size)) err(1, "munmap"); syscall(__NR_io_submit, ctx, 1, ptr + page_size); syscall(__NR_io_destroy, ctx); return 0; } ``` Running this test causes kernel to crash when handling page fault: ``` Unable to handle kernel paging request at virtual address 9468 CPU 3 aio(26027): Oops 0 pc = [] ra = [] ps = Not tainted pc is at sys_io_submit+0x108/0x200 ra is at sys_io_submit+0x6c/0x200 v0 = fc00c58e6300 t0 = fff2 t1 = 0225e000 t2 = fc01f159fef8 t3 = fc0001009640 t4 = fce0f6e0 t5 = 020001002e9e t6 = 4c41564e49452031 t7 = fc01f159c000 s0 = 0002 s1 = 0225e000 s2 = s3 = s4 = s5 = fff2 s6 = fc00c58e6300 a0 = fc00c58e6300 a1 = a2 = 0225e000 a3 = 021ac260 a4 = 021ac1e8 a5 = 0001 t8 = 0008 t9 = 00011f8bce30 t10= 021ac440 t11= pv = fc6fd320 at = gp = sp = 265fd174 Disabling lock debugging due to kernel taint Trace: [] entSys+0xa4/0xc0 ``` Here `gp` has invalid value. `gp is s overwritten by a fixup for the following page fault handler in `io_submit` syscall handler: ``` __se_sys_io_submit ... ldq a1,0(t1) bne t0,4280 <__se_sys_io_submit+0x180> ``` After a page fault `t0` should contain -EFALUT and `a1` is 0. Instead `gp` was overwritten in place of `a1`. This happens due to a off-by-two bug in `dpf_reg()` for `r16-r18` (aka `a0-a2`). I think the bug went unnoticed for a long time as `gp` is one of scratch registers. Any kernel function call would re-calculate `gp`. Dmitry tracked down the bug origin back to 2.1.32 kernel version where trap_a{0,1,2} fields were inserted into struct pt_regs. And even before that `dpf_reg()` contained off-by-one error. Cc: Richard Henderson Cc: Ivan Kokshaysky Cc: linux-alpha@vger.kernel.org Cc: linux-ker...@vger.kernel.org Reported-and-reviewed-by: "Dmitry V. Levin" Cc: sta...@vger.kernel.org # v2.1.32+ Bug: https://bugs.gentoo.org/672040 Signed-off-by: Sergei Trofimovich Signed-off-by: Matt Turner Signed-off-by: Greg Kroah-Hartman --- arch/alpha/mm/fault.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) --- a/arch/alpha/mm/fault.c +++ b/arch/alpha/mm/fault.c @@ -77,7 +77,7 @@ __load_new_mm_context(struct mm_struct * /* Macro for exception fixup code to access integer registers. */ #define dpf_reg(r) \ (((unsigned long *)regs)[(r) <= 8 ? (r) : (r) <= 15 ? (r)-16 : \ -(r) <= 18 ? (r)+8 : (r)-10]) +(r) <= 18 ? (r)+10 : (r)-10]) asmlinkage void do_page_fault(unsigned long address, unsigned long mmcsr,