Re: Kernel 4.17.4 lockup

2018-07-23 Thread H.J. Lu
On Fri, Jul 20, 2018 at 2:35 PM, Andy Lutomirski wrote: > >> On Jul 16, 2018, at 6:05 AM, H.J. Lu wrote: >> >>> On Fri, Jul 13, 2018 at 7:08 PM, Andy Lutomirski >>> wrote: >>> I'm not at all convinced that this is the problem, but the series here >>> will give a better diagnostic if the issue

Re: Kernel 4.17.4 lockup

2018-07-20 Thread Andy Lutomirski
> On Jul 16, 2018, at 6:05 AM, H.J. Lu wrote: > >> On Fri, Jul 13, 2018 at 7:08 PM, Andy Lutomirski wrote: >> I'm not at all convinced that this is the problem, but the series here >> will give a better diagnostic if the issue really is an IRQ stack >> overflow: >> >>

Re: Kernel 4.17.4 lockup

2018-07-16 Thread H.J. Lu
On Fri, Jul 13, 2018 at 7:08 PM, Andy Lutomirski wrote: > I'm not at all convinced that this is the problem, but the series here > will give a better diagnostic if the issue really is an IRQ stack > overflow: > >

Re: Kernel 4.17.4 lockup

2018-07-13 Thread Andy Lutomirski
I'm not at all convinced that this is the problem, but the series here will give a better diagnostic if the issue really is an IRQ stack overflow: https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/log/?h=x86/guard_pages (link currently broken. should work soon.)

Re: Kernel 4.17.4 lockup

2018-07-12 Thread H.J. Lu
On Thu, Jul 12, 2018 at 7:44 AM, H.J. Lu wrote: > On Wed, Jul 11, 2018 at 4:14 PM, Dave Hansen wrote: >> On 07/11/2018 04:07 PM, Andy Lutomirski wrote: >>> Could the cause be an overflow of the IRQ stack? I’ve been meaning >>> to put guard pages on all the special stacks for a while. Let me see

Re: Kernel 4.17.4 lockup

2018-07-12 Thread H.J. Lu
On Wed, Jul 11, 2018 at 4:14 PM, Dave Hansen wrote: > On 07/11/2018 04:07 PM, Andy Lutomirski wrote: >> Could the cause be an overflow of the IRQ stack? I’ve been meaning >> to put guard pages on all the special stacks for a while. Let me see >> if I can do that in the next couple days. > > But

Re: Kernel 4.17.4 lockup

2018-07-11 Thread Dave Hansen
On 07/11/2018 04:07 PM, Andy Lutomirski wrote: > Could the cause be an overflow of the IRQ stack? I’ve been meaning > to put guard pages on all the special stacks for a while. Let me see > if I can do that in the next couple days. But what would that overflow into? Wouldn't it most likely be

Re: Kernel 4.17.4 lockup

2018-07-11 Thread Andy Lutomirski
> On Jul 11, 2018, at 11:31 AM, Dave Jones wrote: > >> On Wed, Jul 11, 2018 at 10:50:22AM -0700, Dave Hansen wrote: >> On 07/11/2018 10:29 AM, H.J. Lu wrote: I have seen it on machines with various amounts of cores and RAMs. It triggers the fastest on 8 cores with 6GB RAM reliably.

Re: Kernel 4.17.4 lockup

2018-07-11 Thread Dave Jones
On Wed, Jul 11, 2018 at 10:50:22AM -0700, Dave Hansen wrote: > On 07/11/2018 10:29 AM, H.J. Lu wrote: > >> I have seen it on machines with various amounts of cores and RAMs. > >> It triggers the fastest on 8 cores with 6GB RAM reliably. > > Here is the first kernel message. > > This looks

Re: Kernel 4.17.4 lockup

2018-07-11 Thread Dave Hansen
On 07/11/2018 10:29 AM, H.J. Lu wrote: >> I have seen it on machines with various amounts of cores and RAMs. >> It triggers the fastest on 8 cores with 6GB RAM reliably. > Here is the first kernel message. This looks like random corruption again. It's probably a bogus 'struct page' that fails

Re: Kernel 4.17.4 lockup

2018-07-11 Thread H.J. Lu
On Wed, Jul 11, 2018 at 10:43 AM, Dave Hansen wrote: > On 07/11/2018 10:29 AM, H.J. Lu wrote: >>> I have seen it on machines with various amounts of cores and RAMs. >>> It triggers the fastest on 8 cores with 6GB RAM reliably. >> Here is the first kernel message. > > Does it trigger better with

Re: Kernel 4.17.4 lockup

2018-07-11 Thread Dave Hansen
On 07/11/2018 10:29 AM, H.J. Lu wrote: >> I have seen it on machines with various amounts of cores and RAMs. >> It triggers the fastest on 8 cores with 6GB RAM reliably. > Here is the first kernel message. Does it trigger better with more RAM or less?

Re: Kernel 4.17.4 lockup

2018-07-11 Thread Andy Lutomirski
On Wed, Jul 11, 2018 at 10:29 AM, H.J. Lu wrote: > On Wed, Jul 11, 2018 at 9:53 AM, H.J. Lu wrote: >> On Wed, Jul 11, 2018 at 9:49 AM, Dave Hansen wrote: >>> On 07/11/2018 09:29 AM, H.J. Lu wrote: >> # It takes about 3 hour to bootstrap x86-64 GCC and 3 hour to run tests, >> TIMEOUT=480

Re: Kernel 4.17.4 lockup

2018-07-11 Thread H.J. Lu
On Wed, Jul 11, 2018 at 9:49 AM, Dave Hansen wrote: > On 07/11/2018 09:29 AM, H.J. Lu wrote: # It takes about 3 hour to bootstrap x86-64 GCC and 3 hour to run tests, TIMEOUT=480 # Run it every hour, 30 * * * * /export/gnu/import/git/gcc-test-x32/gcc-build -mx32 --with-pic

Re: Kernel 4.17.4 lockup

2018-07-11 Thread Dave Hansen
On 07/11/2018 09:29 AM, H.J. Lu wrote: >>> # It takes about 3 hour to bootstrap x86-64 GCC and 3 hour to run tests, >>> TIMEOUT=480 >>> # Run it every hour, >>> 30 * * * * /export/gnu/import/git/gcc-test-x32/gcc-build -mx32 >>> --with-pic > /dev/null 2>&1 >> Oh, fun, one of those. >> >> How long

Re: Kernel 4.17.4 lockup

2018-07-11 Thread H.J. Lu
On Wed, Jul 11, 2018 at 9:24 AM, Dave Hansen wrote: > On 07/11/2018 08:40 AM, H.J. Lu wrote: >> This is a quad-core machine with HT and 6 GB RAM. The workload is >> x32 GCC build and test with "make -j8". The bug is triggered during GCC >> test after a couple hours. I have a script to set up

Re: Kernel 4.17.4 lockup

2018-07-11 Thread Dave Hansen
On 07/11/2018 08:40 AM, H.J. Lu wrote: > This is a quad-core machine with HT and 6 GB RAM. The workload is > x32 GCC build and test with "make -j8". The bug is triggered during GCC > test after a couple hours. I have a script to set up my workload: > >

Re: Kernel 4.17.4 lockup

2018-07-11 Thread H.J. Lu
On Wed, Jul 11, 2018 at 8:13 AM, Dave Hansen wrote: > On 07/11/2018 07:56 AM, H.J. Lu wrote: >> On Mon, Jul 9, 2018 at 8:47 PM, Dave Hansen wrote: >>> On 07/09/2018 07:14 PM, H.J. Lu wrote: > I'd really want to see this reproduced without KASLR to make the oops > easier to read. It

Re: Kernel 4.17.4 lockup

2018-07-11 Thread Dave Hansen
On 07/11/2018 07:56 AM, H.J. Lu wrote: > On Mon, Jul 9, 2018 at 8:47 PM, Dave Hansen wrote: >> On 07/09/2018 07:14 PM, H.J. Lu wrote: I'd really want to see this reproduced without KASLR to make the oops easier to read. It would also be handy to try your workload with all the

Re: Kernel 4.17.4 lockup

2018-07-09 Thread Dave Hansen
On 07/09/2018 07:14 PM, H.J. Lu wrote: >> I'd really want to see this reproduced without KASLR to make the oops >> easier to read. It would also be handy to try your workload with all >> the pedantic debugging: KASAN, slab debugging, DEBUG_PAGE_ALLOC, etc... >> and see if it still triggers. > How

Re: Kernel 4.17.4 lockup

2018-07-09 Thread H.J. Lu
On Mon, Jul 9, 2018 at 5:44 PM, Dave Hansen wrote: > ... cc'ing a few folks who I know have been looking at this code > lately. The full oops is below if any of you want to take a look. > > OK, well, annotating the disassembly a bit: > >> (gdb) disass free_pages_and_swap_cache >> Dump of

Re: Kernel 4.17.4 lockup

2018-07-09 Thread Dave Hansen
... cc'ing a few folks who I know have been looking at this code lately. The full oops is below if any of you want to take a look. OK, well, annotating the disassembly a bit: > (gdb) disass free_pages_and_swap_cache > Dump of assembler code for function free_pages_and_swap_cache: >

Re: Kernel 4.17.4 lockup

2018-07-09 Thread H.J. Lu
On Mon, Jul 9, 2018 at 7:54 AM, Dave Hansen wrote: > On 07/09/2018 06:19 AM, Lu, Hongjiu wrote: >> On 3 x86-64 machines, kernel 4.17.4 locked up under heavy load. 2 of > them don't have any kernel messages. One has > > Hi H.J., > > It'd be really handy if you could pastebin things like this, or

Re: Kernel 4.17.4 lockup

2018-07-09 Thread Dave Hansen
On 07/09/2018 06:19 AM, Lu, Hongjiu wrote: > On 3 x86-64 machines, kernel 4.17.4 locked up under heavy load. 2 of them don't have any kernel messages. One has Hi H.J., It'd be really handy if you could pastebin things like this, or attach a text file with the oops. Your email wrapped the heck

Kernel 4.17.4 lockup

2018-07-08 Thread H.J. Lu
On 3 x86-64 machines, kernel 4.17.4 locked up under heavy load. 2 of them don't have any kernel messages. One has Jul 05 14:33:32 gnu-hsw-1.sc.intel.com kernel: general protection fault: [#1] SMP PTI Jul 05 14:33:32 gnu-hsw-1.sc.intel.com kernel: Modules linked in: rpcsec_gss_krb5 nfsv4
