Re: svn commit: r333703 - head/sys/vm

2018-05-17 Thread Andrew Gallatin

On 05/17/18 14:07, Mark Johnston wrote:
> On Thu, May 17, 2018 at 10:07:34AM -0700, Conrad Meyer wrote:
> > On Wed, May 16, 2018 at 9:27 PM, Mark Johnston  wrote:
> > > Author: markj
> > > Date: Thu May 17 04:27:08 2018
> > > New Revision: 333703
> > > URL: https://svnweb.freebsd.org/changeset/base/333703
> > >
> > > Log:
> > >   Fix a race in vm_page_pagequeue_lockptr().
> > >
> > >   The value of m->queue must be cached after comparing it with PQ_NONE,
> > >   since it may be concurrently changing.
> > >
> > >   Reported by:  glebius
> > 
> > What were the symptoms of this issue?  The test plan in the linked
> > phabricator revision says:
> > 
> > "Gleb reported seeing panics as a result of the use of a bogus index
> > into the pagequeue array, and also reported that this patch fixed the
> > panics."
> > 
> > So an attempt to lock pagequeues[PQ_NONE=255].pq_mutex, which is
> > either something later in the vm_domain object, or bogus memory?  One
> > of the mtx asserts trips?
> 
> I think it was "mtx_lock() of spin mutex"; I didn't get a lot of
> details.
> 
> I failed to note in the commit message that this race was introduced in
> r332974.

The most common stack was:

panic: mtx_lock() of spin mutex (null) @ /data/ocafirmware.alt/FreeBSD/sys/vm/vm_page.c:3344
cpuid = 4
time = 1526415167
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfe158af62380
vpanic() at vpanic+0x1a3/frame 0xfe158af623e0
doadump() at doadump/frame 0xfe158af62460
__mtx_lock_flags() at __mtx_lock_flags+0x11a/frame 0xfe158af624a0
vm_page_dequeue() at vm_page_dequeue+0x8a/frame 0xfe158af624e0
vm_page_alloc_domain_after() at vm_page_alloc_domain_after+0x2cb/frame 0xfe158af62560
vm_page_grab_pages() at vm_page_grab_pages+0x274/frame 0xfe158af62610
vn_sendfile() at vn_sendfile+0x83a/frame 0xfe158af628e0
[Tue May 15 20:12:48 2018]sys_sendfile() at sys_sendfile+0x119/frame 0xfe158af62980
amd64_syscall() at amd64_syscall+0x298/frame 0xfe158af62ab0
fast_syscall_common() at fast_syscall_common+0x101/frame 0xfe158af62ab0

I once saw one like this:

Fatal trap 9: general protection fault while in kernel mode
cpuid = 0; apic id = 00
instruction pointer = 0x20:0x8088bf74
stack pointer       = 0x28:0xfe55af7712e0
frame pointer       = 0x28:0xfe55af771330
code segment        = base 0x0, limit 0xf, type 0x1b
                    = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags    = interrupt enabled, resume, IOPL = 0
current process     = 12 (irq446: mlx5_core0)
[Mon May 14 04:45:10 2018]trap number = 9
panic: general protection fault
cpuid = 0
time = 1526273109
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfe55af770ff0
vpanic() at vpanic+0x1a3/frame 0xfe55af771050
panic() at panic+0x43/frame 0xfe55af7710b0
trap_fatal() at trap_fatal+0x35f/frame 0xfe55af771100
trap() at trap+0x6d/frame 0xfe55af771210
[Mon May 14 04:45:10 2018]calltrap() at calltrap+0x8/frame 0xfe55af771210
--- trap 0x9, rip = 0x8088bf74, rsp = 0xfe55af7712e0, rbp = 0xfe55af771330 ---
vm_pqbatch_submit_page() at vm_pqbatch_submit_page+0x144/frame 0xfe55af771330
sendfile_free_page() at sendfile_free_page+0x10e/frame 0xfe55af771360
sendfile_free_mext_pg() at sendfile_free_mext_pg+0xb7/frame 0xfe55af7713b0
mb_free_ext() at mb_free_ext+0x103/frame 0xfe55af7713e0
m_freem() at m_freem+0x48/frame 0xfe55af771400
tcp_do_segment() at tcp_do_segment+0x1647/frame 0xfe55af771500
tcp_input_with_port() at tcp_input_with_port+0xfcc/frame 0xfe55af771650
tcp_input() at tcp_input+0xb/frame 0xfe55af771660
[Mon May 14 04:45:10 2018]ip_input() at ip_input+0xe9/frame 0xfe55af7716c0
netisr_dispatch_src() at netisr_dispatch_src+0xa8/frame 0xfe55af771710
ether_demux() at ether_demux+0x140/frame 0xfe55af771740
ether_nh_input() at ether_nh_input+0x32c/frame 0xfe55af7717a0
netisr_dispatch_src() at netisr_dispatch_src+0xa8/frame 0xfe55af7717f0
ether_input() at ether_input+0x26/frame 0xfe55af771810
tcp_lro_flush_all() at tcp_lro_flush_all+0xf2/frame 0xfe55af771850
mlx5e_rx_cq_comp() at mlx5e_rx_cq_comp+0x5e5/frame 0xfe55af771950
mlx5_cq_completion() at mlx5_cq_completion+0x73/frame 0xfe55af771990
<...>

Thanks again for fixing it so quickly!

Drew


Re: svn commit: r333703 - head/sys/vm

2018-05-17 Thread Mark Johnston
On Thu, May 17, 2018 at 10:07:34AM -0700, Conrad Meyer wrote:
> On Wed, May 16, 2018 at 9:27 PM, Mark Johnston  wrote:
> > Author: markj
> > Date: Thu May 17 04:27:08 2018
> > New Revision: 333703
> > URL: https://svnweb.freebsd.org/changeset/base/333703
> >
> > Log:
> >   Fix a race in vm_page_pagequeue_lockptr().
> >
> >   The value of m->queue must be cached after comparing it with PQ_NONE,
> >   since it may be concurrently changing.
> >
> >   Reported by:  glebius
> 
> What were the symptoms of this issue?  The test plan in the linked
> phabricator revision says:
> 
> "Gleb reported seeing panics as a result of the use of a bogus index
> into the pagequeue array, and also reported that this patch fixed the
> panics."
> 
> So an attempt to lock pagequeues[PQ_NONE=255].pq_mutex, which is
> either something later in the vm_domain object, or bogus memory?  One
> of the mtx asserts trips?

I think it was "mtx_lock() of spin mutex"; I didn't get a lot of
details.

I failed to note in the commit message that this race was introduced in
r332974.
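
For anyone trying to picture the race from that description: below is a rough,
standalone sketch of the pattern (simplified stand-ins, not the actual sys/vm
code, and not necessarily the exact shape of the r333703 diff).

/*
 * Toy model of the race: the queue index is a field that another CPU may
 * set to PQ_NONE at any time.  The "racy" helper reads it twice, so the
 * second read can observe PQ_NONE (255) after the first read already
 * passed the check; the "fixed" helper reads it once into a local and
 * only uses that cached copy.
 */
#include <stdint.h>
#include <stdio.h>

#define	PQ_INACTIVE	0
#define	PQ_ACTIVE	1
#define	PQ_LAUNDRY	2
#define	PQ_COUNT	3
#define	PQ_NONE		255

struct pagequeue {
	int	pq_mutex;		/* stand-in for the real spin mutex */
};

struct page {
	volatile uint8_t queue;		/* may be updated concurrently */
};

static struct pagequeue pagequeues[PQ_COUNT];

static int *
lockptr_racy(struct page *m)
{
	if (m->queue == PQ_NONE)	/* first read of m->queue */
		return (NULL);
	/* Second read: may now be PQ_NONE, indexing far past the array. */
	return (&pagequeues[m->queue].pq_mutex);
}

static int *
lockptr_fixed(struct page *m)
{
	uint8_t queue;

	if ((queue = m->queue) == PQ_NONE)	/* single cached read */
		return (NULL);
	return (&pagequeues[queue].pq_mutex);
}

int
main(void)
{
	struct page p = { .queue = PQ_ACTIVE };

	printf("racy:  %p\nfixed: %p\n",
	    (void *)lockptr_racy(&p), (void *)lockptr_fixed(&p));
	return (0);
}

With a single thread both helpers return the same pointer; the difference only
shows up when another CPU flips the queue field to PQ_NONE between the two
reads in the racy version.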


Re: svn commit: r333703 - head/sys/vm

2018-05-17 Thread Conrad Meyer
On Wed, May 16, 2018 at 9:27 PM, Mark Johnston  wrote:
> Author: markj
> Date: Thu May 17 04:27:08 2018
> New Revision: 333703
> URL: https://svnweb.freebsd.org/changeset/base/333703
>
> Log:
>   Fix a race in vm_page_pagequeue_lockptr().
>
>   The value of m->queue must be cached after comparing it with PQ_NONE,
>   since it may be concurrently changing.
>
>   Reported by:  glebius

What were the symptoms of this issue?  The test plan in the linked
phabricator revision says:

"Gleb reported seeing panics as a result of the use of a bogus index
into the pagequeue array, and also reported that this patch fixed the
panics."

So an attempt to lock pagequeues[PQ_NONE=255].pq_mutex, which is
either something later in the vm_domain object, or bogus memory?  One
of the mtx asserts trips?
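
(For concreteness, a toy calculation of where an index of 255 would land; the
struct names and sizes below are made up and not the real vm_domain or
vm_pagequeue layout:)

#include <stddef.h>
#include <stdio.h>

/* Made-up sizes; only the "index 255 is far out of bounds" point matters. */
struct toy_pagequeue { char pad[64]; };
struct toy_domain {
	struct toy_pagequeue vmd_pagequeues[4];	/* the real array has PQ_COUNT entries */
	char rest[4096];			/* whatever follows the array */
};

int
main(void)
{
	size_t off;

	/* Byte offset at which element 255 of the array would start. */
	off = offsetof(struct toy_domain, vmd_pagequeues) +
	    255 * sizeof(struct toy_pagequeue);
	printf("element 255 starts at byte %zu; sizeof(struct toy_domain) = %zu\n",
	    off, sizeof(struct toy_domain));
	return (0);
}

So with a stale PQ_NONE the computed "lock pointer" ends up well past the
per-domain queue array, pointing at whatever happens to live there.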

Thanks,
Conrad