Pawel Jakub Dawidek,
Before I spend time (if I do) reviewing this item, here
is what, in my experience, the slab allocator does when
memory is low. This is IMO and from my own experience
only, so your mileage may vary.

Hopefully this is just review. For people less familiar
with it, the underlying memory allocator is Jeff
Bonwick's slab allocator, and his paper is easy to find
if "googled".
First, just because you allocate with KM_SLEEP doesn't
mean that you will actually sleep. It means that when the
call returns you are guaranteed to have memory, and that
you must not be calling it from interrupt context.
The _alloc call is the front end of the allocator.
Second, running "low on memory" could simply mean that
most of the available memory is sitting in caches. The
question is whether any of the freed objects still have
references to them. Most of the time they have been freed
only to the front end and will be re-allocated on demand,
which avoids the overhead of returning them to their slab
and re-using them from there.
The allocator then checks the slab layer for any cached
objects and allocates from those. If the slab layer is
empty, it tries to get memory from a freelist. If the
freelist is empty too, the back end attempts to reclaim
memory, roughly as in the sketch below.
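
In rough pseudo-C the path looks something like this;
every type and helper name below is invented to show the
flow and does not exist in the real allocator:

typedef struct cache cache_t;

extern void *front_end_get(cache_t *);        /* per-CPU cached objects  */
extern void *slab_layer_alloc(cache_t *);     /* partially full slabs    */
extern void *freelist_grow(cache_t *);        /* fresh pages, new slab   */
extern void *backend_reclaim_and_retry(cache_t *, int);  /* may sleep    */

void *
cache_alloc_sketch(cache_t *cp, int kmflag)
{
        void *obj;

        /* Front end: hand back a previously freed, still-cached object. */
        if ((obj = front_end_get(cp)) != NULL)
                return (obj);

        /* Slab layer: carve an object out of a partially full slab. */
        if ((obj = slab_layer_alloc(cp)) != NULL)
                return (obj);

        /* Freelist: grow the cache with free pages if any are left. */
        if ((obj = freelist_grow(cp)) != NULL)
                return (obj);

        /* Back end: reclaim memory; with KM_SLEEP this is where we wait. */
        return (backend_reclaim_and_retry(cp, kmflag));
}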
Now, I will give someone within Sun time to review your
trace and respond. If they don't, I will consider
reviewing this further.
Mitchell Erblich
---------------
Pawel Jakub Dawidek wrote:
>
> Hi.
>
> Kris Kennaway <kris at FreeBSD.org> found a deadlock, which I think is not
> FreeBSD-specific.
>
> When we are running low on memory and kmem_alloc(KM_SLEEP) is called,
> the thread waits for the memory to be reclaimed, right?
> In such a situation the arc_reclaim_thread thread is woken up.
>
> Ok. I've two threads waiting for the memory to be freed:
>
> First one, and this one is not really problematic:
>
> arc_lowmem(0,0,a054c56b,12c,a0a74088,...) at arc_lowmem+0x74
> kmem_malloc(a0a71090,20000,2,e6768840,a047e7e5,...) at kmem_malloc+0x131
> page_alloc(0,20000,e6768833,2,aedd9d20,...) at page_alloc+0x27
> uma_large_malloc(20000,2,0,0,0,...) at uma_large_malloc+0x55
> malloc(20000,aedd6080,2,e6768888,aedb7159,...) at malloc+0x120
> zfs_kmem_alloc(20000,2,e67688b8,aed791db,20000,...) at zfs_kmem_alloc+0x13
> zio_data_buf_alloc(20000,aedd9cc0,20000,1,20000,...) at zio_data_buf_alloc+0xd
> arc_get_data_buf(ae166dc0,2,ca220690,b1f37450,e6768928,...) at
> arc_get_data_buf+0x23f
> arc_buf_alloc(ae18d000,20000,ca220690,1,0,...) at arc_buf_alloc+0x9a
> dbuf_read(ca220690,b1f37450,2,bf0933a0,1a7600,...) at dbuf_read+0xf4
> dmu_tx_check_ioerr(0,d,0,a0a74880,0,...) at dmu_tx_check_ioerr+0x6c
> dmu_tx_count_write(197600,0,10000,0,197600,...) at dmu_tx_count_write+0x3ce
> dmu_tx_hold_write(bad1f800,5d52,0,197600,0,...) at dmu_tx_hold_write+0x50
> zfs_freebsd_write(e6768b90,a055a4d5,0,0,0,...) at zfs_freebsd_write+0x1cf
> VOP_WRITE_APV(aedd8540,e6768b90,b608b1d0,a053f500,241,...) at
> VOP_WRITE_APV+0x17c
> vn_write(ae95e630,e6768c58,c1a72680,0,b608b1d0,...) at vn_write+0x250
> dofilewrite(b608b1d0,4,ae95e630,e6768c58,ffffffff,...) at dofilewrite+0x9e
> kern_writev(b608b1d0,4,e6768c58,805f000,10000,...) at kern_writev+0x60
> write(b608b1d0,e6768d00,c,e6768c94,a034f435,...) at write+0x4f
> syscall(e6768d38) at syscall+0x2f3
>
> And second one, which holds arc_buf_t->b_hdr->hash_lock:
>
> arc_lowmem(0,0,a054c56b,12c,a0a74088,...) at arc_lowmem+0x1c
> kmem_malloc(a0a71090,20000,2,e69888b8,a047e7e5,...) at kmem_malloc+0x131
> page_alloc(0,20000,e69888ab,2,aedd9da0,...) at page_alloc+0x27
> uma_large_malloc(20000,2,0,0,0,...) at uma_large_malloc+0x55
> malloc(20000,aedd6080,2,e6988900,aedb7159,...) at malloc+0x120
> zfs_kmem_alloc(20000,2,e6988930,aed791db,20000,...) at zfs_kmem_alloc+0x13
> zio_data_buf_alloc(20000,aedd9cc0,20000,1,20000,...) at zio_data_buf_alloc+0xd
> arc_get_data_buf(ae166dc0,2,20000,0,b8644cf8,...) at arc_get_data_buf+0x23f
> arc_read(c29ec228,ae18d000,af885080,aed80b6c,aed7d168,...) at arc_read+0x33d
> dbuf_read(baf49460,c29ec228,12,c5f8a528,c6254cb0,...) at dbuf_read+0x463
> dmu_buf_hold_array_by_dnode(20000,0,400,0,1,...) at
> dmu_buf_hold_array_by_dnode+0x1b0
> dmu_buf_hold_array(58ef,0,20000,0,400,...) at dmu_buf_hold_array+0x4c
> dmu_read_uio(c23da3c0,58ef,0,e6988c58,400,...) at dmu_read_uio+0x35
> zfs_freebsd_read(e6988b90,a055a48c,adbbd6c0,adbbd6c0,adbbd6c0,...) at
> zfs_freebsd_read+0x3d8
> VOP_READ_APV(aedd8540,e6988b90,b6f4eae0,a053f500,202,...) at VOP_READ_APV+0xd2
> vn_read(adbbd6c0,e6988c58,b0c58e00,0,b6f4eae0,...) at vn_read+0x297
> dofileread(b6f4eae0,3,adbbd6c0,e6988c58,ffffffff,...) at dofileread+0xa7
> kern_readv(b6f4eae0,3,e6988c58,9f7fc670,400,...) at kern_readv+0x60
> read(b6f4eae0,e6988d00,c,e6988c94,a034f435,...) at read+0x4f
> syscall(e6988d38) at syscall+0x2f3
>
> The arc_reclaim_thread thread deadlocks here:
>
> sched_switch(ae8f9910,0,1,17a,a05b6e14,...) at sched_switch+0x16c
> mi_switch(1,0,a0537906,1cc,aede1b8c,...) at mi_switch+0x306
> sleepq_switch(aede1b8c,0,a0537906,21e,e611abe0,...) at sleepq_switch+0x113
> sleepq_wait(aede1b8c,0,aedcef22,3,0,...) at sleepq_wait+0x65
> _sx_xlock_hard(aede1b8c,ae8f9910,aedcecb7,447,e611ac1c,...) at
> _sx_xlock_hard+0x17e
> _sx_xlock(aede1b8c,aedcecb7,447,0,0,...) at _sx_xlock+0x69
> arc_buf_remove_ref(b4288ca8,cc943c94,519,cc943cd0,cc943dac,...) at
> arc_buf_remove_ref+0x58
> dbuf_rele(cc943c94,cc943dac,cc95d348,cc943dac,35,...) at dbuf_rele+0x195
> dbuf_clear(cc943dac,cc943dac,e611ac80,aed7cea0,cc943dac,...) at
> dbuf_clear+0x7f
> dbuf_evict(cc943dac,cc99a3d4,e611ac90,aed788b5,cc99a3d4,...) at dbuf_evict+0xd
> dbuf_do_evict(cc99a3d4,4879,e611acfc,aed78f0f,64,...) at dbuf_do_evict+0x44
> arc_do_user_evicts(64,0,246,a056b498,1,...) at arc_do_user_evicts+0x51
> arc_reclaim_thread(0,e611ad38,a0531203,326,ae8f76c0,...) at
> arc_reclaim_thread+0x36b
> fork_exit(aed78ba4,0,e611ad38) at fork_exit+0xd1
> fork_trampoline() at fork_trampoline+0x8
>
> Let me convert the offsets to file:line as found in OpenSolaris code:
>
> arc.c:1088 arc_buf_remove_ref+0x58
> dbuf.c:1710/1713 dbuf_rele+0x195
> dbuf.c:1308 dbuf_clear+0x7f
> dbuf.c:233 dbuf_evict+0xd
> dbuf.c:1453 dbuf_do_evict+0x44
> arc.c:1314 arc_do_user_evicts+0x51
> arc.c:1537 arc_reclaim_thread+0x36b
>
> (For dbuf.c:1710/1713 I'm not sure exactly which line it is, but from
> what I can see it doesn't matter.)
>
> The most important part is dbuf.c:1308, which calls dbuf_rele() on the
> parent dmu_buf_impl_t, which is already locked by the second thread, so
> when we lock it in arc_buf_remove_ref() we deadlock, because the lock
> is held by the thread waiting for memory.
>
> Does my description make sense? Do you have any suggestions how to fix it?
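
To restate the cycle you describe in a self-contained form, here is a
toy pthread program; all of its names are invented for illustration
(the real cycle is between b_hdr->hash_lock and a thread sleeping in
kmem_alloc(KM_SLEEP)), but it hangs for the same reason: the reclaim
side needs a lock held by a thread that is itself asleep waiting for
the reclaim side.

#include <pthread.h>
#include <unistd.h>

/*
 * Toy reproduction of the cycle with invented names; the real
 * cycle involves b_hdr->hash_lock and kmem_alloc(KM_SLEEP).
 */
static pthread_mutex_t hash_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t mem_lock  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  mem_freed = PTHREAD_COND_INITIALIZER;
static int memory_was_freed = 0;

/* Plays the part of the reader in the second trace. */
static void *
reader(void *arg)
{
        pthread_mutex_lock(&hash_lock);         /* holds the hash lock     */
        pthread_mutex_lock(&mem_lock);
        while (!memory_was_freed)               /* "kmem_alloc(KM_SLEEP)": */
                pthread_cond_wait(&mem_freed, &mem_lock); /* sleeps until  */
        pthread_mutex_unlock(&mem_lock);        /* reclaim frees memory    */
        pthread_mutex_unlock(&hash_lock);
        return (NULL);
}

/* Plays the part of arc_reclaim_thread. */
static void *
reclaim(void *arg)
{
        sleep(1);                               /* let the reader go first */

        /* dbuf_clear() -> dbuf_rele(parent) -> arc_buf_remove_ref(): */
        pthread_mutex_lock(&hash_lock);         /* blocks forever; the     */
                                                /* reader holds this lock  */
        pthread_mutex_lock(&mem_lock);          /* never reached           */
        memory_was_freed = 1;
        pthread_cond_signal(&mem_freed);
        pthread_mutex_unlock(&mem_lock);
        pthread_mutex_unlock(&hash_lock);
        return (NULL);
}

int
main(void)
{
        pthread_t t1, t2;

        pthread_create(&t1, NULL, reader, NULL);
        pthread_create(&t2, NULL, reclaim, NULL);
        pthread_join(t1, NULL);                 /* hangs: classic deadlock */
        pthread_join(t2, NULL);
        return (0);
}

The join in main() never returns, which is the same symptom as the
stuck arc_reclaim_thread in your third trace.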
>
> --
> Pawel Jakub Dawidek http://www.wheel.pl
> pjd at FreeBSD.org http://www.FreeBSD.org
> FreeBSD committer Am I Evil? Yes, I Am!
>