> On Dec 26, 2018, at 4:11 PM, Taylor R Campbell > <campbell+netbsd-source-change...@mumble.net> wrote: > > Job reference counting is currently slightly busted -- my draft for > threadpool_schedule_job didn't increment it correctly, and your > threadpool_schedule_job doesn't handle failure in threadpool_job_hold.
Ok, I looked at this some last night, and was up early this morning and took a peek at it again, and I think getting rid of the refcnt is going to be tough. The rub is the special handling of "assigned to overseer" in cancel_async. Because of that, you can't really use job_thread as a proxy for "held", and there's a small window in the overseer thread where neither the pool nor the job are locked because of locking order (which is job -> pool), yet the job is referenced; this happens when there is a thread to give the job to, we pull the job off the overseer queue, but we can't lock the job until we drop the tp_lock -- it's during that window where we have phantom reference that needs to be accounted for. So, here's how I think it can be fixed: -- Must continue to use atomics for job_hold / job_rele. -- job_hold remains lockless, but job_rele can only be called **without** the job_lock held, because it needs to acquire the lock in the unlikely case it performs a cv_broadcast (see below). Also need a job_rele_locked. -- In schedule_job, always job_hold for either case (overseer or direct hand-off to idle thread). This is the job's lifecycle reference, and will be dropped either in cancel_async or in job_done. -- In cancel_async, only job_rele_locked after successfully removing it from the overseer queue (protected by tp_lock). Note the job_lock is held by the caller of cancel_async. -- In overseer thread, after grabbing a local reference to the job, do a job_hold before dropping the tp_lock, to prevent racing with cancel_async. This is a temporary hold while we do our work, and will always be released, regardless of job being cancelled beneath us or given to a thread. Note the most likely scenario is that the job was NOT cancelled beneath us, and thus we will not be dropping the last reference, and therefore we don't want to take the job_lock again in this common case just to drop our temporary reference. -- job_done should assert there is always at least 1 reference, and should directly decrement it rather than doing job_rele, because it already has the job_lock held and always does the cv_broadcast anyway. -- Don't need to handle any overflow in job_hold / job_rele ... there reference count should ever one be 0 (totally idle), 1 (scheduled/running OR phantom reference), or 2 (scheduled AND phantom reference). We can assert no overflow in job_hold. -- thorpej