I'm moving this particular part of the threadpool follow-up discussion to 
tech-kern.

> On Dec 26, 2018, at 10:04 AM, Taylor R Campbell 
> <[email protected]> wrote:
> 
>> +bool
>> +threadpool_cancel_job_async(threadpool_t *pool, threadpool_job_t *ext_job)
>> +{
>> +    threadpool_job_impl_t *job = THREADPOOL_JOB_TO_IMPL(ext_job);
>> +
>> +    KASSERT(mutex_owned(job->job_lock));
>> +
>> +    /*
>> +     * XXXJRT This fails (albeit safely) when all of the following
>> +     * are true:
>> +     *
>> +     *      => "pool" is something other than what the job was
>> +     *         scheduled on.  This can legitimately occur if,
>> +     *         for example, a job is percpu-scheduled on CPU0
>> +     *         and then CPU1 attempts to cancel it without taking
>> +     *         a remote pool reference.  (this might happen by
>> +     *         "luck of the draw").
> 
> Under what circumstances can this happen?  This sounds like a bug that
> we should KASSERT into oblivion if there's any danger of it.


Consider this hypothetical...

Let's say you have MyNiftyApplication that from time to time wants to do some 
long-running chore that thread pools are well-suited for.  For its own reasons, 
it uses a percpu thread pool because the application designer determined that 
at the time the decision is made to run the job, it should run on the CPU that 
did the deciding, although for the purposes of deciding to schedule the job 
again while the job is already in-progress, it is not critical that this is the 
case.

Now let's way MyNiftyApplication is exiting, and so it wants to ensure it's 
long-running chore is stopped.  So it calls threadpool_cancel_job().  Now, 
which individual pool in the percpu cadre should be passed to 
threadpool_cancel_job()?  The job itself doesn't directly track which pool it 
was last successfully scheduled on, and because thredpool_schedule_job() 
doesn't return any indication that the job was already scheduled, 
MyNiftyApplication can't reliably track it either.

So, threadpool_cancel_job_async() does a best-effort... If it knows the job has 
no assigned thread, ta-da, cancelled!  If it's assigned thread is the specified 
pool's overseer, ta-da, cancelled!  But that it can't tell is if the assigned 
thread is some *other* pool in the cadre's overseer, and so it falls into the 
already-running-must-wait case.

I don't see how this is particularly harmful, as it will result in callers 
simply having to wait a little longer in a case there they might not otherwise 
have had to.  Job cancellation is inherently racy, which is why we sometimes 
have to tell callers to wait.  Considering this edge case still results in safe 
behavior, I don't think it's worth throwing extra bookkeeping at the problem.

-- thorpej

Reply via email to