Re: [Qemu-devel] [PATCH 02/21] jobs: add exit shim

2018-08-08 Thread Kevin Wolf
On 08.08.2018 at 17:38, John Snow wrote:
> On 08/08/2018 11:23 AM, Kevin Wolf wrote:
> > On 08.08.2018 at 06:02, Jeff Cody wrote:
> >> On Tue, Aug 07, 2018 at 12:33:30AM -0400, John Snow wrote:
> >>> Most jobs do the same thing when they leave their running loop:
> >>> - Store the return code in a structure
> >>> - wait to receive this structure in the main thread
> >>> - signal job completion via job_completed
> >>>
> >>> More seriously, when we utilize job_defer_to_main_loop_bh to call
> >>> a function that calls job_completed, job_finalize_single will run
> >>> in a context where it has recursively taken the aio_context lock,
> >>> which can cause hangs if it puts down a reference that causes a flush.
> >>>
> >>> The job infrastructure is perfectly capable of registering job
> >>> completion itself when we leave the job's entry point. In this
> >>> context, we can signal job completion from outside of the aio_context,
> >>> which should allow for job cleanup code to run with only one lock.
> >>>
> >>> Signed-off-by: John Snow 
> >>
> >> I like the simplification, both in SLOC and in exit logic (as seen in
> >> patches 3-7).
> > 
> > I agree, unifying this seems like a good idea.
> > 
> > Like in the first patch, I'm not convinced of the details, though.
> > Essentially, this is my objection regarding job->err extended to
> > job->ret: You rely on jobs setting job->ret and job->err, but the
> > interfaces don't really show this.
> > 
> >>> @@ -546,6 +559,12 @@ static void coroutine_fn job_co_entry(void *opaque)
> >>>      assert(job && job->driver && job->driver->start);
> >>>      job_pause_point(job);
> >>>      job->driver->start(job);
> >>
> >> One nit-picky observation here, that is unrelated to this patch: reading
> >> through, it may not be so obvious that 'start' is really a 'run' or
> >> 'execute' (linguistically, to me 'start' implies a kick-off rather than
> >> ongoing execution).
> > 
> > I had exactly the same thought. My proposal is to change the existing...
> > 
> > CoroutineEntry *start;
> > 
> > ...which is just short for...
> > 
> > void coroutine_fn start(void *opaque);
> > 
> > ...into this one:
> > 
> > int coroutine_fn run(void *opaque, Error **errp);
> > 
> > I see that at the end of the series, you actually introduced an int
> > return value already. I would have done that from the start, but as long as
> > the final state makes sense, I won't insist.
> > 
> > But can we have the Error **errp addition, too? Pretty please?
> > 
> > Kevin
> > 
> 
> I'm actually glad you want that addition; I was strongly considering
> adding it, but I felt like I had made the series long enough already
> and didn't want to change too much all at once.
> 
> The basic thought was just:
> 
> "It'd sure be nice to have a generic function entry point that looks
> like it returns the same error information as our non-coroutine functions."
> 
> I can absolutely work that in, and break this series into two parts:
> 
> (1) Rework jobs infrastructure to use the new run signature, and
> (2) Rework jobs to use the finalization callbacks.
> 
> Sound good?

I haven't looked at the rest of the series yet, but so far this sounds
good to me.

Kevin



Re: [Qemu-devel] [PATCH 02/21] jobs: add exit shim

2018-08-08 Thread John Snow



On 08/08/2018 11:23 AM, Kevin Wolf wrote:
> On 08.08.2018 at 06:02, Jeff Cody wrote:
>> On Tue, Aug 07, 2018 at 12:33:30AM -0400, John Snow wrote:
>>> Most jobs do the same thing when they leave their running loop:
>>> - Store the return code in a structure
>>> - wait to receive this structure in the main thread
>>> - signal job completion via job_completed
>>>
>>> More seriously, when we utilize job_defer_to_main_loop_bh to call
>>> a function that calls job_completed, job_finalize_single will run
>>> in a context where it has recursively taken the aio_context lock,
>>> which can cause hangs if it puts down a reference that causes a flush.
>>>
>>> The job infrastructure is perfectly capable of registering job
>>> completion itself when we leave the job's entry point. In this
>>> context, we can signal job completion from outside of the aio_context,
>>> which should allow for job cleanup code to run with only one lock.
>>>
>>> Signed-off-by: John Snow 
>>
>> I like the simplification, both in SLOC and in exit logic (as seen in
>> patches 3-7).
> 
> I agree, unifying this seems like a good idea.
> 
> Like in the first patch, I'm not convinced of the details, though.
> Essentially, this is my objection regarding job->err extended to
> job->ret: You rely on jobs setting job->ret and job->err, but the
> interfaces don't really show this.
> 
>>> @@ -546,6 +559,12 @@ static void coroutine_fn job_co_entry(void *opaque)
>>>      assert(job && job->driver && job->driver->start);
>>>      job_pause_point(job);
>>>      job->driver->start(job);
>>
>> One nit-picky observation here, that is unrelated to this patch: reading
>> through, it may not be so obvious that 'start' is really a 'run' or
>> 'execute' (linguistically, to me 'start' implies a kick-off rather than
>> ongoing execution).
> 
> I had exactly the same thought. My proposal is to change the existing...
> 
> CoroutineEntry *start;
> 
> ...which is just short for...
> 
> void coroutine_fn start(void *opaque);
> 
> ...into this one:
> 
> int coroutine_fn run(void *opaque, Error **errp);
> 
> I see that at the end of the series, you actually introduced an int
> return value already. I would have done that from the start, but as long as
> the final state makes sense, I won't insist.
> 
> But can we have the Error **errp addition, too? Pretty please?
> 
> Kevin
> 

I'm actually glad you want that addition; I was strongly considering
adding it, but I felt like I had made the series long enough already
and didn't want to change too much all at once.

The basic thought was just:

"It'd sure be nice to have a generic function entry point that looks
like it returns the same error information as our non-coroutine functions."

I can absolutely work that in, and break this series into two parts:

(1) Rework jobs infrastructure to use the new run signature, and
(2) Rework jobs to use the finalization callbacks.

Sound good?

--js



Re: [Qemu-devel] [PATCH 02/21] jobs: add exit shim

2018-08-08 Thread Kevin Wolf
On 08.08.2018 at 06:02, Jeff Cody wrote:
> On Tue, Aug 07, 2018 at 12:33:30AM -0400, John Snow wrote:
> > Most jobs do the same thing when they leave their running loop:
> > - Store the return code in a structure
> > - wait to receive this structure in the main thread
> > - signal job completion via job_completed
> > 
> > More seriously, when we utilize job_defer_to_main_loop_bh to call
> > a function that calls job_completed, job_finalize_single will run
> > in a context where it has recursively taken the aio_context lock,
> > which can cause hangs if it puts down a reference that causes a flush.
> > 
> > The job infrastructure is perfectly capable of registering job
> > completion itself when we leave the job's entry point. In this
> > context, we can signal job completion from outside of the aio_context,
> > which should allow for job cleanup code to run with only one lock.
> > 
> > Signed-off-by: John Snow 
> 
> I like the simplification, both in SLOC and in exit logic (as seen in
> patches 3-7).

I agree, unifying this seems like a good idea.

Like in the first patch, I'm not convinced of the details, though.
Essentially, this is my objection regarding job->err extended to
job->ret: You rely on jobs setting job->ret and job->err, but the
interfaces don't really show this.

> > @@ -546,6 +559,12 @@ static void coroutine_fn job_co_entry(void *opaque)
> >      assert(job && job->driver && job->driver->start);
> >      job_pause_point(job);
> >      job->driver->start(job);
> 
> One nit-picky observation here, that is unrelated to this patch: reading
> through, it may not be so obvious that 'start' is really a 'run' or
> 'execute' (linguistically, to me 'start' implies a kick-off rather than
> ongoing execution).

I had exactly the same thought. My proposal is to change the existing...

CoroutineEntry *start;

...which is just short for...

void coroutine_fn start(void *opaque);

...into this one:

int coroutine_fn run(void *opaque, Error **errp);
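
For illustration, a driver could then return its result and error directly
instead of stashing them in the Job behind the interface's back. A rough
sketch only; example_run and example_do_work are made-up names, and
error_setg() is from qapi/error.h:

static int coroutine_fn example_run(void *opaque, Error **errp)
{
    Job *job = opaque;
    int ret = example_do_work(job);    /* placeholder for the job's main loop */

    if (ret < 0) {
        error_setg(errp, "example job failed");
    }
    return ret;
}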

I see that at the end of the series, you actually introduced an int
return value already. I would have done that from the start, but as long as
the final state makes sense, I won't insist.

But can we have the Error **errp addition, too? Pretty please?

Kevin



Re: [Qemu-devel] [PATCH 02/21] jobs: add exit shim

2018-08-08 Thread John Snow



On 08/08/2018 12:02 AM, Jeff Cody wrote:
> On Tue, Aug 07, 2018 at 12:33:30AM -0400, John Snow wrote:
>> Most jobs do the same thing when they leave their running loop:
>> - Store the return code in a structure
>> - wait to receive this structure in the main thread
>> - signal job completion via job_completed
>>
>> More seriously, when we utilize job_defer_to_main_loop_bh to call
>> a function that calls job_completed, job_finalize_single will run
>> in a context where it has recursively taken the aio_context lock,
>> which can cause hangs if it puts down a reference that causes a flush.
>>
>> The job infrastructure is perfectly capable of registering job
>> completion itself when we leave the job's entry point. In this
>> context, we can signal job completion from outside of the aio_context,
>> which should allow for job cleanup code to run with only one lock.
>>
>> Signed-off-by: John Snow 
> 
> I like the simplification, both in SLOC and in exit logic (as seen in
> patches 3-7).
> 
>> ---
>>  include/qemu/job.h |  7 +++++++
>>  job.c              | 19 +++++++++++++++++++
>>  2 files changed, 26 insertions(+)
>>
>> diff --git a/include/qemu/job.h b/include/qemu/job.h
>> index 845ad00c03..0c24e8704f 100644
>> --- a/include/qemu/job.h
>> +++ b/include/qemu/job.h
>> @@ -204,6 +204,13 @@ struct JobDriver {
>>       */
>>      void (*drain)(Job *job);
>>  
>> +    /**
>> +     * If the callback is not NULL, exit will be invoked from the main thread
>> +     * when the job's coroutine has finished, but before transactional
>> +     * convergence; before @prepare or @abort.
>> +     */
>> +    void (*exit)(Job *job);
>> +
>>      /**
>>       * If the callback is not NULL, prepare will be invoked when all the jobs
>>       * belonging to the same transaction complete; or upon this job's completion
>> diff --git a/job.c b/job.c
>> index b281f30375..cc5ac9ac30 100644
>> --- a/job.c
>> +++ b/job.c
>> @@ -535,6 +535,19 @@ void job_drain(Job *job)
>>      }
>>  }
>>  
>> +static void job_exit(void *opaque)
>> +{
>> +    Job *job = (Job *)opaque;
>> +    AioContext *aio_context = job->aio_context;
>> +
>> +    if (job->driver->exit) {
>> +        aio_context_acquire(aio_context);
>> +        job->driver->exit(job);
>> +        aio_context_release(aio_context);
>> +    }
>> +    job_completed(job, job->ret);
>> +}
>> +
>>  /**
>>   * All jobs must allow a pause point before entering their job proper. This
>>   * ensures that jobs can be paused prior to being started, then resumed later.
>> @@ -546,6 +559,12 @@ static void coroutine_fn job_co_entry(void *opaque)
>>      assert(job && job->driver && job->driver->start);
>>      job_pause_point(job);
>>      job->driver->start(job);
> 
> One nit-picky observation here, that is unrelated to this patch: reading
> through, it may not be so obvious that 'start' is really a 'run' or
> 'execute' (linguistically, to me 'start' implies a kick-off rather than
> ongoing execution).
> 
> Just some bike-shedding again, though, and not even for this patch.  So
> nothing to do here :)
> 
> Reviewed-by: Jeff Cody 

I agree with you and thought the same as I was going through these
changes. I think patch 12 in particular, where I really solidify the
idea that this is the main execution loop for the coroutine, makes it
obvious that it should be named .run or similar.

Also, while we're bikeshedding naming, it's weird that the flow here is:

.run/.start
.exit
.prepare
.commit
.abort
.clean

...I might want to emphasize a few points here:

(A) exit occurs prior to "finalization" and should not modify the graph
(B) "prepare" really means "prepare to finalize"

I just don't always have good terminology.
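
Roughly, the driver hooks then line up like this sketch (illustrative only;
the example_* callbacks are placeholders and only the hooks discussed above
are listed):

static const JobDriver example_job_driver = {
    .start   = example_start,    /* coroutine: the job's main loop */
    .exit    = example_exit,     /* main loop, after the coroutine returns,
                                  * before finalization; no graph changes */
    .prepare = example_prepare,  /* i.e. "prepare to finalize" */
    .commit  = example_commit,   /* transaction (or lone job) succeeded */
    .abort   = example_abort,    /* transaction (or lone job) failed */
    .clean   = example_clean,    /* always runs, after commit or abort */
};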

> 
> 
>> +    if (!job->deferred_to_main_loop) {
>> +        job->deferred_to_main_loop = true;
>> +        aio_bh_schedule_oneshot(qemu_get_aio_context(),
>> +                                job_exit,
>> +                                job);
>> +    }
>>  }
>>  
>>  
>> -- 
>> 2.14.4
>>
> 



Re: [Qemu-devel] [PATCH 02/21] jobs: add exit shim

2018-08-07 Thread Jeff Cody
On Tue, Aug 07, 2018 at 12:33:30AM -0400, John Snow wrote:
> Most jobs do the same thing when they leave their running loop:
> - Store the return code in a structure
> - wait to receive this structure in the main thread
> - signal job completion via job_completed
> 
> More seriously, when we utilize job_defer_to_main_loop_bh to call
> a function that calls job_completed, job_finalize_single will run
> in a context where it has recursively taken the aio_context lock,
> which can cause hangs if it puts down a reference that causes a flush.
> 
> The job infrastructure is perfectly capable of registering job
> completion itself when we leave the job's entry point. In this
> context, we can signal job completion from outside of the aio_context,
> which should allow for job cleanup code to run with only one lock.
> 
> Signed-off-by: John Snow 

I like the simplification, both in SLOC and in exit logic (as seen in
patches 3-7).

> ---
>  include/qemu/job.h |  7 +++++++
>  job.c              | 19 +++++++++++++++++++
>  2 files changed, 26 insertions(+)
> 
> diff --git a/include/qemu/job.h b/include/qemu/job.h
> index 845ad00c03..0c24e8704f 100644
> --- a/include/qemu/job.h
> +++ b/include/qemu/job.h
> @@ -204,6 +204,13 @@ struct JobDriver {
>       */
>      void (*drain)(Job *job);
>  
> +    /**
> +     * If the callback is not NULL, exit will be invoked from the main thread
> +     * when the job's coroutine has finished, but before transactional
> +     * convergence; before @prepare or @abort.
> +     */
> +    void (*exit)(Job *job);
> +
>      /**
>       * If the callback is not NULL, prepare will be invoked when all the jobs
>       * belonging to the same transaction complete; or upon this job's completion
> diff --git a/job.c b/job.c
> index b281f30375..cc5ac9ac30 100644
> --- a/job.c
> +++ b/job.c
> @@ -535,6 +535,19 @@ void job_drain(Job *job)
>      }
>  }
>  
> +static void job_exit(void *opaque)
> +{
> +    Job *job = (Job *)opaque;
> +    AioContext *aio_context = job->aio_context;
> +
> +    if (job->driver->exit) {
> +        aio_context_acquire(aio_context);
> +        job->driver->exit(job);
> +        aio_context_release(aio_context);
> +    }
> +    job_completed(job, job->ret);
> +}
> +
>  /**
>   * All jobs must allow a pause point before entering their job proper. This
>   * ensures that jobs can be paused prior to being started, then resumed later.
> @@ -546,6 +559,12 @@ static void coroutine_fn job_co_entry(void *opaque)
>      assert(job && job->driver && job->driver->start);
>      job_pause_point(job);
>      job->driver->start(job);

One nit-picky observation here, that is unrelated to this patch: reading
through, it may not be so obvious that 'start' is really a 'run' or
'execute' (linguistically, to me 'start' implies a kick-off rather than
ongoing execution).

Just some bike-shedding again, though, and not even for this patch.  So
nothing to do here :)

Reviewed-by: Jeff Cody 


> +    if (!job->deferred_to_main_loop) {
> +        job->deferred_to_main_loop = true;
> +        aio_bh_schedule_oneshot(qemu_get_aio_context(),
> +                                job_exit,
> +                                job);
> +    }
>  }
>  
>  
> -- 
> 2.14.4
> 



[Qemu-devel] [PATCH 02/21] jobs: add exit shim

2018-08-06 Thread John Snow
Most jobs do the same thing when they leave their running loop:
- Store the return code in a structure
- wait to receive this structure in the main thread
- signal job completion via job_completed

More seriously, when we utilize job_defer_to_main_loop_bh to call
a function that calls job_completed, job_finalize_single will run
in a context where it has recursively taken the aio_context lock,
which can cause hangs if it puts down a reference that causes a flush.

The job infrastructure is perfectly capable of registering job
completion itself when we leave the job's entry point. In this
context, we can signal job completion from outside of the aio_context,
which should allow for job cleanup code to run with only one lock.

Signed-off-by: John Snow 
---
 include/qemu/job.h |  7 +++++++
 job.c              | 19 +++++++++++++++++++
 2 files changed, 26 insertions(+)

diff --git a/include/qemu/job.h b/include/qemu/job.h
index 845ad00c03..0c24e8704f 100644
--- a/include/qemu/job.h
+++ b/include/qemu/job.h
@@ -204,6 +204,13 @@ struct JobDriver {
      */
     void (*drain)(Job *job);
 
+    /**
+     * If the callback is not NULL, exit will be invoked from the main thread
+     * when the job's coroutine has finished, but before transactional
+     * convergence; before @prepare or @abort.
+     */
+    void (*exit)(Job *job);
+
     /**
      * If the callback is not NULL, prepare will be invoked when all the jobs
      * belonging to the same transaction complete; or upon this job's completion
diff --git a/job.c b/job.c
index b281f30375..cc5ac9ac30 100644
--- a/job.c
+++ b/job.c
@@ -535,6 +535,19 @@ void job_drain(Job *job)
     }
 }
 
+static void job_exit(void *opaque)
+{
+    Job *job = (Job *)opaque;
+    AioContext *aio_context = job->aio_context;
+
+    if (job->driver->exit) {
+        aio_context_acquire(aio_context);
+        job->driver->exit(job);
+        aio_context_release(aio_context);
+    }
+    job_completed(job, job->ret);
+}
+
 /**
  * All jobs must allow a pause point before entering their job proper. This
  * ensures that jobs can be paused prior to being started, then resumed later.
@@ -546,6 +559,12 @@ static void coroutine_fn job_co_entry(void *opaque)
     assert(job && job->driver && job->driver->start);
     job_pause_point(job);
     job->driver->start(job);
+    if (!job->deferred_to_main_loop) {
+        job->deferred_to_main_loop = true;
+        aio_bh_schedule_oneshot(qemu_get_aio_context(),
+                                job_exit,
+                                job);
+    }
 }
 
 
-- 
2.14.4
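
For context, the sort of per-job pattern this shim replaces (the simplification
in patches 3-7 that Jeff mentions) looks roughly like the sketch below; the
Example*/example_* names are placeholders, not code from this series:

typedef struct {
    int ret;
} ExampleCompleteData;

static void example_complete(Job *job, void *opaque)
{
    ExampleCompleteData *data = opaque;

    /* Runs as a main-loop BH with the job's AioContext already acquired by
     * the deferral machinery, so job_completed() -> job_finalize_single()
     * executes under a recursively taken lock. */
    job_completed(job, data->ret);
    g_free(data);
}

static void coroutine_fn example_start(void *opaque)
{
    Job *job = opaque;
    ExampleCompleteData *data = g_new(ExampleCompleteData, 1);

    data->ret = example_do_work(job);    /* placeholder for the job's main loop */
    job_defer_to_main_loop(job, example_complete, data);
}

With the shim applied, the tail of example_start() collapses to storing the
return code (e.g. in job->ret) and returning; the generic job_exit() above then
calls job_completed() from the main loop, invoking the driver's .exit callback
first if one is set.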