Re: [Qemu-block] [PATCH 07/19] block: Really pause block jobs on drain

2018-04-13 Thread Kevin Wolf
Am 13.04.2018 um 13:05 hat Paolo Bonzini geschrieben:
> On 13/04/2018 10:01, Kevin Wolf wrote:
> >> Or bs->quiescent, for the sake of bikeshedding.
> > Yes, that sounds better.
> > 
> > The only problem with the proposal as I made it is that it's wrong. We
> > can't keep bs->quiescent until bdrv_do_drained_end() because the caller
> > can issue new requests and then have a nested drained section that needs
> > to wait for all requests again instead of deciding that everything is
> > already quiescent.
> > 
> > Maybe where we should really reset it is in the initial recursion of
> > bdrv_do_drained_begin(), specifically in bdrv_do_drained_begin_quiesce()
> > which is called by both the parent and the child recursion.
> > 
> > There don't seem to be completely obviously correct solutions (can't an
> > I/O thread be draining a specific node while the main loop runs
> > drain_all?), but this would probably be the most obvious one.
> 
> Or use a hash table?

I don't think it would be any more obvious, but it would bring in quite
a bit of additional complexity (structural, not computational), with
multiple places that create a hash table, which would then have to be
passed down to all the functions involved in the recursion, etc.

The fundamental question would stay the same as with bool quiescent:
When do you have to enter a node in the hash table, and when do you have
to remove it again?

The first question is easy: you mark a node quiescent when bdrv_drain_poll()
returns false. The second is a bit harder, but resetting the quiescent
state in bdrv_do_drained_begin_quiesce() feels like the best place. It's
much simpler than recursively resetting it in all places that start new
activity (which would include BlockBackend users, not only nodes).
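
To make that concrete, here is a rough sketch of the idea (not code from the
series; the bs->quiescent field is the one proposed above, the signatures are
simplified, and the existing quiesce logic is only hinted at in a comment):

bool bdrv_drain_poll(BlockDriverState *bs)
{
    if (bs->quiescent) {
        /* Already known to be quiescent, skip the parent recursion */
        return false;
    }

    if (bdrv_parent_drained_poll(bs) || atomic_read(&bs->in_flight)) {
        return true;    /* still busy, keep polling */
    }

    /* The first time the poll comes back negative, remember it */
    bs->quiescent = true;
    return false;
}

void bdrv_do_drained_begin_quiesce(BlockDriverState *bs)
{
    /* A new drained section must wait for all requests again, so drop the
     * cached state in the function that both the parent and the child
     * recursion of bdrv_do_drained_begin() go through */
    bs->quiescent = false;

    /* ... existing quiesce logic (quiesce_counter, driver callbacks) ... */
}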

Kevin



Re: [Qemu-block] [PATCH 07/19] block: Really pause block jobs on drain

2018-04-13 Thread Paolo Bonzini
On 13/04/2018 10:01, Kevin Wolf wrote:
>> Or bs->quiescent, for the sake of bikeshedding.
> Yes, that sounds better.
> 
> The only problem with the proposal as I made it is that it's wrong. We
> can't keep bs->quiescent until bdrv_do_drained_end() because the caller
> can issue new requests and then have a nested drained section that needs
> to wait for all requests again instead of deciding that everything is
> already quiescent.
> 
> Maybe where we should really reset it is in the initial recursion of
> bdrv_do_drained_begin(), specifically in bdrv_do_drained_begin_quiesce()
> which is called by both the parent and the child recursion.
> 
> There don't seem to be completely obviously correct solutions (can't an
> I/O thread be draining a specific node while the main loop runs
> drain_all?), but this would probably be the most obvious one.

Or use a hash table?

Paolo



Re: [Qemu-block] [PATCH 07/19] block: Really pause block jobs on drain

2018-04-13 Thread Kevin Wolf
Am 12.04.2018 um 22:44 hat Paolo Bonzini geschrieben:
> On 12/04/2018 16:25, Kevin Wolf wrote:
> > This is already the order we have there. What is probably different from
> > what you envision is that after the parents have concluded, we still
> > check that they are still quiescent in every iteration.
> 
> Yes, and that's the quadratic part.
> 
> > What we could do easily is introducing a bool bs->quiesce_concluded or
> > something that bdrv_drain_poll() sets to true the first time that it
> > returns false. It also gets a shortcut so that it returns false
> > immediately if bs->quiesce_concluded is true. The field is reset in
> > bdrv_do_drained_end() when bs->quiesce_counter reaches 0.
> 
> Or bs->quiescent, for the sake of bikeshedding.

Yes, that sounds better.

The only problem with the proposal as I made it is that it's wrong. We
can't keep bs->quiescent until bdrv_do_drained_end() because the caller
can issue new requests and then have a nested drained section that needs
to wait for all requests again instead of deciding that everything is
already quiescent.

Maybe where we should really reset it is in the initial recursion of
bdrv_do_drained_begin(), specifically in bdrv_do_drained_begin_quiesce()
which is called by both the parent and the child recursion.

There don't seem to be completely obviously correct solutions (can't an
I/O thread be draining a specific node while the main loop runs
drain_all?), but this would probably be the most obvious one.

Hm... Or actually, reset bs->quiescent in bdrv_inc_in_flight()? Would
this be enough?
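
For comparison, the bdrv_inc_in_flight() variant would be a one-liner on top
of the existing counter increment (sketch only; whether this is sufficient is
exactly the open question):

void bdrv_inc_in_flight(BlockDriverState *bs)
{
    /* Any new request means the node is no longer quiescent */
    bs->quiescent = false;
    atomic_inc(&bs->in_flight);
}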

Kevin



Re: [Qemu-block] [PATCH 07/19] block: Really pause block jobs on drain

2018-04-12 Thread Paolo Bonzini
On 12/04/2018 16:25, Kevin Wolf wrote:
> This is already the order we have there. What is probably different from
> what you envision is that after the parents have concluded, we still
> check that they are still quiescent in every iteration.

Yes, and that's the quadratic part.

> What we could do easily is introducing a bool bs->quiesce_concluded or
> something that bdrv_drain_poll() sets to true the first time that it
> returns false. It also gets a shortcut so that it returns false
> immediately if bs->quiesce_concluded is true. The field is reset in
> bdrv_do_drained_end() when bs->quiesce_counter reaches 0.

Or bs->quiescent, for the sake of bikeshedding.

> That would be a rather simple optimisation that could be done as the
> final patch in this series, and would ensure that we don't recurse back
> to parents that are already quiescent.
> 
> Would you be happy with this change?

Yes, I'll leave the organization of the series to you of course.

Paolo



Re: [Qemu-block] [PATCH 07/19] block: Really pause block jobs on drain

2018-04-12 Thread Kevin Wolf
Am 12.04.2018 um 15:42 hat Paolo Bonzini geschrieben:
> On 12/04/2018 15:27, Kevin Wolf wrote:
> > Not sure I follow. Let's look at an example. Say, we have a block job
> > BlockBackend as the root (because that uses proper layering, unlike
> > devices which use aio_disable_external()), connected to a qcow2 node
> > over file.
> > 
> > 1. The block job issues a request to the qcow2 node
> > 
> > 2. In order to process that request, the qcow2 node internally issues
> >one or more requests to the file node
> > 
> > 3. Someone calls bdrv_drained_begin(qcow2_node)
> > 
> > 4. We call BlockDriver.bdrv_drained_begin() for qcow2_node and
> >file_node, and BdrvChildRole.drained_begin() for the block job.
> > 
> >Now the block nodes don't create any new original requests any more
> >(qcow2 and file don't do that anyway; qcow2 only creates requests to
> >its children in the context of parent requests). The job potentially
> >continues sending requests until it reaches the next pause point. The
> >previously issued requests are still in flight.
> >
> >Is this step what you meant by X->drv->bdrv_drain(X)? I don't see why
> >pending requests can only be in X's children. Why can't the request
> >be pending in X itself, say waiting for a thread pool worker
> >decrypting a buffer?
> 
> No, that step is ->bdrv_co_drain_begin in BlockDriver.  It's where the
> "last" requests are sent to file_node after we know that qcow2_node
> won't get any more requests.

That's not really how .bdrv_co_drain_begin works. It is called first,
potentially long before we know that the parent won't send new requests
any more. All it does is prevent the node from creating new original
requests, i.e. internal requests that are not triggered by a parent
request.
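
As an illustration, a purely hypothetical driver (not code from this series)
could implement the callback by just stopping its own background activity;
waiting for requests that are already in flight is the job of the poll phase,
not of this callback:

typedef struct HypoDriverState {
    bool background_paused;  /* checked by the driver's background coroutine */
} HypoDriverState;

static void coroutine_fn hypo_co_drain_begin(BlockDriverState *bs)
{
    HypoDriverState *s = bs->opaque;

    /* Stop creating new "original" requests; requests that are already
     * in flight simply run to completion */
    s->background_paused = true;
}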

> >Also, note that after this series, the driver callbacks are called
> >asynchronously, but I don't think it makes a big difference here.
> > 
> > 5. The file_node request completes. file_node doesn't have any requests
> >in flight any more, but in theory it could still get new requests
> >from qcow2_node. Anyway, let's say this is the last request, so I
> >think we'd call its requests concluded?
> 
> No, if it can still get more requests they're not concluded.

Okay. In that case, we can't really determine any order, because the
whole subtree concludes when the root node concludes. (Ignoring any
additional parents, but the effect is the same: Children always conclude
at the same time as their last parent.)

> That's why we need to first ensure qcow2_node is quiescent, and before
> then we need to ensure that the BlockBackends are quiescent (in this
> case meaning the job has reached its pause point).  Only then we can
> look at file_node.  In this case we'll see that we have nothing to
> do---file_node is already quiescent---and bdrv_drained_begin() can
> return.

bdrv_drained_begin(qcow2_node) actually never looks at file_node. Only
bdrv_subtree_drained_begin(qcow2_node) would do so.

But what you're saying is essentially that for a subtree drain, we want
the following order in bdrv_drain_poll():

1. Check if a parent still has pending requests. If so, don't bother
   checking the rest.

2. Check the node the drain request was for. If it still has requests in
   flight, don't bother looking at the children.

3. Finally, if all parents and the node itself are drained, look at the
   children.

This is already the order we have there. What is probably different from
what you envision is that after the parents have concluded, we still
check that they are still quiescent in every iteration.
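
Roughly, the intended check order looks like this (a sketch with an assumed
'recursive' flag for the subtree case, not the code from this series):

static bool drain_poll_sketch(BlockDriverState *bs, bool recursive)
{
    BdrvChild *child;

    /* 1. A parent that still sends requests makes everything else moot */
    if (bdrv_parent_drained_poll(bs)) {
        return true;
    }

    /* 2. The node the drain request was for */
    if (atomic_read(&bs->in_flight)) {
        return true;
    }

    /* 3. Only a subtree drain also looks at the children */
    if (recursive) {
        QLIST_FOREACH(child, &bs->children, next) {
            if (drain_poll_sketch(child->bs, true)) {
                return true;
            }
        }
    }

    return false;
}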

What we could do easily is introducing a bool bs->quiesce_concluded or
something that bdrv_drain_poll() sets to true the first time that it
returns false. It also gets a shortcut so that it returns false
immediately if bs->quiesce_concluded is true. The field is reset in
bdrv_do_drained_end() when bs->quiesce_counter reaches 0.

That would be a rather simple optimisation that could be done as the
final patch in this series, and would ensure that we don't recurse back
to parents that are already quiescent.

Would you be happy with this change?

Kevin



Re: [Qemu-block] [PATCH 07/19] block: Really pause block jobs on drain

2018-04-12 Thread Paolo Bonzini
On 12/04/2018 15:27, Kevin Wolf wrote:
> Not sure I follow. Let's look at an example. Say, we have a block job
> BlockBackend as the root (because that uses proper layering, unlike
> devices which use aio_disable_external()), connected to a qcow2 node
> over file.
> 
> 1. The block job issues a request to the qcow2 node
> 
> 2. In order to process that request, the qcow2 node internally issues
>one or more requests to the file node
> 
> 3. Someone calls bdrv_drained_begin(qcow2_node)
> 
> 4. We call BlockDriver.bdrv_drained_begin() for qcow2_node and
>file_node, and BdrvChildRole.drained_begin() for the block job.
> 
>Now the block nodes don't create any new original requests any more
>(qcow2 and file don't do that anyway; qcow2 only creates requests to
>its children in the context of parent requests). The job potentially
>continues sending requests until it reaches the next pause point. The
>previously issued requests are still in flight.
>
>Is this step what you meant by X->drv->bdrv_drain(X)? I don't see why
>pending requests can only be in X's children. Why can't the request
>be pending in X itself, say waiting for a thread pool worker
>decrypting a buffer?

No, that step is ->bdrv_co_drain_begin in BlockDriver.  It's where the
"last" requests are sent to file_node after we know that qcow2_node
won't get any more requests.

>Also, note that after this series, the driver callbacks are called
>asynchronously, but I don't think it makes a big difference here.
> 
> 5. The file_node request completes. file_node doesn't have any requests
>in flight any more, but in theory it could still get new requests
>from qcow2_node. Anyway, let's say this is the last request, so I
>think we'd call its requests concluded?

No, if it can still get more requests they're not concluded.  That's why
we need to first ensure qcow2_node is quiescent, and before then we need
to ensure that the BlockBackends are quiescent (in this case meaning the
job has reached its pause point).  Only then we can look at file_node.
In this case we'll see that we have nothing to do---file_node is already
quiescent---and bdrv_drained_begin() can return.

> 6. qcow2 still has a request in flight, but doesn't need to access
>file_node for it. It finishes the work and therefore concludes its
>requests as well. Note that qcow2_node (the parent) concludes after
>file_node (the child).
> 
> 7. We'll keep the example simple, so after completion of its request,
>the job reaches a pause point without sending a new request. Again,
>this happens after qcow2_node has concluded.
> 
> 8. Only when neither file_node nor qcow2_node have a request in flight
>and the job has reached a pause point, bdrv_drained_begin() can
>return.
> 
> So completing the last request and reaching an actually quiescent state
> looks very much like a process in child-to-parent order to me?




Re: [Qemu-block] [PATCH 07/19] block: Really pause block jobs on drain

2018-04-12 Thread Kevin Wolf
Am 12.04.2018 um 14:02 hat Paolo Bonzini geschrieben:
> On 12/04/2018 13:53, Kevin Wolf wrote:
> >> The problem I have is that there is a direction through which I/O flows
> >> (parent-to-child), so why can't draining follow that natural direction.
> >> Having to check for the parents' I/O, while draining the child, seems
> >> wrong.  Perhaps we can't help it, but I cannot understand the reason.
> > I'm not sure what's there that couldn't be understood. You already
> > confirmed that we need to drain the parents, too, when we drain a node.
> > Drain really must propagate in the opposite direction of I/O, because
> > part of its job is to quiesce the origin of any I/O to the node that
> > should be drained. Opposite of I/O _is_ the natural direction for drain.
> 
> Opposite of I/O is the natural direction for drain to propagate, yes.
> 
> However, I/O direction is the natural direction for requests to stop.
> After quiescing X and calling X->drv->bdrv_drain(X), there can be
> pending requests only in X's children.  So I don't understand why you
> need to keep checking in_flight over the whole subgraph, when there are
> roots that will conclude their request first, and then their children,
> and so on so forth.

Not sure I follow. Let's look at an example. Say, we have a block job
BlockBackend as the root (because that uses proper layering, unlike
devices which use aio_disable_external()), connected to a qcow2 node
over file.

1. The block job issues a request to the qcow2 node

2. In order to process that request, the qcow2 node internally issues
   one or more requests to the file node

3. Someone calls bdrv_drained_begin(qcow2_node)

4. We call BlockDriver.bdrv_drained_begin() for qcow2_node and
   file_node, and BdrvChildRole.drained_begin() for the block job.

   Now the block nodes don't create any new original requests any more
   (qcow2 and file don't do that anyway; qcow2 only creates requests to
   its children in the context of parent requests). The job potentially
   continues sending requests until it reaches the next pause point. The
   previously issued requests are still in flight.

   Is this step what you meant by X->drv->bdrv_drain(X)? I don't see why
   pending requests can only be in X's children. Why can't the request
   be pending in X itself, say waiting for a thread pool worker
   decrypting a buffer?

   Also, note that after this series, the driver callbacks are called
   asynchronously, but I don't think it makes a big difference here.

5. The file_node request completes. file_node doesn't have any requests
   in flight any more, but in theory it could still get new requests
   from qcow2_node. Anyway, let's say this is the last request, so I
   think we'd call its requests concluded?

6. qcow2 still has a request in flight, but doesn't need to access
   file_node for it. It finishes the work and therefore concludes its
   requests as well. Note that qcow2_node (the parent) concludes after
   file_node (the child).

7. We'll keep the example simple, so after completion of its request,
   the job reaches a pause point without sending a new request. Again,
   this happens after qcow2_node has concluded.

8. Only when neither file_node nor qcow2_node have a request in flight
   and the job has reached a pause point, bdrv_drained_begin() can
   return.

So completing the last request and reaching an actually quiescent state
looks very much like a process in child-to-parent order to me?

Kevin



Re: [Qemu-block] [PATCH 07/19] block: Really pause block jobs on drain

2018-04-12 Thread Paolo Bonzini
On 12/04/2018 13:53, Kevin Wolf wrote:
>> The problem I have is that there is a direction through which I/O flows
>> (parent-to-child), so why can't draining follow that natural direction.
>> Having to check for the parents' I/O, while draining the child, seems
>> wrong.  Perhaps we can't help it, but I cannot understand the reason.
> I'm not sure what's there that couldn't be understood. You already
> confirmed that we need to drain the parents, too, when we drain a node.
> Drain really must propagate in the opposite direction of I/O, because
> part of its job is to quiesce the origin of any I/O to the node that
> should be drained. Opposite of I/O _is_ the natural direction for drain.

Opposite of I/O is the natural direction for drain to propagate, yes.

However, I/O direction is the natural direction for requests to stop.
After quiescing X and calling X->drv->bdrv_drain(X), there can be
pending requests only in X's children.  So I don't understand why you
need to keep checking in_flight over the whole subgraph, when there are
roots that will conclude their request first, and then their children,
and so on so forth.

Thanks,

Paolo

> We also have subtree drains, but that's not because that's the natural
> direction for drain, but just as a convenience function because some
> operations (e.g. reopen) affect a whole subtree, so they need everything
> in that subtree drained rather than just a single node.




Re: [Qemu-block] [PATCH 07/19] block: Really pause block jobs on drain

2018-04-12 Thread Kevin Wolf
Am 12.04.2018 um 13:30 hat Paolo Bonzini geschrieben:
> On 12/04/2018 13:11, Kevin Wolf wrote:
> >> Well, there is one gotcha: bdrv_ref protects against disappearance, but
> >> bdrv_ref/bdrv_unref are not thread-safe.  Am I missing something else?
> >
> > Apart from the above, if we do an extra bdrv_ref/unref we'd also have
> > to keep track of all the nodes that we've referenced so that we unref
> > the same nodes again, even if the graph has changed.
> >
> > So essentially you'd be introducing a new list of BDSes that we have to
> > manage and then check for every reachable node whether it's already in
> > that list or not, and for every node in the list whether it's still
> > reachable.
> 
> That would be a hash table (a set), not a list, so easy to check.  But
> the thread-safety is a bigger issue.
> 
> The problem I have is that there is a direction through which I/O flows
> (parent-to-child), so why can't draining follow that natural direction.
> Having to check for the parents' I/O, while draining the child, seems
> wrong.  Perhaps we can't help it, but I cannot understand the reason.

I'm not sure what's there that couldn't be understood. You already
confirmed that we need to drain the parents, too, when we drain a node.
Drain really must propagate in the opposite direction of I/O, because
part of its job is to quiesce the origin of any I/O to the node that
should be drained. Opposite of I/O _is_ the natural direction for drain.

We also have subtree drains, but that's not because that's the natural
direction for drain, but just as a convenience function because some
operations (e.g. reopen) affect a whole subtree, so they need everything
in that subtree drained rather than just a single node.
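
Usage-wise that is just the subtree variant of the usual pair, e.g. (a sketch;
the function name and the operation in the middle are placeholders):

static void reconfigure_subtree(BlockDriverState *bs)
{
    /* Quiesce bs and every node below it; a plain bdrv_drained_begin(bs)
     * would only cover bs itself */
    bdrv_subtree_drained_begin(bs);

    /* ... reopen/reconfigure nodes anywhere in the subtree ... */

    bdrv_subtree_drained_end(bs);
}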

Kevin



Re: [Qemu-block] [PATCH 07/19] block: Really pause block jobs on drain

2018-04-12 Thread Paolo Bonzini
On 12/04/2018 13:11, Kevin Wolf wrote:
>> Well, there is one gotcha: bdrv_ref protects against disappearance, but
>> bdrv_ref/bdrv_unref are not thread-safe.  Am I missing something else?
>
> Apart from the above, if we do an extra bdrv_ref/unref we'd also have
> to keep track of all the nodes that we've referenced so that we unref
> the same nodes again, even if the graph has changed.
>
> So essentially you'd be introducing a new list of BDSes that we have to
> manage and then check for every reachable node whether it's already in
> that list or not, and for every node in the list whether it's still
> reachable.

That would be a hash table (a set), not a list, so easy to check.  But
the thread-safety is a bigger issue.

The problem I have is that there is a direction through which I/O flows
(parent-to-child), so why can't draining follow that natural direction.
Having to check for the parents' I/O, while draining the child, seems
wrong.  Perhaps we can't help it, but I cannot understand the reason.

Paolo



Re: [Qemu-block] [PATCH 07/19] block: Really pause block jobs on drain

2018-04-12 Thread Kevin Wolf
Am 12.04.2018 um 12:12 hat Paolo Bonzini geschrieben:
> On 12/04/2018 11:51, Kevin Wolf wrote:
> > Am 12.04.2018 um 10:37 hat Paolo Bonzini geschrieben:
> >> On 11/04/2018 18:39, Kevin Wolf wrote:
> >>> +bool bdrv_drain_poll(BlockDriverState *bs, bool top_level)
> >>>  {
> >>>      /* Execute pending BHs first and check everything else only after the BHs
> >>>       * have executed. */
> >>> -    while (aio_poll(bs->aio_context, false));
> >>> +    if (top_level) {
> >>> +        while (aio_poll(bs->aio_context, false));
> >>> +    }
> >>> +
> >>> +    if (bdrv_parent_drained_poll(bs)) {
> >>> +        return true;
> >>> +    }
> >>> +
> >>>      return atomic_read(&bs->in_flight);
> >>>  }
> >>>  
> >>
> >> Up until now I liked very much this series, but I like this patch a bit
> >> less for two reasons.
> >>
> >> 1) I think I would prefer to have the !top_level case in a separate
> >> function---making the API easier to use in the BdrvChildRole callback
> >> because there is no need to pass false.
> > 
> > Basically just move the aio_poll() out to a different function that
> > calls bdrv_drain_poll afterwards? Maybe it's a bit cleaner, yes.
> > However, see below.
> 
> Yes.
> 
> >> In addition, the callback is not really polling anything, but rather
> >> returning whether the children are quiescent.  So a better name would
> >> be bdrv_children_drained and c->role->drained.
> > 
> > Why isn't it polling? We're actively checking the state of the node, and
> > we keep calling the callback until it has the expected state. Would it
> > only become polling for you if the loop were in this function rather
> > than in its caller?
> 
> It's just checking the status, it's not invoking the event loop.  The
> polling (in the sense of aio_poll or AIO_POLL_WHILE) is done elsewhere,
> this function is just the condition.  It's just nomenclature I guess.

Yes, it doesn't poll the AioContext events. I don't think that the name
implies that, though. It does poll the state of the block node and its
users.

> >> 2) Worse, the main idea behind the first drain restructuring was that
> >> draining could proceed in topological order: first drain the roots' I/O,
> >> then call bdrv_drain to send the last requests to their children, then
> >> recurse.  It is not clear to me why you need to introduce this recursive
> >> step, which is also O(n^2) in the worst case.
> > 
> > I need to introduce it because it fixes the bug that we don't wait until
> > the parents are actually quiesced and don't send new requests any more.
> > I don't see how this could be fixed without going to the parents.
> 
> Yes, you do need to go to the parents.  I don't understand however why
> you need more than a walk of the graph in parent-before-child order
> (which is a topological order, so it is reverse depth-first order and
> it's easy to do the walk in a recursive function).  If you're draining X
> below:
> 
>  A
>  |
>  B   C
>   \ /
>    X
> 
> then you start by draining A/B/C in topological order (so A before B).
> If B happens to be already quiescent, you can skip not only B but A too.

Are we talking only about bdrv_drain_poll() here or the complete
bdrv_do_drained_begin() call?

If you don't want to recurse separately for .drain_begin and the polling
phase, but keep everything in one function as it is before this series,
you can't skip anything just because it's already quiescent. You don't
have control over that other drain and it might end too early; also,
.drain_begin/end must come in pairs.

If we're only talking about .drain_poll, then in theory you could skip A
if B knows that it's fully quiesced (i.e. not only .drain_begin was
called, but also .drain_poll has returned false at least once). We don't
store this information anywhere, though, and I'm not sure if the saved
recursion would be worth additional state.

>  If the nodes disappear or move elsewhere in the graph it's okay, you've
> just done useless work.

Not really. The node is drained now and bdrv_do_drained_end() won't end
the drained section because the moved node isn't reachable any more from
this original node.

> When you're done you ensure that every parent
> is quiescent, and if so you're done.  If it's not, a new parent
> appeared---drain that too, using the same parent-before-child order, and
> loop.

That might work if we only ever were in a single drained section and
BlockDriverState had a bool quiesced. What we do have is a
quiesce_count, and when you look at the quiesce_count of a node, you
can't tell whether it came from you or someone else. So there is no way
to tell whether a parent was already drained by you or not.

And to be honest, even if it worked, it would still seem kind of
complicated to me. Just calling .drain_begin everywhere first and then
polling until nobody has anything in flight any more feels much simpler
conceptually than keeping track of which nodes we drained and repeating
that until no new nodes appear, and someh

Re: [Qemu-block] [PATCH 07/19] block: Really pause block jobs on drain

2018-04-12 Thread Paolo Bonzini
On 12/04/2018 11:51, Kevin Wolf wrote:
> Am 12.04.2018 um 10:37 hat Paolo Bonzini geschrieben:
>> On 11/04/2018 18:39, Kevin Wolf wrote:
>>> +bool bdrv_drain_poll(BlockDriverState *bs, bool top_level)
>>>  {
>>>      /* Execute pending BHs first and check everything else only after the BHs
>>>       * have executed. */
>>> -    while (aio_poll(bs->aio_context, false));
>>> +    if (top_level) {
>>> +        while (aio_poll(bs->aio_context, false));
>>> +    }
>>> +
>>> +    if (bdrv_parent_drained_poll(bs)) {
>>> +        return true;
>>> +    }
>>> +
>>>      return atomic_read(&bs->in_flight);
>>>  }
>>>  
>>
>> Up until now I liked very much this series, but I like this patch a bit
>> less for two reasons.
>>
>> 1) I think I would prefer to have the !top_level case in a separate
>> function---making the API easier to use in the BdrvChildRole callback
>> because there is no need to pass false.
> 
> Basically just move the aio_poll() out to a different function that
> calls bdrv_drain_poll afterwards? Maybe it's a bit cleaner, yes.
> However, see below.

Yes.

>> In addition, the callback is not really polling anything, but rather
>> returning whether the children are quiescent.  So a better name would
>> be bdrv_children_drained and c->role->drained.
> 
> Why isn't it polling? We're actively checking the state of the node, and
> we keep calling the callback until it has the expected state. Would it
> only become polling for you if the loop were in this function rather
> than in its caller?

It's just checking the status, it's not invoking the event loop.  The
polling (in the sense of aio_poll or AIO_POLL_WHILE) is done elsewhere,
this function is just the condition.  It's just nomenclature I guess.

>> 2) Worse, the main idea behind the first drain restructuring was that
>> draining could proceed in topological order: first drain the roots' I/O,
>> then call bdrv_drain to send the last requests to their children, then
>> recurse.  It is not clear to me why you need to introduce this recursive
>> step, which is also O(n^2) in the worst case.
> 
> I need to introduce it because it fixes the bug that we don't wait until
> the parents are actually quiesced and don't send new requests any more.
> I don't see how this could be fixed without going to the parents.

Yes, you do need to go to the parents.  I don't understand however why
you need more than a walk of the graph in parent-before-child order
(which is a topological order, so it is reverse depth-first order and
it's easy to do the walk in a recursive function).  If you're draining X
below:

 A
 |
 B   C
  \ /
   X

then you start by draining A/B/C in topological order (so A before B).
If B happens to be already quiescent, you can skip not only B but A too.
 If the nodes disappear or move elsewhere in the graph it's okay, you've
just done useless work.  When you're done you ensure that every parent
is quiescent, and if so you're done.  If it's not, a new parent
appeared---drain that too, using the same parent-before-child order, and
loop.

Well, there is one gotcha: bdrv_ref protects against disappearance, but
bdrv_ref/bdrv_unref are not thread-safe.  Am I missing something else?

> Is the O(n²) that you mean that we recursively iterate all children in
> bdrv_do_drained_begin() (or later in the series in bdrv_drain_poll()),
> and then come back from the children when they poll their parents?

Yes.

Paolo

> We could do the same thing as for bdrv_parent_drained_begin(), i.e. pass
> the parent that we came from (BdrvChild *parent instead of bool
> top_level, NULL instead of top_level=true) and then skip that parent
> while calling the BdrvChildRole .drain_poll callbacks. Would that
> address your concerns?
> 
> In that solution, splitting the function by moving aio_poll() out
> wouldn't get rid of a parameter and simplify the API any more. It might
> still be cleaner, though?
> 
> Kevin
> 




Re: [Qemu-block] [PATCH 07/19] block: Really pause block jobs on drain

2018-04-12 Thread Kevin Wolf
Am 12.04.2018 um 10:37 hat Paolo Bonzini geschrieben:
> On 11/04/2018 18:39, Kevin Wolf wrote:
> > +bool bdrv_drain_poll(BlockDriverState *bs, bool top_level)
> >  {
> >      /* Execute pending BHs first and check everything else only after the BHs
> >       * have executed. */
> > -    while (aio_poll(bs->aio_context, false));
> > +    if (top_level) {
> > +        while (aio_poll(bs->aio_context, false));
> > +    }
> > +
> > +    if (bdrv_parent_drained_poll(bs)) {
> > +        return true;
> > +    }
> > +
> >      return atomic_read(&bs->in_flight);
> >  }
> >  
> 
> Up until now I liked very much this series, but I like this patch a bit
> less for two reasons.
> 
> 1) I think I would prefer to have the !top_level case in a separate
> function---making the API easier to use in the BdrvChildRole callback
> because there is no need to pass false.

Basically just move the aio_poll() out to a different function that
calls bdrv_drain_poll afterwards? Maybe it's a bit cleaner, yes.
However, see below.

> In addition, the callback is not really polling anything, but rather
> returning whether the children are quiescent.  So a better name would
> be bdrv_children_drained and c->role->drained.

Why isn't it polling? We're actively checking the state of the node, and
we keep calling the callback until it has the expected state. Would it
only become polling for you if the loop were in this function rather
than in its caller?

> 2) Worse, the main idea behind the first drain restructuring was that
> draining could proceed in topological order: first drain the roots' I/O,
> then call bdrv_drain to send the last requests to their children, then
> recurse.  It is not clear to me why you need to introduce this recursive
> step, which is also O(n^2) in the worst case.

I need to introduce it because it fixes the bug that we don't wait until
the parents are actually quiesced and don't send new requests any more.
I don't see how this could be fixed without going to the parents.

Is the O(n²) that you mean that we recursively iterate all children in
bdrv_do_drained_begin() (or later in the series in bdrv_drain_poll()),
and then come back from the children when they poll their parents?

We could do the same thing as for bdrv_parent_drained_begin(), i.e. pass
the parent that we came from (BdrvChild *parent instead of bool
top_level, NULL instead of top_level=true) and then skip that parent
while calling the BdrvChildRole .drain_poll callbacks. Would that
address your concerns?
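
Sketched out, that alternative could look like this (names and signatures are
assumptions, including that the .drain_poll callback takes the BdrvChild):

static bool bdrv_parent_drained_poll(BlockDriverState *bs, BdrvChild *ignore)
{
    BdrvChild *c;
    bool busy = false;

    QLIST_FOREACH(c, &bs->parents, next_parent) {
        if (c == ignore) {
            continue;   /* skip the parent the recursion came from */
        }
        if (c->role->drain_poll) {
            busy |= c->role->drain_poll(c);
        }
    }

    return busy;
}

bool bdrv_drain_poll(BlockDriverState *bs, BdrvChild *ignore_parent)
{
    if (bdrv_parent_drained_poll(bs, ignore_parent)) {
        return true;
    }

    return atomic_read(&bs->in_flight);
}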

In that solution, splitting the function by moving aio_poll() out
wouldn't get rid of a parameter and simplify the API any more. It might
still be cleaner, though?

Kevin



Re: [Qemu-block] [PATCH 07/19] block: Really pause block jobs on drain

2018-04-12 Thread Paolo Bonzini
On 11/04/2018 18:39, Kevin Wolf wrote:
> +bool bdrv_drain_poll(BlockDriverState *bs, bool top_level)
>  {
>      /* Execute pending BHs first and check everything else only after the BHs
>       * have executed. */
> -    while (aio_poll(bs->aio_context, false));
> +    if (top_level) {
> +        while (aio_poll(bs->aio_context, false));
> +    }
> +
> +    if (bdrv_parent_drained_poll(bs)) {
> +        return true;
> +    }
> +
>      return atomic_read(&bs->in_flight);
>  }
>  

Up until now I liked very much this series, but I like this patch a bit
less for two reasons.

1) I think I would prefer to have the !top_level case in a separate
function---making the API easier to use in the BdrvChildRole callback
because there is no need to pass false.  In addition, the callback is
not really polling anything, but rather returning whether the children
are quiescent.  So a better name would be bdrv_children_drained and
c->role->drained.
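
Concretely, the split in 1) could look something like this (names are only
placeholders to show the shape):

/* Variant without the BH loop, for use from the BdrvChildRole callback */
static bool bdrv_drain_poll_inner(BlockDriverState *bs)
{
    if (bdrv_parent_drained_poll(bs)) {
        return true;
    }

    return atomic_read(&bs->in_flight);
}

/* Top-level variant: execute pending BHs first, then do the same check */
bool bdrv_drain_poll(BlockDriverState *bs)
{
    while (aio_poll(bs->aio_context, false));

    return bdrv_drain_poll_inner(bs);
}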

2) Worse, the main idea behind the first drain restructuring was that
draining could proceed in topological order: first drain the roots' I/O,
then call bdrv_drain to send the last requests to their children, then
recurse.  It is not clear to me why you need to introduce this recursive
step, which is also O(n^2) in the worst case.

Paolo