Re: [PATCH] backup: don't acquire aio_context in backup_clean

2020-03-26 Thread Vladimir Sementsov-Ogievskiy

26.03.2020 12:43, Stefan Reiter wrote:
> On 26/03/2020 06:54, Vladimir Sementsov-Ogievskiy wrote:
>> 25.03.2020 18:50, Stefan Reiter wrote:
>>> backup_clean is only ever called as a handler via job_exit, which
>>
>> Hmm.. I'm afraid it's not quite correct.
>>
>> job_clean
>>   job_finalize_single
>>     job_completed_txn_abort (lock aio context)
>>     job_do_finalize
>>
>> Hmm. job_do_finalize calls job_completed_txn_abort, which takes care to
>> lock the aio context.. And at the same time, it directly calls
>> job_txn_apply(job->txn, job_finalize_single) without locking. Is it a bug?



> I think, as you say, the idea is that job_do_finalize is always called with
> the lock acquired. That's why job_completed_txn_abort takes care to release
> the lock (at least the "outer_ctx", as it calls it) before reacquiring it.


>> And, even if job_do_finalize is always called with the context locked,
>> where is the guarantee that the contexts of all jobs in the txn are
>> locked?



> I also don't see anything that guarantees that... I guess it could be
> adapted to handle locks like job_completed_txn_abort does?
>
> Haven't looked into transactions too much, but does it even make sense to
> have jobs in different contexts in one transaction?


Why not? Assume backing up two disks in one transaction, each in a separate
IO thread.. (honestly, I don't know whether that works)




>> Still, let's look through its callers.
>>
>> job_finalize
>>   qmp_block_job_finalize (lock aio context)
>>   qmp_job_finalize (lock aio context)
>>   test_cancel_concluded (doesn't lock, but it's a test)
>>
>> job_completed_txn_success
>>   job_completed
>>     job_exit (lock aio context)
>>     job_cancel
>>       blockdev_mark_auto_del (lock aio context)
>>       job_user_cancel
>>         qmp_block_job_cancel (locks context)
>>         qmp_job_cancel (locks context)
>>       job_cancel_err
>>         job_cancel_sync (return job_finish_sync(job, &job_cancel_err,
>>         NULL);, job_finish_sync just calls callback)
>>           replication_close (it's .bdrv_close.. Hmm, I don't see context
>>           locking, where is it?)

> Hm, don't see it either. This might indeed be a way to get to job_clean
> without a lock held.
>
> I don't have any testing set up for replication atm, but if you believe
> this would be correct I can send a patch for that as well (just acquire
> the lock in replication_close before job_cancel_async?).

I don't know.. But sending a patch is a good way to start a discussion)
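
Such a patch might look roughly like this sketch (the s->commit_job field
name and the exact placement inside replication_close are assumptions and
may well differ in the actual source):

    static void replication_close(BlockDriverState *bs)
    {
        BDRVReplicationState *s = bs->opaque;

        if (s->stage == BLOCK_REPLICATION_FAILOVER) {
            Job *commit_job = &s->commit_job->job;
            AioContext *ctx = commit_job->aio_context;

            /* Hold the job's context so job_clean runs locked. */
            aio_context_acquire(ctx);
            job_cancel_sync(commit_job);
            aio_context_release(ctx);
        }

        /* ... rest of the close handler unchanged ... */
    }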





>>           replication_stop (locks context)
>>           drive_backup_abort (locks context)
>>           blockdev_backup_abort (locks context)
>>           job_cancel_sync_all (locks context)
>>           cancel_common (locks context)
>>       test_* (I don't care)



> To clarify, aside from the commit message the patch itself does not appear
> to be wrong? All paths (aside from replication_close mentioned above)
> guarantee the job lock to be held.


I worry more about the case of a transaction with jobs from different aio
contexts than about replication..

Anyway, I hope that someone with a better understanding of these things
will look at this.

It's usually not a good idea to send a [PATCH] inside a discussion thread;
it'd be better as a separate thread, to be more visible.

Maybe you could send a separate series, which would include this patch,
some fix for replication, and an attempt to fix job_do_finalize in some
way, and we can continue the discussion from that new series?




>>> already acquires the job's context. The job's context is guaranteed to
>>> be the same as the one used by backup_top via backup_job_create.
>>>
>>> Since the previous logic effectively acquired the lock twice, this
>>> broke cleanup of backups for disks using IO threads, since the
>>> BDRV_POLL_WHILE in bdrv_backup_top_drop -> bdrv_do_drained_begin would
>>> only release the lock once, thus deadlocking with the IO thread.
>>>
>>> Signed-off-by: Stefan Reiter 
>>
>> Just note that this thing was recently touched by 0abf2581717a19, so
>> adding Sergio (its author) to CC.


>>> ---
>>>
>>> This is a fix for the issue discussed in this part of the thread:
>>> https://lists.gnu.org/archive/html/qemu-devel/2020-03/msg07639.html
>>> ...not the original problem (core dump) posted by Dietmar.
>>>
>>> I've still seen it occasionally hang during a backup abort. I'm trying
>>> to figure out why that happens; the stack trace indicates a similar
>>> problem with the main thread hanging at bdrv_do_drained_begin, though I
>>> have no clue why as of yet.
>>>
>>>  block/backup.c | 4 ----
>>>  1 file changed, 4 deletions(-)
>>>
>>> diff --git a/block/backup.c b/block/backup.c
>>> index 7430ca5883..a7a7dcaf4c 100644
>>> --- a/block/backup.c
>>> +++ b/block/backup.c
>>> @@ -126,11 +126,7 @@ static void backup_abort(Job *job)
>>>  static void backup_clean(Job *job)
>>>  {
>>>      BackupBlockJob *s = container_of(job, BackupBlockJob, common.job);
>>> -    AioContext *aio_context = bdrv_get_aio_context(s->backup_top);
>>> -
>>> -    aio_context_acquire(aio_context);
>>>      bdrv_backup_top_drop(s->backup_top);
>>> -    aio_context_release(aio_context);
>>>  }
Re: [PATCH] backup: don't acquire aio_context in backup_clean

2020-03-26 Thread Sergio Lopez
On Thu, Mar 26, 2020 at 08:54:53AM +0300, Vladimir Sementsov-Ogievskiy wrote:
> 25.03.2020 18:50, Stefan Reiter wrote:
> > backup_clean is only ever called as a handler via job_exit, which
> 
> Hmm.. I'm afraid it's not quite correct.
> 
> job_clean
>   job_finalize_single
>     job_completed_txn_abort (lock aio context)
>     job_do_finalize
> 
> Hmm. job_do_finalize calls job_completed_txn_abort, which takes care to
> lock the aio context.. And at the same time, it directly calls
> job_txn_apply(job->txn, job_finalize_single) without locking. Is it a bug?

Indeed, looks like a bug to me. In fact, it's the one causing the issue
that Dietmar initially reported.

I think the proper fix is to drop the context acquisition/release in
backup_clean that I added in 0abf2581717a19, as Stefan proposed, and to
also acquire the context of "foreign" jobs in job_txn_apply, just as
job_completed_txn_abort does.
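
A rough sketch of that second part, modeled on job_completed_txn_abort's
per-job locking (a sketch only, not the final patch; the changed signature
and the job_ref/job_unref bracketing are assumptions about how it could be
done):

    static void job_txn_apply(Job *job, void fn(Job *))
    {
        JobTxn *txn = job->txn;
        Job *other_job, *next;
        AioContext *ctx;

        /* Drop the caller's lock first so no context is ever held
         * twice; keep a reference so the job survives the release. */
        job_ref(job);
        aio_context_release(job->aio_context);

        QLIST_FOREACH_SAFE(other_job, &txn->jobs, txn_list, next) {
            ctx = other_job->aio_context;
            aio_context_acquire(ctx);    /* lock this member's context */
            fn(other_job);
            aio_context_release(ctx);
        }

        aio_context_acquire(job->aio_context);  /* restore caller's lock */
        job_unref(job);
    }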

Thanks,
Sergio.

> And, even if job_do_finalize is always called with the context locked,
> where is the guarantee that the contexts of all jobs in the txn are
> locked?
> 
> Still, let's look through its callers.
> 
> job_finalize
>   qmp_block_job_finalize (lock aio context)
>   qmp_job_finalize (lock aio context)
>   test_cancel_concluded (doesn't lock, but it's a test)
> 
> job_completed_txn_success
>   job_completed
>     job_exit (lock aio context)
>     job_cancel
>       blockdev_mark_auto_del (lock aio context)
>       job_user_cancel
>         qmp_block_job_cancel (locks context)
>         qmp_job_cancel (locks context)
>       job_cancel_err
>         job_cancel_sync (return job_finish_sync(job, &job_cancel_err,
>         NULL);, job_finish_sync just calls callback)
>           replication_close (it's .bdrv_close.. Hmm, I don't see context
>           locking, where is it?)
>           replication_stop (locks context)
>           drive_backup_abort (locks context)
>           blockdev_backup_abort (locks context)
>           job_cancel_sync_all (locks context)
>           cancel_common (locks context)
>       test_* (I don't care)
> 
> > already acquires the job's context. The job's context is guaranteed to
> > be the same as the one used by backup_top via backup_job_create.
> > 
> > Since the previous logic effectively acquired the lock twice, this
> > broke cleanup of backups for disks using IO threads, since the
> > BDRV_POLL_WHILE in bdrv_backup_top_drop -> bdrv_do_drained_begin would
> > only release the lock once, thus deadlocking with the IO thread.
> > 
> > Signed-off-by: Stefan Reiter 
> 
> Just note that this thing was recently touched by 0abf2581717a19, so
> adding Sergio (its author) to CC.
> 
> > ---
> > 
> > This is a fix for the issue discussed in this part of the thread:
> > https://lists.gnu.org/archive/html/qemu-devel/2020-03/msg07639.html
> > ...not the original problem (core dump) posted by Dietmar.
> > 
> > I've still seen it occasionally hang during a backup abort. I'm trying
> > to figure out why that happens; the stack trace indicates a similar
> > problem with the main thread hanging at bdrv_do_drained_begin, though I
> > have no clue why as of yet.
> > 
> >  block/backup.c | 4 ----
> >  1 file changed, 4 deletions(-)
> > 
> > diff --git a/block/backup.c b/block/backup.c
> > index 7430ca5883..a7a7dcaf4c 100644
> > --- a/block/backup.c
> > +++ b/block/backup.c
> > @@ -126,11 +126,7 @@ static void backup_abort(Job *job)
> >  static void backup_clean(Job *job)
> >  {
> >      BackupBlockJob *s = container_of(job, BackupBlockJob, common.job);
> > -    AioContext *aio_context = bdrv_get_aio_context(s->backup_top);
> > -
> > -    aio_context_acquire(aio_context);
> >      bdrv_backup_top_drop(s->backup_top);
> > -    aio_context_release(aio_context);
> >  }
> > 
> >  void backup_do_checkpoint(BlockJob *job, Error **errp)
> > 
> 
> 
> -- 
> Best regards,
> Vladimir
> 




Re: [PATCH] backup: don't acquire aio_context in backup_clean

2020-03-26 Thread Stefan Reiter

On 26/03/2020 06:54, Vladimir Sementsov-Ogievskiy wrote:
> 25.03.2020 18:50, Stefan Reiter wrote:
>> backup_clean is only ever called as a handler via job_exit, which
>
> Hmm.. I'm afraid it's not quite correct.
>
> job_clean
>   job_finalize_single
>     job_completed_txn_abort (lock aio context)
>     job_do_finalize
>
> Hmm. job_do_finalize calls job_completed_txn_abort, which takes care to
> lock the aio context.. And at the same time, it directly calls
> job_txn_apply(job->txn, job_finalize_single) without locking. Is it a bug?



I think, as you say, the idea is that job_do_finalize is always called with
the lock acquired. That's why job_completed_txn_abort takes care to release
the lock (at least the "outer_ctx", as it calls it) before reacquiring it.


> And, even if job_do_finalize is always called with the context locked,
> where is the guarantee that the contexts of all jobs in the txn are
> locked?



I also don't see anything that guarantees that... I guess it could be
adapted to handle locks like job_completed_txn_abort does?

Haven't looked into transactions too much, but does it even make sense to
have jobs in different contexts in one transaction?



> Still, let's look through its callers.
>
> job_finalize
>   qmp_block_job_finalize (lock aio context)
>   qmp_job_finalize (lock aio context)
>   test_cancel_concluded (doesn't lock, but it's a test)
>
> job_completed_txn_success
>   job_completed
>     job_exit (lock aio context)
>     job_cancel
>       blockdev_mark_auto_del (lock aio context)
>       job_user_cancel
>         qmp_block_job_cancel (locks context)
>         qmp_job_cancel (locks context)
>       job_cancel_err
>         job_cancel_sync (return job_finish_sync(job, &job_cancel_err,
>         NULL);, job_finish_sync just calls callback)
>           replication_close (it's .bdrv_close.. Hmm, I don't see context
>           locking, where is it?)

Hm, don't see it either. This might indeed be a way to get to job_clean
without a lock held.

I don't have any testing set up for replication atm, but if you believe
this would be correct I can send a patch for that as well (just acquire
the lock in replication_close before job_cancel_async?).




>           replication_stop (locks context)
>           drive_backup_abort (locks context)
>           blockdev_backup_abort (locks context)
>           job_cancel_sync_all (locks context)
>           cancel_common (locks context)
>       test_* (I don't care)



To clarify, aside from the commit message the patch itself does not 
appear to be wrong? All paths (aside from replication_close mentioned 
above) guarantee the job lock to be held.



>> already acquires the job's context. The job's context is guaranteed to
>> be the same as the one used by backup_top via backup_job_create.
>>
>> Since the previous logic effectively acquired the lock twice, this
>> broke cleanup of backups for disks using IO threads, since the
>> BDRV_POLL_WHILE in bdrv_backup_top_drop -> bdrv_do_drained_begin would
>> only release the lock once, thus deadlocking with the IO thread.
>>
>> Signed-off-by: Stefan Reiter 
>
> Just note that this thing was recently touched by 0abf2581717a19, so
> adding Sergio (its author) to CC.



>> ---
>>
>> This is a fix for the issue discussed in this part of the thread:
>> https://lists.gnu.org/archive/html/qemu-devel/2020-03/msg07639.html
>> ...not the original problem (core dump) posted by Dietmar.
>>
>> I've still seen it occasionally hang during a backup abort. I'm trying
>> to figure out why that happens; the stack trace indicates a similar
>> problem with the main thread hanging at bdrv_do_drained_begin, though I
>> have no clue why as of yet.
>>
>>  block/backup.c | 4 ----
>>  1 file changed, 4 deletions(-)
>>
>> diff --git a/block/backup.c b/block/backup.c
>> index 7430ca5883..a7a7dcaf4c 100644
>> --- a/block/backup.c
>> +++ b/block/backup.c
>> @@ -126,11 +126,7 @@ static void backup_abort(Job *job)
>>  static void backup_clean(Job *job)
>>  {
>>      BackupBlockJob *s = container_of(job, BackupBlockJob, common.job);
>> -    AioContext *aio_context = bdrv_get_aio_context(s->backup_top);
>> -
>> -    aio_context_acquire(aio_context);
>>      bdrv_backup_top_drop(s->backup_top);
>> -    aio_context_release(aio_context);
>>  }
>>
>>  void backup_do_checkpoint(BlockJob *job, Error **errp)









Re: [PATCH] backup: don't acquire aio_context in backup_clean

2020-03-25 Thread Vladimir Sementsov-Ogievskiy

25.03.2020 18:50, Stefan Reiter wrote:
> backup_clean is only ever called as a handler via job_exit, which

Hmm.. I'm afraid it's not quite correct.

job_clean
  job_finalize_single
    job_completed_txn_abort (lock aio context)
    job_do_finalize

Hmm. job_do_finalize calls job_completed_txn_abort, which takes care to
lock the aio context.. And at the same time, it directly calls
job_txn_apply(job->txn, job_finalize_single) without locking. Is it a bug?

And, even if job_do_finalize is always called with the context locked,
where is the guarantee that the contexts of all jobs in the txn are locked?

Still, let's look through its callers.

job_finalize
  qmp_block_job_finalize (lock aio context)
  qmp_job_finalize (lock aio context)
  test_cancel_concluded (doesn't lock, but it's a test)

job_completed_txn_success
  job_completed
    job_exit (lock aio context)
    job_cancel
      blockdev_mark_auto_del (lock aio context)
      job_user_cancel
        qmp_block_job_cancel (locks context)
        qmp_job_cancel (locks context)
      job_cancel_err
        job_cancel_sync (return job_finish_sync(job, &job_cancel_err,
        NULL);, job_finish_sync just calls callback)
          replication_close (it's .bdrv_close.. Hmm, I don't see context
          locking, where is it?)
          replication_stop (locks context)
          drive_backup_abort (locks context)
          blockdev_backup_abort (locks context)
          job_cancel_sync_all (locks context)
          cancel_common (locks context)
      test_* (I don't care)


> already acquires the job's context. The job's context is guaranteed to
> be the same as the one used by backup_top via backup_job_create.
>
> Since the previous logic effectively acquired the lock twice, this
> broke cleanup of backups for disks using IO threads, since the
> BDRV_POLL_WHILE in bdrv_backup_top_drop -> bdrv_do_drained_begin would
> only release the lock once, thus deadlocking with the IO thread.
>
> Signed-off-by: Stefan Reiter 

Just note that this thing was recently touched by 0abf2581717a19, so adding
Sergio (its author) to CC.


> ---
>
> This is a fix for the issue discussed in this part of the thread:
> https://lists.gnu.org/archive/html/qemu-devel/2020-03/msg07639.html
> ...not the original problem (core dump) posted by Dietmar.
>
> I've still seen it occasionally hang during a backup abort. I'm trying to
> figure out why that happens; the stack trace indicates a similar problem
> with the main thread hanging at bdrv_do_drained_begin, though I have no
> clue why as of yet.
>
>  block/backup.c | 4 ----
>  1 file changed, 4 deletions(-)
>
> diff --git a/block/backup.c b/block/backup.c
> index 7430ca5883..a7a7dcaf4c 100644
> --- a/block/backup.c
> +++ b/block/backup.c
> @@ -126,11 +126,7 @@ static void backup_abort(Job *job)
>  static void backup_clean(Job *job)
>  {
>      BackupBlockJob *s = container_of(job, BackupBlockJob, common.job);
> -    AioContext *aio_context = bdrv_get_aio_context(s->backup_top);
> -
> -    aio_context_acquire(aio_context);
>      bdrv_backup_top_drop(s->backup_top);
> -    aio_context_release(aio_context);
>  }
>
>  void backup_do_checkpoint(BlockJob *job, Error **errp)





--
Best regards,
Vladimir



[PATCH] backup: don't acquire aio_context in backup_clean

2020-03-25 Thread Stefan Reiter
backup_clean is only ever called as a handler via job_exit, which
already acquires the job's context. The job's context is guaranteed to
be the same as the one used by backup_top via backup_job_create.

Since the previous logic effectively acquired the lock twice, this
broke cleanup of backups for disks using IO threads, since the BDRV_POLL_WHILE
in bdrv_backup_top_drop -> bdrv_do_drained_begin would only release the lock
once, thus deadlocking with the IO thread.

Signed-off-by: Stefan Reiter 
---

This is a fix for the issue discussed in this part of the thread:
https://lists.gnu.org/archive/html/qemu-devel/2020-03/msg07639.html
...not the original problem (core dump) posted by Dietmar.

I've still seen it occasionally hang during a backup abort. I'm trying to
figure out why that happens; the stack trace indicates a similar problem
with the main thread hanging at bdrv_do_drained_begin, though I have no
clue why as of yet.

 block/backup.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/block/backup.c b/block/backup.c
index 7430ca5883..a7a7dcaf4c 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -126,11 +126,7 @@ static void backup_abort(Job *job)
 static void backup_clean(Job *job)
 {
     BackupBlockJob *s = container_of(job, BackupBlockJob, common.job);
-    AioContext *aio_context = bdrv_get_aio_context(s->backup_top);
-
-    aio_context_acquire(aio_context);
     bdrv_backup_top_drop(s->backup_top);
-    aio_context_release(aio_context);
 }
 
 void backup_do_checkpoint(BlockJob *job, Error **errp)
-- 
2.25.2
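
To illustrate the deadlock the commit message describes, here is a
condensed sketch of the broken sequence (paraphrased, not actual QEMU
source; AioContext locks are recursive, so the second acquire succeeds):

    aio_context_acquire(aio_context);    /* taken by job_exit() */
    aio_context_acquire(aio_context);    /* taken again by the old
                                          * backup_clean() */

    /* bdrv_backup_top_drop() drains the node; the BDRV_POLL_WHILE loop
     * inside bdrv_do_drained_begin() releases the lock exactly once
     * before polling, leaving one level still held -- so the IO thread,
     * which needs the lock to make progress, blocks forever. */
    bdrv_backup_top_drop(s->backup_top);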