Re: [PATCH v2 0/3] Fix some AIO context locking in jobs

2020-03-27 Thread Dietmar Maurer
> I *think* the second patch also fixes the hangs on backup abort that I and
> Dietmar noticed in v1, but I'm not sure, they were somewhat intermittent
> before too.

After more testing, I am 100% sure the bug (or another one) is still there.
Here is how to trigger it:

1. use the latest qemu sources from GitHub
2. apply those 3 patches from Stefan
3. create a VM with a virtio-scsi-single drive using an iothread (see the
   example command line below)
4. inside the VM, install Debian buster
5. inside the VM, run "stress -d 5"
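For reference, this is roughly the plain-QEMU command line I mean (my own
reconstruction of the setup, not the exact configuration used; disk image,
memory size and QMP socket path are placeholders):

#!/bin/bash
# One virtio-scsi controller with a dedicated iothread, one scsi-hd disk on
# it, plus a QMP socket so the backup loop below can drive it.
qemu-system-x86_64 \
    -accel kvm -m 4096 -smp 4 \
    -object iothread,id=iothread-scsi0 \
    -device virtio-scsi-pci,id=scsihw0,iothread=iothread-scsi0 \
    -drive file=/tmp/debian-buster.raw,format=raw,if=none,id=drive-scsi0,cache=none,aio=native \
    -device scsi-hd,bus=scsihw0.0,drive=drive-scsi0 \
    -qmp unix:/tmp/backup-test.qmp,server,nowait \
    -nographic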

Then run a series of backups, aborting them after a few seconds:

# start loop

qmp: { "execute": "drive-backup", "arguments": { "device": "drive-scsi0",
"sync": "full", "target": "backup-scsi0.raw" } }

sleep 3 seconds

qmp: { "execute": "block-job-cancel", "arguments": { "device": "drive-scsi0" } }

# end loop
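Scripted, the loop looks roughly like this (a minimal sketch; the socat
invocation and the /tmp/backup-test.qmp socket path are assumptions matching
the example command line above, not part of the original runs):

#!/bin/bash
# Repeatedly start a full drive-backup and cancel it a few seconds later,
# until the VM freezes.
QMP_SOCK=/tmp/backup-test.qmp

qmp() {
    # QMP requires negotiating capabilities on every new connection before
    # the actual command; "-t 5" keeps the socket open long enough to
    # collect the replies.
    printf '%s\n' '{"execute":"qmp_capabilities"}' "$1" \
        | socat -t 5 - "UNIX-CONNECT:$QMP_SOCK"
}

i=0
while true; do
    i=$((i + 1))
    echo "=== iteration $i ==="
    qmp '{"execute":"drive-backup","arguments":{"device":"drive-scsi0","sync":"full","target":"backup-scsi0.raw"}}'
    sleep 3
    qmp '{"execute":"block-job-cancel","arguments":{"device":"drive-scsi0"}}'
    sleep 2   # give the cancel a moment before starting the next backup
done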

After several iterations (mostly < 50) the VM freezes (this time somewhere 
inside drive_backup_prepare):


(gdb) bt
#0  0x7f61ea09e916 in __GI_ppoll (fds=0x7f6158130c40, nfds=2, 
timeout=, timeout@entry=0x0, sigmask=sigmask@entry=0x0)
at ../sysdeps/unix/sysv/linux/ppoll.c:39
#1  0x55f708401c79 in ppoll (__ss=0x0, __timeout=0x0, __nfds=, __fds=)
at /usr/include/x86_64-linux-gnu/bits/poll2.h:77
#2  0x55f708401c79 in qemu_poll_ns (fds=, nfds=, timeout=timeout@entry=-1) at util/qemu-timer.c:335
#3  0x55f708404461 in fdmon_poll_wait (ctx=0x7f61dcd05e80, 
ready_list=0x7ffc4e7fbde8, timeout=-1) at util/fdmon-poll.c:79
#4  0x55f708403a47 in aio_poll (ctx=0x7f61dcd05e80, 
blocking=blocking@entry=true) at util/aio-posix.c:589
#5  0x55f708364c03 in bdrv_do_drained_begin (poll=, 
ignore_bds_parents=false, parent=0x0, recursive=false, bs=0x7f61dcd4c500)
at block/io.c:429
#6  0x55f708364c03 in bdrv_do_drained_begin
(bs=0x7f61dcd4c500, recursive=, parent=0x0, 
ignore_bds_parents=, poll=) at block/io.c:395
#7  0x55f7081016f9 in drive_backup_prepare (common=0x7f61d9a0c280, 
errp=0x7ffc4e7fbf28) at blockdev.c:1755
#8  0x55f708103e6a in qmp_transaction 
(dev_list=dev_list@entry=0x7ffc4e7fbfa0, has_props=has_props@entry=false, 
props=0x7f61d9a304e8, 
props@entry=0x0, errp=errp@entry=0x7ffc4e7fbfd8) at blockdev.c:2401
#9  0x55f708105322 in blockdev_do_action (errp=0x7ffc4e7fbfd8, 
action=0x7ffc4e7fbf90) at blockdev.c:1054
#10 0x55f708105322 in qmp_drive_backup (backup=backup@entry=0x7ffc4e7fbfe0, 
errp=errp@entry=0x7ffc4e7fbfd8) at blockdev.c:3129
#11 0x55f7082c0101 in qmp_marshal_drive_backup (args=, 
ret=, errp=0x7ffc4e7fc0b8)
at qapi/qapi-commands-block-core.c:555
#12 0x55f7083b7338 in qmp_dispatch (cmds=0x55f708904000 , 
request=, allow_oob=)
at qapi/qmp-dispatch.c:155
#13 0x55f7082a1bd1 in monitor_qmp_dispatch (mon=0x7f61dcd15d80, 
req=) at monitor/qmp.c:145
#14 0x55f7082a23ba in monitor_qmp_bh_dispatcher (data=) at 
monitor/qmp.c:234
#15 0x55f708400205 in aio_bh_call (bh=0x7f61dd28f960) at util/async.c:164
#16 0x55f708400205 in aio_bh_poll (ctx=ctx@entry=0x7f61dd33ef80) at 
util/async.c:164
#17 0x55f70840388e in aio_dispatch (ctx=0x7f61dd33ef80) at 
util/aio-posix.c:380
#18 0x55f7084000ee in aio_ctx_dispatch (source=, 
callback=, user_data=) at util/async.c:298
#19 0x7f61ec069f2e in g_main_context_dispatch () at 
/usr/lib/x86_64-linux-gnu/libglib-2.0.so.0
#20 0x55f708402af8 in glib_pollfds_poll () at util/main-loop.c:219
#21 0x55f708402af8 in os_host_main_loop_wait (timeout=) at 
util/main-loop.c:242
#22 0x55f708402af8 in main_loop_wait (nonblocking=nonblocking@entry=0) at 
util/main-loop.c:518
#23 0x55f70809e589 in qemu_main_loop () at 
/home/dietmar/pve5-devel/mirror_qemu/softmmu/vl.c:1665
#24 0x55f707fa2c3e in main (argc=, argv=, 
envp=)
at /home/dietmar/pve5-devel/mirror_qemu/softmmu/main.c:49




Re: [PATCH v2 0/3] Fix some AIO context locking in jobs

2020-03-27 Thread Dietmar Maurer
Wait - maybe this was a bug in my test setup - I am unable to reproduce it now...

@Stefan Reiter: Are you able to trigger this?

> > I *think* the second patch also fixes the hangs on backup abort that I and
> > Dietmar noticed in v1, but I'm not sure, they were somewhat intermittent
> > before too.
> 
> No, I still get this freeze:
> 
> #0  0x7f0aa4866916 in __GI_ppoll (fds=0x7f0a12935c40, nfds=2,
>     timeout=<optimized out>, timeout@entry=0x0, sigmask=sigmask@entry=0x0)
>     at ../sysdeps/unix/sysv/linux/ppoll.c:39
> #1  0x55d3a6c91d29 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>, __fds=<optimized out>)
>     at /usr/include/x86_64-linux-gnu/bits/poll2.h:77
> #2  0x55d3a6c91d29 in qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, timeout=timeout@entry=-1) at util/qemu-timer.c:335
> #3  0x55d3a6c94511 in fdmon_poll_wait (ctx=0x7f0a97505e80,
>     ready_list=0x7fff67e5c358, timeout=-1) at util/fdmon-poll.c:79
> #4  0x55d3a6c93af7 in aio_poll (ctx=0x7f0a97505e80,
>     blocking=blocking@entry=true) at util/aio-posix.c:589
> #5  0x55d3a6bf4cd3 in bdrv_do_drained_begin
>     (poll=<optimized out>, ignore_bds_parents=false, parent=0x0,
>     recursive=false, bs=0x7f0a9754c280) at block/io.c:429
> #6  0x55d3a6bf4cd3 in bdrv_do_drained_begin
>     (bs=0x7f0a9754c280, recursive=<optimized out>, parent=0x0,
>     ignore_bds_parents=<optimized out>, poll=<optimized out>) at block/io.c:395
> #7  0x55d3a6be5c87 in blk_drain (blk=0x7f0a97abcc00)
>     at block/block-backend.c:1617
> #8  0x55d3a6be686d in blk_unref (blk=0x7f0a97abcc00)
>     at block/block-backend.c:473
> #9  0x55d3a6b9e835 in block_job_free (job=0x7f0a15f44e00) at blockjob.c:89
> #10 0x55d3a6b9fe29 in job_unref (job=0x7f0a15f44e00) at job.c:376
> #11 0x55d3a6b9fe29 in job_unref (job=0x7f0a15f44e00) at job.c:368
> #12 0x55d3a6ba07aa in job_finish_sync (job=job@entry=0x7f0a15f44e00,
>     finish=finish@entry=0x55d3a6ba0cd0 , errp=errp@entry=0x0) at job.c:1004
> #13 0x55d3a6ba0cee in job_cancel_sync (job=job@entry=0x7f0a15f44e00)
>     at job.c:947




Re: [PATCH v2 0/3] Fix some AIO context locking in jobs

2020-03-26 Thread Dietmar Maurer
But I need to mention that it takes some time to reproduce this. This time I
ran/aborted about 500 backup jobs before it triggered.

> > I *think* the second patch also fixes the hangs on backup abort that I and
> > Dietmar noticed in v1, but I'm not sure, they were somewhat intermittent
> > before too.
> 
> No, I still get this freeze:
> 
> #0  0x7f0aa4866916 in __GI_ppoll (fds=0x7f0a12935c40, nfds=2,
>     timeout=<optimized out>, timeout@entry=0x0, sigmask=sigmask@entry=0x0)
>     at ../sysdeps/unix/sysv/linux/ppoll.c:39
> #1  0x55d3a6c91d29 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>, __fds=<optimized out>)
>     at /usr/include/x86_64-linux-gnu/bits/poll2.h:77
> #2  0x55d3a6c91d29 in qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, timeout=timeout@entry=-1) at util/qemu-timer.c:335
> #3  0x55d3a6c94511 in fdmon_poll_wait (ctx=0x7f0a97505e80,
>     ready_list=0x7fff67e5c358, timeout=-1) at util/fdmon-poll.c:79
> #4  0x55d3a6c93af7 in aio_poll (ctx=0x7f0a97505e80,
>     blocking=blocking@entry=true) at util/aio-posix.c:589
> #5  0x55d3a6bf4cd3 in bdrv_do_drained_begin
>     (poll=<optimized out>, ignore_bds_parents=false, parent=0x0,
>     recursive=false, bs=0x7f0a9754c280) at block/io.c:429
> #6  0x55d3a6bf4cd3 in bdrv_do_drained_begin
>     (bs=0x7f0a9754c280, recursive=<optimized out>, parent=0x0,
>     ignore_bds_parents=<optimized out>, poll=<optimized out>) at block/io.c:395
> #7  0x55d3a6be5c87 in blk_drain (blk=0x7f0a97abcc00)
>     at block/block-backend.c:1617
> #8  0x55d3a6be686d in blk_unref (blk=0x7f0a97abcc00)
>     at block/block-backend.c:473
> #9  0x55d3a6b9e835 in block_job_free (job=0x7f0a15f44e00) at blockjob.c:89
> #10 0x55d3a6b9fe29 in job_unref (job=0x7f0a15f44e00) at job.c:376
> #11 0x55d3a6b9fe29 in job_unref (job=0x7f0a15f44e00) at job.c:368
> #12 0x55d3a6ba07aa in job_finish_sync (job=job@entry=0x7f0a15f44e00,
>     finish=finish@entry=0x55d3a6ba0cd0 , errp=errp@entry=0x0) at job.c:1004
> #13 0x55d3a6ba0cee in job_cancel_sync (job=job@entry=0x7f0a15f44e00)
>     at job.c:947




Re: [PATCH v2 0/3] Fix some AIO context locking in jobs

2020-03-26 Thread Dietmar Maurer


> I *think* the second patch also fixes the hangs on backup abort that I and
> Dietmar noticed in v1, but I'm not sure, they were somewhat intermittent
> before too.

No, I still get this freeze:

#0  0x7f0aa4866916 in __GI_ppoll (fds=0x7f0a12935c40, nfds=2,
    timeout=<optimized out>, timeout@entry=0x0, sigmask=sigmask@entry=0x0)
    at ../sysdeps/unix/sysv/linux/ppoll.c:39
#1  0x55d3a6c91d29 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>, __fds=<optimized out>)
    at /usr/include/x86_64-linux-gnu/bits/poll2.h:77
#2  0x55d3a6c91d29 in qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, timeout=timeout@entry=-1) at util/qemu-timer.c:335
#3  0x55d3a6c94511 in fdmon_poll_wait (ctx=0x7f0a97505e80,
    ready_list=0x7fff67e5c358, timeout=-1) at util/fdmon-poll.c:79
#4  0x55d3a6c93af7 in aio_poll (ctx=0x7f0a97505e80,
    blocking=blocking@entry=true) at util/aio-posix.c:589
#5  0x55d3a6bf4cd3 in bdrv_do_drained_begin
    (poll=<optimized out>, ignore_bds_parents=false, parent=0x0,
    recursive=false, bs=0x7f0a9754c280) at block/io.c:429
#6  0x55d3a6bf4cd3 in bdrv_do_drained_begin
    (bs=0x7f0a9754c280, recursive=<optimized out>, parent=0x0,
    ignore_bds_parents=<optimized out>, poll=<optimized out>) at block/io.c:395
#7  0x55d3a6be5c87 in blk_drain (blk=0x7f0a97abcc00)
    at block/block-backend.c:1617
#8  0x55d3a6be686d in blk_unref (blk=0x7f0a97abcc00)
    at block/block-backend.c:473
#9  0x55d3a6b9e835 in block_job_free (job=0x7f0a15f44e00) at blockjob.c:89
#10 0x55d3a6b9fe29 in job_unref (job=0x7f0a15f44e00) at job.c:376
#11 0x55d3a6b9fe29 in job_unref (job=0x7f0a15f44e00) at job.c:368
#12 0x55d3a6ba07aa in job_finish_sync (job=job@entry=0x7f0a15f44e00,
    finish=finish@entry=0x55d3a6ba0cd0 , errp=errp@entry=0x0) at job.c:1004
#13 0x55d3a6ba0cee in job_cancel_sync (job=job@entry=0x7f0a15f44e00)
    at job.c:947




Re: [PATCH v2 0/3] Fix some AIO context locking in jobs

2020-03-26 Thread no-reply
Patchew URL: 
https://patchew.org/QEMU/20200326155628.859862-1-s.rei...@proxmox.com/



Hi,

This series failed the asan build test. Please find the testing commands and
their output below. If you have Docker installed, you can probably reproduce it
locally.

=== TEST SCRIPT BEGIN ===
#!/bin/bash
export ARCH=x86_64
make docker-image-fedora V=1 NETWORK=1
time make docker-test-debug@fedora TARGET_LIST=x86_64-softmmu J=14 NETWORK=1
=== TEST SCRIPT END ===

PASS 1 fdc-test /x86_64/fdc/cmos
PASS 2 fdc-test /x86_64/fdc/no_media_on_start
PASS 3 fdc-test /x86_64/fdc/read_without_media
==6158==WARNING: ASan doesn't fully support makecontext/swapcontext functions 
and may produce false positives in some cases!
PASS 4 fdc-test /x86_64/fdc/media_change
PASS 5 fdc-test /x86_64/fdc/sense_interrupt
PASS 6 fdc-test /x86_64/fdc/relative_seek
---
PASS 32 test-opts-visitor /visitor/opts/range/beyond
PASS 33 test-opts-visitor /visitor/opts/dict/unvisited
MALLOC_PERTURB_=${MALLOC_PERTURB_:-$(( ${RANDOM:-0} % 255 + 1))}  
tests/test-coroutine -m=quick -k --tap < /dev/null | ./scripts/tap-driver.pl 
--test-name="test-coroutine" 
==6219==WARNING: ASan doesn't fully support makecontext/swapcontext functions 
and may produce false positives in some cases!
==6219==WARNING: ASan is ignoring requested __asan_handle_no_return: stack top: 
0x7ffcb9e2d000; bottom 0x7fc4e542; size: 0x0037d4a0d000 (239790510080)
False positive error reports may follow
For details see https://github.com/google/sanitizers/issues/189
PASS 1 test-coroutine /basic/no-dangling-access
---
PASS 11 test-aio /aio/event/wait
PASS 12 test-aio /aio/event/flush
PASS 13 test-aio /aio/event/wait/no-flush-cb
==6234==WARNING: ASan doesn't fully support makecontext/swapcontext functions 
and may produce false positives in some cases!
PASS 14 test-aio /aio/timer/schedule
PASS 15 test-aio /aio/coroutine/queue-chaining
PASS 16 test-aio /aio-gsource/flush
---
PASS 12 fdc-test /x86_64/fdc/read_no_dma_19
PASS 13 fdc-test /x86_64/fdc/fuzz-registers
MALLOC_PERTURB_=${MALLOC_PERTURB_:-$(( ${RANDOM:-0} % 255 + 1))}  
QTEST_QEMU_BINARY=x86_64-softmmu/qemu-system-x86_64 QTEST_QEMU_IMG=qemu-img 
tests/qtest/ide-test -m=quick -k --tap < /dev/null | ./scripts/tap-driver.pl 
--test-name="ide-test" 
==6242==WARNING: ASan doesn't fully support makecontext/swapcontext functions 
and may produce false positives in some cases!
PASS 28 test-aio /aio-gsource/timer/schedule
MALLOC_PERTURB_=${MALLOC_PERTURB_:-$(( ${RANDOM:-0} % 255 + 1))}  
tests/test-aio-multithread -m=quick -k --tap < /dev/null | 
./scripts/tap-driver.pl --test-name="test-aio-multithread" 
PASS 1 ide-test /x86_64/ide/identify
==6249==WARNING: ASan doesn't fully support makecontext/swapcontext functions 
and may produce false positives in some cases!
PASS 1 test-aio-multithread /aio/multi/lifecycle
==6251==WARNING: ASan doesn't fully support makecontext/swapcontext functions 
and may produce false positives in some cases!
PASS 2 ide-test /x86_64/ide/flush
PASS 2 test-aio-multithread /aio/multi/schedule
==6268==WARNING: ASan doesn't fully support makecontext/swapcontext functions 
and may produce false positives in some cases!
PASS 3 ide-test /x86_64/ide/bmdma/simple_rw
==6279==WARNING: ASan doesn't fully support makecontext/swapcontext functions 
and may produce false positives in some cases!
PASS 3 test-aio-multithread /aio/multi/mutex/contended
PASS 4 ide-test /x86_64/ide/bmdma/trim
==6290==WARNING: ASan doesn't fully support makecontext/swapcontext functions 
and may produce false positives in some cases!
PASS 4 test-aio-multithread /aio/multi/mutex/handoff
PASS 5 test-aio-multithread /aio/multi/mutex/mcs
PASS 6 test-aio-multithread /aio/multi/mutex/pthread
---
PASS 6 test-throttle /throttle/detach_attach
PASS 7 test-throttle /throttle/config_functions
PASS 8 test-throttle /throttle/accounting
==6307==WARNING: ASan doesn't fully support makecontext/swapcontext functions 
and may produce false positives in some cases!
PASS 9 test-throttle /throttle/groups
PASS 10 test-throttle /throttle/config/enabled
PASS 11 test-throttle /throttle/config/conflicting
---
PASS 14 test-throttle /throttle/config/max
PASS 15 test-throttle /throttle/config/iops_size
MALLOC_PERTURB_=${MALLOC_PERTURB_:-$(( ${RANDOM:-0} % 255 + 1))}  
tests/test-thread-pool -m=quick -k --tap < /dev/null | ./scripts/tap-driver.pl 
--test-name="test-thread-pool" 
==6311==WARNING: ASan doesn't fully support makecontext/swapcontext functions 
and may produce false positives in some cases!
PASS 1 test-thread-pool /thread-pool/submit
PASS 2 test-thread-pool /thread-pool/submit-aio
PASS 3 test-thread-pool /thread-pool/submit-co
PASS 4 test-thread-pool /thread-pool/submit-many
==6378==WARNING: ASan doesn't fully support makecontext/swapcontext functions 
and may produce false positives in some cases!
PASS 5 test-thread-pool /thread-pool/cancel
PASS 6 test-thread-pool /thread-pool/cancel-async
MALLOC_PERTURB_=${MALLOC_PERTURB_:-$(( ${RANDOM:-0} % 255 + 1))}  
tests/test-hbit

Re: [PATCH v2 0/3] Fix some AIO context locking in jobs

2020-03-26 Thread no-reply
Patchew URL: 
https://patchew.org/QEMU/20200326155628.859862-1-s.rei...@proxmox.com/



Hi,

This series failed the docker-quick@centos7 build test. Please find the testing 
commands and
their output below. If you have Docker installed, you can probably reproduce it
locally.

=== TEST SCRIPT BEGIN ===
#!/bin/bash
make docker-image-centos7 V=1 NETWORK=1
time make docker-test-quick@centos7 SHOW_ENV=1 J=14 NETWORK=1
=== TEST SCRIPT END ===

  TEST    check-unit: tests/test-bdrv-graph-mod
  TEST    check-unit: tests/test-blockjob
qemu: qemu_mutex_unlock_impl: Operation not permitted
ERROR - too few tests run (expected 8, got 7)
make: *** [check-unit] Error 1
make: *** Waiting for unfinished jobs
  TEST    iotest-qcow2: 020
  TEST    iotest-qcow2: 021
---
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['sudo', '-n', 'docker', 'run', 
'--label', 'com.qemu.instance.uuid=2724b2af546443c7a60d0f061f1beab7', '-u', 
'1003', '--security-opt', 'seccomp=unconfined', '--rm', '-e', 'TARGET_LIST=', 
'-e', 'EXTRA_CONFIGURE_OPTS=', '-e', 'V=', '-e', 'J=14', '-e', 'DEBUG=', '-e', 
'SHOW_ENV=1', '-e', 'CCACHE_DIR=/var/tmp/ccache', '-v', 
'/home/patchew2/.cache/qemu-docker-ccache:/var/tmp/ccache:z', '-v', 
'/var/tmp/patchew-tester-tmp-7887uhcy/src/docker-src.2020-03-26-12.56.41.28979:/var/tmp/qemu:z,ro',
 'qemu:centos7', '/var/tmp/qemu/run', 'test-quick']' returned non-zero exit 
status 2.
filter=--filter=label=com.qemu.instance.uuid=2724b2af546443c7a60d0f061f1beab7
make[1]: *** [docker-run] Error 1
make[1]: Leaving directory `/var/tmp/patchew-tester-tmp-7887uhcy/src'
make: *** [docker-run-test-quick@centos7] Error 2

real    14m17.758s
user    0m8.171s


The full log is available at
http://patchew.org/logs/20200326155628.859862-1-s.rei...@proxmox.com/testing.docker-quick@centos7/?type=message.
---
Email generated automatically by Patchew [https://patchew.org/].
Please send your feedback to patchew-de...@redhat.com

[PATCH v2 0/3] Fix some AIO context locking in jobs

2020-03-26 Thread Stefan Reiter
Contains three separate but related patches cleaning up and fixing some
issues regarding aio_context_acquire/aio_context_release for jobs. Mostly
affects blockjobs running for devices that have IO threads enabled, AFAICT.

This is based on the discussions here:
https://lists.gnu.org/archive/html/qemu-devel/2020-03/msg07929.html

I *think* the second patch also fixes the hangs on backup abort that I and
Dietmar noticed in v1, but I'm not sure, they were somewhat intermittent
before too.

Changes from v1:
* fixed commit message for patch 1
* added patches 2 and 3

Stefan Reiter (3):
  backup: don't acquire aio_context in backup_clean
  job: take each job's lock individually in job_txn_apply
  replication: acquire aio context before calling job_cancel_sync

 block/backup.c  |  4 
 block/replication.c |  6 +-
 job.c   | 32 
 3 files changed, 29 insertions(+), 13 deletions(-)

-- 
2.26.0