Re: bdrv_drained_begin deadlock with io-threads

2020-04-03 Thread Dietmar Maurer
> On April 3, 2020 10:47 AM Kevin Wolf wrote:
>
> On 03.04.2020 at 10:26, Dietmar Maurer wrote:
> > > With the following patch, it seems to survive for now. I'll give it some
> > > more testing tomorrow (also qemu-iotests to check that I didn't

Re: bdrv_drained_begin deadlock with io-threads

2020-04-03 Thread Dietmar Maurer
> With the following patch, it seems to survive for now. I'll give it some
> more testing tomorrow (also qemu-iotests to check that I didn't
> accidentally break something else.)

Wow, that was fast! Seems your patch fixes the bug!

I wonder what commit introduced that problem, maybe:

Re: bdrv_drained_begin deadlock with io-threads

2020-04-02 Thread Dietmar Maurer
> It does look more like your case because I now have bs.in_flight == 0
> and the BlockBackend of the scsi-hd device has in_flight == 8.

Yes, this looks very familiar.

> Of course, this still doesn't answer why it happens, and I'm not sure if we
> can tell without adding some debug code.

Re: bdrv_drained_begin deadlock with io-threads

2020-04-02 Thread Dietmar Maurer
> Can you reproduce the problem with my script, but pointing it to your
> Debian image and running stress-ng instead of dd?

Yes.

> If so, how long does
> it take to reproduce for you?

I sometimes need up to 130 iterations ... Worse, I thought several times the bug was gone, but then it

Re: bdrv_drained_begin deadlock with io-threads

2020-04-02 Thread Dietmar Maurer
> > Do you also run "stress-ng -d 5" inside the VM?
>
> I'm not using the exact same test case, but something that I thought
> would be similar enough. Specifically, I run the script below, which
> boots from a RHEL 8 CD and in the rescue shell, I'll do 'dd if=/dev/zero
> of=/dev/sda'

This test

Re: bdrv_drained_begin deadlock with io-threads

2020-04-02 Thread Dietmar Maurer
> It seems to fix it, yes. Now I don't get any hangs any more.

I just tested using your configuration, and a recent centos8 image running a dd loop inside it:

# while dd if=/dev/urandom of=testfile.raw bs=1M count=100; do sync; done

With that, I am unable to trigger the bug. Would you mind

Re: bdrv_drained_begin deadlock with io-threads

2020-04-02 Thread Dietmar Maurer
> > But, IMHO the commit is not the reason for (my) bug - it just makes
> > it easier to trigger... I can see (my) bug sometimes with 4.1.1, although
> > I have no easy way to reproduce it reliably.
> >
> > Also, Stefan sent some patches to the list to fix some of the problems.

Re: bdrv_drained_begin deadlock with io-threads

2020-04-01 Thread Dietmar Maurer
> That's a pretty big change, and I'm not sure how it's related to
> completed requests hanging in the thread pool instead of reentering the
> file-posix coroutine. But I also tested it enough that I'm confident
> it's really the first bad commit.
>
> Maybe you want to try if your problem starts

Re: bdrv_drained_begin deadlock with io-threads

2020-04-01 Thread Dietmar Maurer
> > Is really nobody else able to reproduce this (somebody already tried to
> > reproduce)?
>
> I can get hangs, but that's for job_completed(), not for starting the
> job. Also, my hangs have a non-empty bs->tracked_requests, so it looks
> like a different case to me.

Please can you post the

Re: bdrv_drained_begin deadlock with io-threads

2020-04-01 Thread Dietmar Maurer
> On April 1, 2020 5:37 PM Dietmar Maurer wrote:
>
> > > Is really nobody else able to reproduce this (somebody already tried to
> > > reproduce)?
> >
> > I can get hangs, but that's for job_completed(), not for starting the
> > job. Also, m

Re: bdrv_drained_begin deadlock with io-threads

2020-03-31 Thread Dietmar Maurer
> On March 31, 2020 5:37 PM Kevin Wolf wrote:
>
> On 31.03.2020 at 17:24, Dietmar Maurer wrote:
> > > > How can I see/debug those waiting requests?
> > >
> > > Examine the bs->tracked_requests list.

Re: bdrv_drained_begin deadlock with io-threads

2020-03-31 Thread Dietmar Maurer
> > How can I see/debug those waiting requests?
>
> Examine the bs->tracked_requests list.
>
> BdrvTrackedRequest has a "Coroutine *co" field. It's a pointer to the coroutine of
> this request. You may use the qemu-gdb script to print the request's coroutine
> back-trace:

I would, but there are no tracked
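
For readers following along, the kind of gdb session the quoted suggestion describes might look roughly like this. It is only a sketch: it assumes a debug build, that scripts/qemu-gdb.py from the QEMU source tree is loadable, and that bs refers to the BlockDriverState being inspected; exact commands and output differ between versions.

# attach gdb to the hung QEMU process, then load QEMU's gdb helpers
(gdb) source /path/to/qemu/scripts/qemu-gdb.py

# bs->tracked_requests is a QLIST, so lh_first points at the first BdrvTrackedRequest
(gdb) print bs->tracked_requests.lh_first
(gdb) print *bs->tracked_requests.lh_first

# print the backtrace of the coroutine stored in the request's "Coroutine *co" field
(gdb) qemu coroutine bs->tracked_requests.lh_first->co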

Re: bdrv_drained_begin deadlock with io-threads

2020-03-31 Thread Dietmar Maurer
> > After a few iterations the VM freezes inside bdrv_drained_begin():
> >
> > Thread 1 (Thread 0x7fffe9291080 (LWP 30949)):
> > #0  0x75cb3916 in __GI_ppoll (fds=0x7fff63d30c40, nfds=2,
> >     timeout=<optimized out>, timeout@entry=0x0, sigmask=sigmask@entry=0x0) at
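
To make the hang pattern above easier to picture, here is a small self-contained toy model written from the description in this thread, not copied from QEMU. The structure mirrors the poll loop that bdrv_drained_begin() ends in (BDRV_POLL_WHILE/AIO_WAIT_WHILE), which is where the ppoll() frame in the backtrace comes from; all names and numbers below are illustrative.

/* Toy model (not QEMU source) of the polling pattern behind
 * bdrv_drained_begin(): quiesce the node, then poll until its
 * in-flight request counter reaches zero. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

typedef struct ToyNode {
    atomic_int in_flight;    /* requests submitted but not yet completed */
    atomic_bool quiesced;    /* when true, no new requests may be submitted */
} ToyNode;

/* The condition the drained section waits on. */
static bool drain_poll(ToyNode *node)
{
    return atomic_load(&node->in_flight) > 0;
}

static void toy_drained_begin(ToyNode *node)
{
    atomic_store(&node->quiesced, true);

    /* QEMU runs aio_poll()/ppoll() here instead of sleeping, but the
     * structure is the same: loop until nothing is in flight. */
    int spins = 0;
    while (drain_poll(node)) {
        usleep(10 * 1000);
        if (++spins == 100) {   /* give up, for demo purposes only */
            printf("still %d requests in flight -- this is the hang\n",
                   atomic_load(&node->in_flight));
            return;
        }
    }
    printf("drained\n");
}

int main(void)
{
    /* Mirrors the stuck case reported earlier in the thread:
     * in_flight == 8 and nobody is completing the requests. */
    ToyNode node = { .in_flight = 8, .quiesced = false };
    toy_drained_begin(&node);
    return 0;
}

The loop can only terminate if some other party (the iothread's AioContext, the thread-pool completion path) still gets to run and decrement the counter; the reports in this thread are cases where it no longer does.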

Re: [PATCH v2 0/3] Fix some AIO context locking in jobs

2020-03-27 Thread Dietmar Maurer
> I *think* the second patch also fixes the hangs on backup abort that I and
> Dietmar noticed in v1, but I'm not sure, they were somewhat intermittent
> before too.

After more tests, I am 100% sure the bug (or another one) is still there. Here is how to trigger it:

1. use latest qemu sources from

Re: [PATCH v2 0/3] Fix some AIO context locking in jobs

2020-03-27 Thread Dietmar Maurer
Wait - maybe this was a bug in my test setup - I am unable to reproduce it now.

@Stefan Reiter: Are you able to trigger this?

> > I *think* the second patch also fixes the hangs on backup abort that I and
> > Dietmar noticed in v1, but I'm not sure, they were somewhat intermittent
> > before too.

Re: [PATCH v2 0/3] Fix some AIO context locking in jobs

2020-03-27 Thread Dietmar Maurer
But I need to mention that it takes some time to reproduce this. This time I ran/aborted about 500 backup jobs until it triggered.

> > I *think* the second patch also fixes the hangs on backup abort that I and
> > Dietmar noticed in v1, but I'm not sure, they were somewhat intermittent
> > before

Re: [PATCH v2 0/3] Fix some AIO context locking in jobs

2020-03-27 Thread Dietmar Maurer
> I *think* the second patch also fixes the hangs on backup abort that I and
> Dietmar noticed in v1, but I'm not sure, they were somewhat intermittent
> before too.

No, I still get this freeze:

#0  0x7f0aa4866916 in __GI_ppoll (fds=0x7f0a12935c40, nfds=2, timeout=<optimized out>, timeout@entry=0x0,

Re: backup_calculate_cluster_size does not consider source

2019-11-06 Thread Dietmar Maurer
> Let me elaborate: Yes, a cluster size generally means that it is most
> “efficient” to access the storage at that size. But there’s a tradeoff.
> At some point, reading the data takes sufficiently long that reading a
> bit of metadata doesn’t matter anymore (usually, that is).

Any network

Re: backup_calculate_cluster_size does not consider source

2019-11-06 Thread Dietmar Maurer
> And if it issues a smaller request, there is no way for a guest device
> to tell it “OK, here’s your data, but note we have a whole 4 MB chunk
> around it, maybe you’d like to take that as well...?”
>
> I understand wanting to increase the backup buffer size, but I don’t
> quite understand why

Re: backup_calculate_cluster_size does not consider source

2019-11-06 Thread Dietmar Maurer
> On 6 November 2019 14:17 Max Reitz wrote:
>
> > On 06.11.19 14:09, Dietmar Maurer wrote:
> >> Let me elaborate: Yes, a cluster size generally means that it is most
> >> “efficient” to access the storage at that size. But there’s a tradeoff.

Re: backup_calculate_cluster_size does not consider source

2019-11-06 Thread Dietmar Maurer
> The thing is, it just seems unnecessary to me to take the source cluster
> size into account in general. It seems weird that a medium only allows
> 4 MB reads, because, well, guests aren’t going to take that into account.

Maybe it is strange, but it is quite obvious that there is an optimal
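
To make the disagreement concrete, here is a toy sketch of the target-only calculation being criticized in this sub-thread. It is not the actual backup_calculate_cluster_size() from QEMU; the 64 KiB default, the helper names and the example numbers are assumptions for illustration. It simply shows why a source whose storage prefers large reads (the 4 MB chunks mentioned above) never influences the backup cluster size.

/* Toy sketch of the behaviour discussed in this thread: the backup
 * cluster size is derived from the *target* image only, so the source's
 * preferred transfer size (e.g. a 4 MB chunk on a network backend) is
 * ignored.  Names and the 64 KiB default are illustrative, not copied
 * from QEMU. */
#include <stdint.h>
#include <stdio.h>

#define DEFAULT_CLUSTER_SIZE (64 * 1024)

static int64_t max_i64(int64_t a, int64_t b)
{
    return a > b ? a : b;
}

/* Only the target's cluster size is consulted here -- this is the
 * "does not consider source" complaint in the subject line. */
static int64_t calculate_cluster_size(int64_t target_cluster_size)
{
    return max_i64(DEFAULT_CLUSTER_SIZE, target_cluster_size);
}

int main(void)
{
    int64_t source_optimal_read = 4 * 1024 * 1024;  /* e.g. 4 MB source chunks */
    int64_t target_cluster      = 64 * 1024;        /* e.g. a qcow2 target */

    printf("source prefers %lld-byte reads, backup copies %lld-byte clusters\n",
           (long long)source_optimal_read,
           (long long)calculate_cluster_size(target_cluster));
    return 0;
}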