> On April 3, 2020 10:47 AM Kevin Wolf wrote:
>
>
> On 03.04.2020 at 10:26, Dietmar Maurer wrote:
> > > With the following patch, it seems to survive for now. I'll give it some
> > > more testing tomorrow (also qemu-iotests to check that I didn't
>
> With the following patch, it seems to survive for now. I'll give it some
> more testing tomorrow (also qemu-iotests to check that I didn't
> accidentally break something else).
Wow, that was fast! Seems your patch fixes the bug!
I wonder what commit introduced that problem, maybe:
> It does look more like your case because I now have bs.in_flight == 0
> and the BlockBackend of the scsi-hd device has in_flight == 8.
yes, this looks very familiar.
> Of course, this still doesn't answer why it happens, and I'm not sure if we
> can tell without adding some debug code.
>
>
> Can you reproduce the problem with my script, but pointing it to your
> Debian image and running stress-ng instead of dd?
yes
> If so, how long does
> it take to reproduce for you?
I sometimes need up to 130 iterations ...
Worse, I thought several times that the bug was gone, but then it
> > Do you also run "stress-ng -d 5" inside the VM?
>
> I'm not using the exact same test case, but something that I thought
> would be similar enough. Specifically, I run the script below, which
> boots from a RHEL 8 CD and in the rescue shell, I'll do 'dd if=/dev/zero
> of=/dev/sda'
This test
> It seems to fix it, yes. Now I don't get any hangs any more.
I just tested using your configuration, and a recent centos8 image
running a dd loop inside it:
# while dd if=/dev/urandom of=testfile.raw bs=1M count=100; do sync; done
With that, I am unable to trigger the bug.
Would you mind
> > But, IMHO the commit is not the reason for (my) bug - it just makes
> > it easier to trigger... I can see (my) bug sometimes with 4.1.1, although
> > I have no easy way to reproduce it reliably.
> >
> > Also, Stefan sent some patches to the list to fix some of the problems.
> >
> >
> That's a pretty big change, and I'm not sure how it's related to
> completed requests hanging in the thread pool instead of reentering the
> file-posix coroutine. But I also tested it enough that I'm confident
> it's really the first bad commit.
>
> Maybe you want to try if your problem starts
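A "first bad commit" claim like this is usually confirmed with git bisect; a minimal
sketch, where the good/bad revisions and the test script are placeholders rather than
anything taken from this thread:

  $ git bisect start
  $ git bisect bad master            # a revision where the hang reproduces
  $ git bisect good v4.1.0           # a revision believed to be unaffected
  $ git bisect run ./check-hang.sh   # exits non-zero when the hang is detected
  $ git bisect reset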
> > Is really nobody else able to reproduce this (did somebody already try to
> > reproduce it)?
>
> I can get hangs, but that's for job_completed(), not for starting the
> job. Also, my hangs have a non-empty bs->tracked_requests, so it looks
> like a different case to me.
Please can you post the
> On April 1, 2020 5:37 PM Dietmar Maurer wrote:
>
>
> > > Is really nobody else able to reproduce this (did somebody already try to
> > > reproduce it)?
> >
> > I can get hangs, but that's for job_completed(), not for starting the
> > job. Also, m
> On March 31, 2020 5:37 PM Kevin Wolf wrote:
>
>
> On 31.03.2020 at 17:24, Dietmar Maurer wrote:
> >
> > > > How can I see/debug those waiting requests?
> > >
> > > Examine the bs->tracked_requests list.
> > >
> >
> > How can I see/debug those waiting requests?
>
> Examine the bs->tracked_requests list.
>
> BdrvTrackedRequest has a "Coroutine *co" field. It's a pointer to the coroutine
> of this request. You may use the qemu-gdb script to print the request's coroutine
> back-trace:
I would, but there are no tracked
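A minimal sketch of the suggested gdb inspection, assuming QEMU's scripts/qemu-gdb.py
from the source tree and a BlockDriverState pointer "bs" in scope; the coroutine
address in the last command is a placeholder for the value printed by the previous one:

  (gdb) source scripts/qemu-gdb.py
  (gdb) print bs->in_flight
  (gdb) print bs->tracked_requests.lh_first
  (gdb) print bs->tracked_requests.lh_first->co
  (gdb) qemu coroutine 0x55d0c8a1e2f0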
> > After a few iterations the VM freezes inside bdrv_drained_begin():
> >
> > Thread 1 (Thread 0x7fffe9291080 (LWP 30949)):
> > #0 0x75cb3916 in __GI_ppoll (fds=0x7fff63d30c40, nfds=2,
> > timeout=<optimized out>, timeout@entry=0x0, sigmask=sigmask@entry=0x0) at
> >
> I *think* the second patch also fixes the hangs on backup abort that I and
> Dietmar noticed in v1, but I'm not sure, they were somewhat intermittent
> before too.
After more testing, I am 100% sure the bug (or another one) is still there.
Here is how to trigger:
1. use latest qemu sources from
Wait - maybe this was a bug in my test setup - I am unable to reproduce now...
@Stefan Reiter: Are you able to trigger this?
> > I *think* the second patch also fixes the hangs on backup abort that I and
> > Dietmar noticed in v1, but I'm not sure, they were somewhat intermittent
> > before too.
But I need to mention that it takes some time to reproduce this. This time I
ran/aborted about 500 backup jobs until it triggered.
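The start/abort cycle described above can be approximated with upstream QMP commands;
a rough sketch, assuming a QMP socket at /tmp/qmp.sock and a drive named "drive0"
(both placeholders, the actual setup used in this thread is not shown here):

  # start a backup and cancel it right away, 500 times
  for i in $(seq 1 500); do
      echo "iteration $i"
      printf '%s\n' \
          '{"execute":"qmp_capabilities"}' \
          '{"execute":"drive-backup","arguments":{"device":"drive0","sync":"full","target":"/tmp/backup.raw","format":"raw"}}' \
          '{"execute":"block-job-cancel","arguments":{"device":"drive0"}}' \
      | socat - UNIX-CONNECT:/tmp/qmp.sock
      sleep 2
  done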
> > I *think* the second patch also fixes the hangs on backup abort that I and
> > Dietmar noticed in v1, but I'm not sure, they were somewhat intermittent
> > before
> I *think* the second patch also fixes the hangs on backup abort that I and
> Dietmar noticed in v1, but I'm not sure, they were somewhat intermittent
> before too.
No, I still get this freeze:
#0  0x7f0aa4866916 in __GI_ppoll (fds=0x7f0a12935c40, nfds=2,
timeout=<optimized out>, timeout@entry=0x0,
> Let me elaborate: Yes, a cluster size generally means that it is most
> “efficient” to access the storage at that size. But there’s a tradeoff.
> At some point, reading the data takes sufficiently long that reading a
> bit of metadata doesn’t matter anymore (usually, that is).
Any network
> And if it issues a smaller request, there is no way for a guest device
> to tell it “OK, here’s your data, but note we have a whole 4 MB chunk
> around it, maybe you’d like to take that as well...?”
>
> I understand wanting to increase the backup buffer size, but I don’t
> quite understand why
> On 6 November 2019 14:17 Max Reitz wrote:
>
>
> On 06.11.19 14:09, Dietmar Maurer wrote:
> >> Let me elaborate: Yes, a cluster size generally means that it is most
> >> “efficient” to access the storage at that size. But there’s a tradeoff.
> >>
> The thing is, it just seems unnecessary to me to take the source cluster
> size into account in general. It seems weird that a medium only allows
> 4 MB reads, because, well, guests aren’t going to take that into account.
Maybe it is strange, but it is quite obvious that there is an optimal