Re: [pve-devel] pvedaemon hanging because of qga retry
Hi,

I noticed that we already send a guest-ping in PVE::QemuServer::qga_check_running($vmid):

    sub qga_check_running {
        my ($vmid) = @_;

        eval { vm_mon_cmd($vmid, "guest-ping", timeout => 3); };
        if ($@) {
            warn "Qemu Guest Agent is not running - $@";
            return 0;
        }
        return 1;
    }

(already used in vzdump and other places), e.g.:

    if ($self->{vmlist}->{$vmid}->{agent} && $vm_is_running) {
        $agent_running = PVE::QemuServer::qga_check_running($vmid);
    }

    if ($agent_running) {
        eval { PVE::QemuServer::vm_mon_cmd($vmid, "guest-fsfreeze-freeze"); };
        if (my $err = $@) {
            $self->logerr($err);
        }
    }

My problem is that I'm using "qm agent <vmid> <command>", and we don't have this ping in /PVE/API2/Qemu/Agent.pm:

    die "No Qemu Guest Agent\n" if !defined($conf->{agent});
    die "VM $vmid is not running\n" if !PVE::QemuServer::check_running($vmid);

    my $cmd = $param->{command} // $command;
    my $res = PVE::QemuServer::vm_mon_cmd($vmid, "guest-$cmd");

I'll send a patch.

----- Original Message -----
From: "aderumier"
To: "Thomas Lamprecht"
Cc: "pve-devel"
Sent: Tuesday, May 22, 2018 09:59:37
Subject: Re: [pve-devel] pvedaemon hanging because of qga retry

>> But, AFAICT, this isn't your real concern

yes, indeed. it's normal to have a high timeout for fsfreeze (libvirt also does it).

>> you propose to make a "simple" qmp call, be it through the VSERPORT_CHANGE,
>> or a backward compatible ping, where we know that the time needed to answer
>> cannot be that high, as no IO is involved.

exactly!

>> That could be done with a relatively small timeout and if that fails we know
>> that it doesn't make sense to make the fsfreeze call with its - reasonably -
>> high timeout. If I understood correctly?

yes!

----- Original Message -----
From: "Thomas Lamprecht"
To: "pve-devel", "aderumier", "dietmar"
Sent: Tuesday, May 22, 2018 09:56:13
Subject: Re: [pve-devel] pvedaemon hanging because of qga retry

On 5/21/18 3:02 PM, Alexandre DERUMIER wrote:
>>> Seems this patch does not solve the 'high load' problem at all?
>
> I can't reproduce this high load, so I can't say.
For the high fsfreeze timeout my commit message should provide some context:

> commit cfb7a70165199eca25f92272490c863551efcd89
> Author: Thomas Lamprecht
> Date:   Wed Nov 23 11:40:41 2016 +0100
>
>     increase timeout from guest-fsfreeze-freeze
>
>     The qmp command 'guest-fsfreeze-freeze' issues in Linux a FIFREEZE
>     ioctl call on all mounted guest FS.
>     This ioctl call locks the filesystem and gets it into a consistent
>     state. For this, all caches must be synced after blocking new writes
>     to the FS, which may need a relatively long time, especially under
>     high IO load on the backing storage.
>
>     In Windows a VSS (Volume Shadow Copy Service) request_freeze will be
>     issued. Because of the closed Windows nature the exact mechanisms
>     cannot be checked, but some Microsoft blog posts and other forum
>     posts suggest that it should return fast, but certain workloads can
>     still trigger a long delay resulting in similar problems.
>
>     Thus try to minimize the error probability and increase the timeout
>     significantly.
>     We use 60 minutes as timeout, as this seems a limit which should not
>     get trespassed in a somewhat healthy system.
>
>     See: https://forum.proxmox.com/threads/22192/
>
>     See the 'freeze_super' and 'thaw_super' functions in fs/super.c from
>     the Linux kernel tree for more details on the freeze behavior in
>     Linux guests.

> My main concern is to not wait for a down daemon (which will never
> respond).
>
> If we can be sure that daemon is running, with high load, simply wait for a
> response with a longer timeout.

But, AFAICT, this isn't your real concern: you propose to make a "simple" qmp
call, be it through the VSERPORT_CHANGE event or a backward compatible ping,
where we know that the time needed to answer cannot be that high, as no IO is
involved. That could be done with a relatively small timeout, and if that
fails we know that it doesn't make sense to make the fsfreeze call with its
- reasonably - high timeout. If I understood correctly?
----- Original Message -----
> From: "dietmar"
> To: "aderumier"
> Cc: "pve-devel"
> Sent: Monday, May 21, 2018 09:56:03
> Subject: Re: [pve-devel] pvedaemon hanging because of qga retry
>
>> I have loo
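The patch mentioned at the top of this message could look roughly like this - a hypothetical sketch, not the actual submitted patch: reuse the existing PVE::QemuServer::qga_check_running() ping (with its short 3s timeout) in PVE/API2/Qemu/Agent.pm before issuing the real command, so a dead agent fails fast instead of running into the long per-command timeout.

```perl
# Hypothetical sketch for PVE/API2/Qemu/Agent.pm; the die message wording
# and placement are assumptions.
die "No Qemu Guest Agent\n" if !defined($conf->{agent});
die "VM $vmid is not running\n" if !PVE::QemuServer::check_running($vmid);

# fail fast (3s ping timeout inside qga_check_running) if the agent
# daemon inside the guest is not responding
die "Qemu Guest Agent is not running\n"
    if !PVE::QemuServer::qga_check_running($vmid);

my $cmd = $param->{command} // $command;
my $res = PVE::QemuServer::vm_mon_cmd($vmid, "guest-$cmd");
```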
Re: [pve-devel] pvedaemon hanging because of qga retry
>> But, AFAICT, this isn't your real concern

yes, indeed. it's normal to have a high timeout for fsfreeze (libvirt also does it).

>> you propose to make a "simple" qmp call, be it through the VSERPORT_CHANGE,
>> or a backward compatible ping, where we know that the time needed to answer
>> cannot be that high, as no IO is involved.

exactly!

>> That could be done with a relatively small timeout and if that fails we know
>> that it doesn't make sense to make the fsfreeze call with its - reasonably -
>> high timeout. If I understood correctly?

yes!

----- Original Message -----
From: "Thomas Lamprecht"
To: "pve-devel", "aderumier", "dietmar"
Sent: Tuesday, May 22, 2018 09:56:13
Subject: Re: [pve-devel] pvedaemon hanging because of qga retry

On 5/21/18 3:02 PM, Alexandre DERUMIER wrote:
>>> Seems this patch does not solve the 'high load' problem at all?
>
> I can't reproduce this high load, so I can't say.

For the high fsfreeze timeout my commit message should provide some context:

> commit cfb7a70165199eca25f92272490c863551efcd89
> Author: Thomas Lamprecht
> Date:   Wed Nov 23 11:40:41 2016 +0100
>
>     increase timeout from guest-fsfreeze-freeze
>
>     The qmp command 'guest-fsfreeze-freeze' issues in Linux a FIFREEZE
>     ioctl call on all mounted guest FS.
>     This ioctl call locks the filesystem and gets it into a consistent
>     state. For this, all caches must be synced after blocking new writes
>     to the FS, which may need a relatively long time, especially under
>     high IO load on the backing storage.
>
>     In Windows a VSS (Volume Shadow Copy Service) request_freeze will be
>     issued. Because of the closed Windows nature the exact mechanisms
>     cannot be checked, but some Microsoft blog posts and other forum
>     posts suggest that it should return fast, but certain workloads can
>     still trigger a long delay resulting in similar problems.
>
>     Thus try to minimize the error probability and increase the timeout
>     significantly.
>     We use 60 minutes as timeout, as this seems a limit which should not
>     get trespassed in a somewhat healthy system.
>
>     See: https://forum.proxmox.com/threads/22192/
>
>     See the 'freeze_super' and 'thaw_super' functions in fs/super.c from
>     the Linux kernel tree for more details on the freeze behavior in
>     Linux guests.

> My main concern is to not wait for a down daemon (which will never
> respond).
>
> If we can be sure that daemon is running, with high load, simply wait for a
> response with a longer timeout.

But, AFAICT, this isn't your real concern: you propose to make a "simple" qmp
call, be it through the VSERPORT_CHANGE event or a backward compatible ping,
where we know that the time needed to answer cannot be that high, as no IO is
involved. That could be done with a relatively small timeout, and if that
fails we know that it doesn't make sense to make the fsfreeze call with its
- reasonably - high timeout. If I understood correctly?

----- Original Message -----
> From: "dietmar"
> To: "aderumier"
> Cc: "pve-devel"
> Sent: Monday, May 21, 2018 09:56:03
> Subject: Re: [pve-devel] pvedaemon hanging because of qga retry
>
>> I have looked at libvirt/ovirt.
>>
>> It seems that it's possible to detect if the agent is connected, through
>> the qmp event VSERPORT_CHANGE.
>>
>> https://git.qemu.org/?p=qemu.git;a=commit;h=e2ae6159
>> https://git.qemu.org/?p=qemu.git;a=blobdiff;f=docs/qmp/qmp-events.txt;h=d759d197486a3edf3b629fb11e9922ad92fb041a;hp=9d7439e3073ac63b639ce282c7466933ccb411b4;hb=032baddea36330384b3654fcbfafa74cc815471c;hpb=db52658b38fea4e54c23c9cfbced9478d368aa84
>
> Seems this patch does not solve the 'high load' problem at all?

___
pve-devel mailing list
pve-devel@pve.proxmox.com
https://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
Re: [pve-devel] pvedaemon hanging because of qga retry
On 5/21/18 3:02 PM, Alexandre DERUMIER wrote:
>>> Seems this patch does not solve the 'high load' problem at all?
>
> I can't reproduce this high load, so I can't say.

For the high fsfreeze timeout my commit message should provide some context:

> commit cfb7a70165199eca25f92272490c863551efcd89
> Author: Thomas Lamprecht
> Date:   Wed Nov 23 11:40:41 2016 +0100
>
>     increase timeout from guest-fsfreeze-freeze
>
>     The qmp command 'guest-fsfreeze-freeze' issues in Linux a FIFREEZE
>     ioctl call on all mounted guest FS.
>     This ioctl call locks the filesystem and gets it into a consistent
>     state. For this, all caches must be synced after blocking new writes
>     to the FS, which may need a relatively long time, especially under
>     high IO load on the backing storage.
>
>     In Windows a VSS (Volume Shadow Copy Service) request_freeze will be
>     issued. Because of the closed Windows nature the exact mechanisms
>     cannot be checked, but some Microsoft blog posts and other forum
>     posts suggest that it should return fast, but certain workloads can
>     still trigger a long delay resulting in similar problems.
>
>     Thus try to minimize the error probability and increase the timeout
>     significantly.
>     We use 60 minutes as timeout, as this seems a limit which should not
>     get trespassed in a somewhat healthy system.
>
>     See: https://forum.proxmox.com/threads/22192/
>
>     See the 'freeze_super' and 'thaw_super' functions in fs/super.c from
>     the Linux kernel tree for more details on the freeze behavior in
>     Linux guests.

> My main concern is to not wait for a down daemon (which will never
> respond).
>
> If we can be sure that daemon is running, with high load, simply wait for a
> response with a longer timeout.

But, AFAICT, this isn't your real concern: you propose to make a "simple" qmp
call, be it through the VSERPORT_CHANGE event or a backward compatible ping,
where we know that the time needed to answer cannot be that high, as no IO is
involved.
That could be done with a relatively small timeout, and if that fails we know
that it doesn't make sense to make the fsfreeze call with its - reasonably -
high timeout. If I understood correctly?

----- Original Message -----
> From: "dietmar"
> To: "aderumier"
> Cc: "pve-devel"
> Sent: Monday, May 21, 2018 09:56:03
> Subject: Re: [pve-devel] pvedaemon hanging because of qga retry
>
>> I have looked at libvirt/ovirt.
>>
>> It seems that it's possible to detect if the agent is connected, through
>> the qmp event VSERPORT_CHANGE.
>>
>> https://git.qemu.org/?p=qemu.git;a=commit;h=e2ae6159
>> https://git.qemu.org/?p=qemu.git;a=blobdiff;f=docs/qmp/qmp-events.txt;h=d759d197486a3edf3b629fb11e9922ad92fb041a;hp=9d7439e3073ac63b639ce282c7466933ccb411b4;hb=032baddea36330384b3654fcbfafa74cc815471c;hpb=db52658b38fea4e54c23c9cfbced9478d368aa84
>
> Seems this patch does not solve the 'high load' problem at all?
Re: [pve-devel] pvedaemon hanging because of qga retry
> On May 17, 2018 at 11:16 PM Alexandre DERUMIER wrote:
>
> Hi,
> I had a strange behaviour today,
>
> with a vm running + qga enabled, but qga service down in the vm
>
> after these attempts,
>
> May 17 21:54:01 kvm14 pvedaemon[20088]: VM 745 qmp command failed - VM 745
> qmp command 'guest-fsfreeze-thaw' failed - unable to connect to VM 745 qga
> socket - timeout after 101 retries
> May 17 21:55:10 kvm14 pvedaemon[20088]: VM 745 qmp command failed - VM 745
> qmp command 'guest-fsfreeze-thaw' failed - unable to connect to VM 745 qga
> socket - timeout after 101 retries

I am now trying to reproduce the problem. But AFAIK pvedaemon never calls
'guest-fsfreeze-thaw' directly, so how did you produce those logs exactly?
Re: [pve-devel] pvedaemon hanging because of qga retry
> On May 21, 2018 at 3:02 PM Alexandre DERUMIER wrote:
>
>>> Seems this patch does not solve the 'high load' problem at all?
>
> I can't reproduce this high load, so I can't say.

It's a holiday today in Austria. But I will ask our support team tomorrow how
to reproduce this ...
Re: [pve-devel] pvedaemon hanging because of qga retry
>> Seems this patch does not solve the 'high load' problem at all?

I can't reproduce this high load, so I can't say.

My main concern is to not wait for a down daemon (which will never respond).

If we can be sure that daemon is running, with high load, simply wait for a
response with a longer timeout.

----- Original Message -----
From: "dietmar"
To: "aderumier"
Cc: "pve-devel"
Sent: Monday, May 21, 2018 09:56:03
Subject: Re: [pve-devel] pvedaemon hanging because of qga retry

> I have looked at libvirt/ovirt.
>
> It seems that it's possible to detect if the agent is connected, through
> the qmp event VSERPORT_CHANGE.
>
> https://git.qemu.org/?p=qemu.git;a=commit;h=e2ae6159
> https://git.qemu.org/?p=qemu.git;a=blobdiff;f=docs/qmp/qmp-events.txt;h=d759d197486a3edf3b629fb11e9922ad92fb041a;hp=9d7439e3073ac63b639ce282c7466933ccb411b4;hb=032baddea36330384b3654fcbfafa74cc815471c;hpb=db52658b38fea4e54c23c9cfbced9478d368aa84
>
Seems this patch does not solve the 'high load' problem at all?
Re: [pve-devel] pvedaemon hanging because of qga retry
> I have looked at libvirt/ovirt.
>
> It seems that it's possible to detect if the agent is connected, through
> the qmp event VSERPORT_CHANGE.
>
> https://git.qemu.org/?p=qemu.git;a=commit;h=e2ae6159
> https://git.qemu.org/?p=qemu.git;a=blobdiff;f=docs/qmp/qmp-events.txt;h=d759d197486a3edf3b629fb11e9922ad92fb041a;hp=9d7439e3073ac63b639ce282c7466933ccb411b4;hb=032baddea36330384b3654fcbfafa74cc815471c;hpb=db52658b38fea4e54c23c9cfbced9478d368aa84

Seems this patch does not solve the 'high load' problem at all?
Re: [pve-devel] pvedaemon hanging because of qga retry
> I think we shouldn't try to send after that the other command with the 1 hour
> timeout...

>> Sure, once we run into a timeout we should not send the second command.

I have looked at libvirt/ovirt.

It seems that it's possible to detect if the agent is connected, through the
qmp event VSERPORT_CHANGE:

https://git.qemu.org/?p=qemu.git;a=commit;h=e2ae6159
https://git.qemu.org/?p=qemu.git;a=blobdiff;f=docs/qmp/qmp-events.txt;h=d759d197486a3edf3b629fb11e9922ad92fb041a;hp=9d7439e3073ac63b639ce282c7466933ccb411b4;hb=032baddea36330384b3654fcbfafa74cc815471c;hpb=db52658b38fea4e54c23c9cfbced9478d368aa84

Previously, they sent a guest-ping with a short timeout before each command,
and since this commit

https://www.redhat.com/archives/libvir-list/2014-November/msg00708.html

they catch the VSERPORT_CHANGE event to know whether the daemon is connected
to the serial device or not.

We don't catch events currently, but maybe we could patch qemu to store the
status somewhere instead of only emitting the event, and retrieve it with a
qmp command before calling qga?

----- Original Message -----
From: "dietmar"
To: "aderumier"
Cc: "pve-devel"
Sent: Sunday, May 20, 2018 15:51:37
Subject: Re: [pve-devel] pvedaemon hanging because of qga retry

> if the guest is so loaded, then it can't even send a response to guest-ping
> (with a "short" timeout of some seconds, not ms!),

AFAIK this is quite common ... (unfortunately). We had several support cases
in the past (several seconds delay), but I do not remember how to reproduce.
I will ask the support team next week ...

> I think we shouldn't try to send after that the other command with the
> 1 hour timeout...

Sure, once we run into a timeout we should not send the second command.
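Patching qemu might not even be needed: newer QEMU versions already report the guest-side connection state of a character device through the 'query-chardev' QMP command ('frontend-open' field). A hedged sketch of checking that state before issuing a qga command; the helper name and the 'qga' label match are assumptions:

```perl
# Hypothetical helper: returns 1 if the guest agent's virtio-serial port is
# open on the guest side, using QMP 'query-chardev' ('frontend-open' flag,
# available in newer QEMU versions). Assumes the agent chardev label
# contains 'qga'.
sub qga_port_open {
    my ($vmid) = @_;

    my $chardevs = eval { PVE::QemuServer::vm_mon_cmd($vmid, 'query-chardev') };
    return 0 if $@ || ref($chardevs) ne 'ARRAY';

    for my $dev (@$chardevs) {
        next if $dev->{label} !~ /qga/;
        return $dev->{'frontend-open'} ? 1 : 0;
    }
    return 0;  # no agent chardev found
}
```

Unlike guest-ping, this asks qemu itself rather than the guest, so it answers quickly regardless of guest load.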
Re: [pve-devel] pvedaemon hanging because of qga retry
> if the guest is so loaded, then it can't even send a response to guest-ping
> (with a "short" timeout of some seconds, not ms!),

AFAIK this is quite common ... (unfortunately). We had several support cases
in the past (several seconds delay), but I do not remember how to reproduce.
I will ask the support team next week ...

> I think we shouldn't try to send after that the other command with the
> 1 hour timeout...

Sure, once we run into a timeout we should not send the second command.
Re: [pve-devel] pvedaemon hanging because of qga retry
>> I think that will not work. I already tried to explain why in my previous
>> post: The problem is that there is no way to decide if the qga agent is
>> running or not. You will simply run into the 'short' timeout as soon as
>> there is some load on the server.

What do you mean by "some load"? "totally unresponsive"?

I have tried with a cpu benchmark simulator, with crazy load, and the guest
agent is still responding.

If the guest is so loaded that it can't even send a response to guest-ping
(with a "short" timeout of some seconds, not ms!), I think we shouldn't try
to send after that the other command with the 1 hour timeout...

----- Original Message -----
From: "dietmar"
To: "aderumier"
Cc: "pve-devel"
Sent: Sunday, May 20, 2018 08:16:25
Subject: Re: [pve-devel] pvedaemon hanging because of qga retry

> On May 20, 2018 at 3:22 AM Alexandre DERUMIER wrote:
>
> I have noticed something when the agent daemon is down:
>
> #qm agent 124 ping
> VM 124 qmp command 'guest-ping' failed - got timeout
> #qm agent 124 ping
> VM 124 qmp command 'guest-ping' failed - got timeout
> #qm agent 124 ping
> VM 124 qmp command 'guest-ping' failed - got timeout
> #qm agent 124 ping
> VM 124 qmp command 'guest-ping' failed - unable to connect to VM 124 qga
> socket - timeout after 11 retries
>
> It seems that after 3 requests, we can't connect anymore to the socket.
> (I'm seeing the same thing with socat directly on the qga socket)
>
> What I would like to have, to avoid a big timeout (mainly for fsfreeze,
> this is the biggest with 1 hour), is to send first a guest-ping or maybe
> better guest-info, with a short timeout.
> If it's successful, then send the other query.

I think that will not work. I already tried to explain why in my previous
post: The problem is that there is no way to decide if the qga agent is
running or not. You will simply run into the 'short' timeout as soon as
there is some load on the server.
Re: [pve-devel] pvedaemon hanging because of qga retry
> On May 20, 2018 at 3:22 AM Alexandre DERUMIER wrote:
>
> I have noticed something when the agent daemon is down:
>
> #qm agent 124 ping
> VM 124 qmp command 'guest-ping' failed - got timeout
> #qm agent 124 ping
> VM 124 qmp command 'guest-ping' failed - got timeout
> #qm agent 124 ping
> VM 124 qmp command 'guest-ping' failed - got timeout
> #qm agent 124 ping
> VM 124 qmp command 'guest-ping' failed - unable to connect to VM 124 qga
> socket - timeout after 11 retries
>
> It seems that after 3 requests, we can't connect anymore to the socket.
> (I'm seeing the same thing with socat directly on the qga socket)
>
> What I would like to have, to avoid a big timeout (mainly for fsfreeze,
> this is the biggest with 1 hour), is to send first a guest-ping or maybe
> better guest-info, with a short timeout.
> If it's successful, then send the other query.

I think that will not work. I already tried to explain why in my previous
post: The problem is that there is no way to decide if the qga agent is
running or not. You will simply run into the 'short' timeout as soon as
there is some load on the server.
Re: [pve-devel] pvedaemon hanging because of qga retry
I have noticed something when the agent daemon is down:

    #qm agent 124 ping
    VM 124 qmp command 'guest-ping' failed - got timeout
    #qm agent 124 ping
    VM 124 qmp command 'guest-ping' failed - got timeout
    #qm agent 124 ping
    VM 124 qmp command 'guest-ping' failed - got timeout
    #qm agent 124 ping
    VM 124 qmp command 'guest-ping' failed - unable to connect to VM 124 qga socket - timeout after 11 retries

It seems that after 3 requests, we can't connect anymore to the socket.
(I'm seeing the same thing with socat directly on the qga socket)

What I would like to have, to avoid a big timeout (mainly for fsfreeze, this
is the biggest with 1 hour), is to send first a guest-ping or maybe better
guest-info, with a short timeout. If it's successful, then send the other
query.

Something like this, for example in vzdump:

    if ($agent_running) {
        eval { PVE::QemuServer::vm_mon_cmd($vmid, "guest-fsfreeze-freeze"); };
        if (my $err = $@) {
            $self->logerr($err);
        }
    }

--->

    if ($agent_running) {
        my $res;
        eval { $res = PVE::QemuServer::vm_mon_cmd($vmid, "guest-info"); };
        if (my $err = $@) {
            $self->logerr($err);
        } elsif (grep { $_->{name} eq 'guest-fsfreeze-freeze' } @{$res->{supported_commands}}) {
            eval { PVE::QemuServer::vm_mon_cmd($vmid, "guest-fsfreeze-freeze"); };
            if (my $err = $@) {
                $self->logerr($err);
            }
        }
    }

Like this, I think we could test if the command exists with guest-info, with
a short timeout, and afterwards send the command with the bigger timeout.
(+ benefit of logging an error if the command doesn't exist)

Maybe create a specific sub to do it for any qga command.

What do you think about this?

----- Original Message -----
From: "dietmar"
To: "aderumier"
Cc: "pve-devel"
Sent: Friday, May 18, 2018 19:03:19
Subject: Re: [pve-devel] pvedaemon hanging because of qga retry

>>> If you simply skip commands like 'guest-fsfreeze-thaw'
>>> your VM will get totally unusable (frozen). So I am not
>>> sure what you want to suggest?
>
> I'm not sure, but don't we have 2 timeouts here?
>
> 1 for connect, and 1 for command execution?

what for?
> I would like to be able to fast timeout on connect, as if the qga agent is
> not running, it can't connect.
> And if qga is running, keep the long execution timeout, as it seems to be
> needed by fsfreeze-fs.

The problem is that there is no way to decide if the qga agent is running or
not. You will simply run into the 'short' timeout as soon as there is some
load on the server. AFAIK many users complained about that.
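The "specific sub" proposed above could be sketched like this - a hypothetical helper, with assumed name and probe timeout, not code from the actual patch. It relies on guest-info returning a 'supported_commands' list of objects with a 'name' field, as defined by the qemu-guest-agent protocol:

```perl
# Hypothetical helper: probe the agent via 'guest-info' with a short
# timeout, verify the target command is supported, and only then run it
# with its (possibly very long) per-command timeout.
sub qga_cmd_checked {
    my ($vmid, $cmd, %opts) = @_;

    # short-timeout probe; dies quickly if the agent is down
    my $info = PVE::QemuServer::vm_mon_cmd($vmid, 'guest-info', timeout => 3);

    die "guest agent command 'guest-$cmd' not supported\n"
        if !grep { $_->{name} eq "guest-$cmd" } @{$info->{supported_commands} // []};

    return PVE::QemuServer::vm_mon_cmd($vmid, "guest-$cmd", %opts);
}
```

Usage would then be, e.g., `qga_cmd_checked($vmid, 'fsfreeze-freeze', timeout => 3600);`.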
Re: [pve-devel] pvedaemon hanging because of qga retry
>>> If you simply skip commands like 'guest-fsfreeze-thaw'
>>> your VM will get totally unusable (frozen). So I am not
>>> sure what you want to suggest?
>
> I'm not sure, but don't we have 2 timeouts here?
>
> 1 for connect, and 1 for command execution?

what for?

> I would like to be able to fast timeout on connect, as if the qga agent is
> not running, it can't connect.
> And if qga is running, keep the long execution timeout, as it seems to be
> needed by fsfreeze-fs.

The problem is that there is no way to decide if the qga agent is running or
not. You will simply run into the 'short' timeout as soon as there is some
load on the server. AFAIK many users complained about that.
Re: [pve-devel] pvedaemon hanging because of qga retry
>> If you simply skip commands like 'guest-fsfreeze-thaw'
>> your VM will get totally unusable (frozen). So I am not
>> sure what you want to suggest?

I'm not sure, but don't we have 2 timeouts here?

1 for connect, and 1 for command execution?

I would like to be able to fast timeout on connect, as if the qga agent is
not running, it can't connect.
And if qga is running, keep the long execution timeout, as it seems to be
needed by fsfreeze-fs.

>> A correct fix would be to implement an async command queue inside qemu...

Oh, I thought it was already async.

----- Original Message -----
From: "dietmar"
To: "aderumier", "pve-devel"
Sent: Friday, May 18, 2018 07:27:54
Subject: Re: [pve-devel] pvedaemon hanging because of qga retry

If you simply skip commands like 'guest-fsfreeze-thaw' your VM will get
totally unusable (frozen). So I am not sure what you want to suggest?

A correct fix would be to implement an async command queue inside qemu...

> On May 18, 2018 at 7:13 AM Alexandre DERUMIER wrote:
>
> Seems to have been introduced a long time ago, in 2012:
>
> https://git.proxmox.com/?p=qemu-server.git;a=blobdiff;f=PVE/QMPClient.pm;h=9829986ae77e82d340974e4d4128741ef85b4a0e;hp=d026f4d4c3012203d96660a311b1890e84e6aa18;hb=6d04217600f2145ee80d5d62231b8ade34f2e5ff;hpb=037a97463447b06ebf79a7f1d40c596d9955acee
>
> Previously, the connect timeout was 1s.
>
> I think we didn't have qga support at this time. Not sure why it has been
> increased for qmp commands?
>
> (with 1s, it's working fine if the qga agent is down.)
>
> ----- Original Message -----
> From: "aderumier"
> To: "pve-devel"
> Sent: Friday, May 18, 2018 00:37:30
> Subject: Re: [pve-devel] pvedaemon hanging because of qga retry
>
> in qmpclient: open_connection
>
>     for (;;) {
>         $count++;
>         $fh = IO::Socket::UNIX->new(Peer => $sname, Blocking => 0, Timeout => 1);
>         last if $fh;
>         if ($! != EINTR && $! != EAGAIN) {
>             die "unable to connect to VM $vmid $sotype socket - $!\n";
>         }
>         my $elapsed = tv_interval($starttime, [gettimeofday]);
>         if ($elapsed >= $timeout) {
>             die "unable to connect to VM $vmid $sotype socket - timeout after $count retries\n";
>         }
>         usleep(10);
>     }
>
> We use $elapsed >= $timeout.
>
> Isn't this timeout for command execution time and not connect time?
>
> I'm seeing at the end:
>
>     $self->{mux}->set_timeout($fh, $timeout);
>
> which seems to be the command execution time in the muxer.
>
> ----- Original Message -----
> From: "Alexandre Derumier"
> To: "pve-devel"
> Sent: Thursday, May 17, 2018 23:16:36
> Subject: [pve-devel] pvedaemon hanging because of qga retry
>
> Hi,
> I had a strange behaviour today,
>
> with a vm running + qga enabled, but qga service down in the vm
>
> after these attempts,
>
> May 17 21:54:01 kvm14 pvedaemon[20088]: VM 745 qmp command failed - VM 745 qmp command 'guest-fsfreeze-thaw' failed - unable to connect to VM 745 qga socket - timeout after 101 retries
> May 17 21:55:10 kvm14 pvedaemon[20088]: VM 745 qmp command failed - VM 745 qmp command 'guest-fsfreeze-thaw' failed - unable to connect to VM 745 qga socket - timeout after 101 retries
>
> some api requests give 596 errors, mainly for the 745 vm
> (/api2/json/nodes/kvm14/qemu/745/status/current),
> but also for the server kvm14 on /api2/json/nodes/kvm14/qemu
>
> restarting the pvedaemon fixed the problem
>
> 10.59.100.141 - root@pam [17/05/2018:21:53:51 +0200] "POST /api2/json/nodes/kvm14/qemu/745/agent/fsfreeze-freeze HTTP/1.1" 596 -
> 10.59.100.141 - root@pam [17/05/2018:21:55:00 +0200] "POST /api2/json/nodes/kvm14/qemu/745/agent/fsfreeze-freeze HTTP/1.1" 596 -
> 10.59.100.141 - root@pam [17/05/2018:22:01:28 +0200] "POST /api2/json/nodes/kvm14/qemu/745/agent/fsfreeze-freeze HTTP/1.1" 596 -
> 10.3.99.10 - root@pam [17/05/2018:22:01:30 +0200] "GET /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 -
> 10.59.100.141 - root@pam [17/05/2018:22:02:21 +0200] "GET /api2/json/nodes/kvm14/qemu HTTP/1.1" 596 -
> 10.3.99.10 - root@pam [17/05/2018:22:03:05 +0200] "GET /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 -
> 10.59.100.141 - root@pam [17/05/2018:22:03:32 +0200] "GET /api2/json/nodes/kvm14/qemu HTTP/1.1" 596 -
> 10.3.99.10 - root@pam [17/05/2018:22:04:40 +0200] "GET /api2/json/nodes/kvm14/qemu/745/status/
Re: [pve-devel] pvedaemon hanging because of qga retry
If you simply skip commands like 'guest-fsfreeze-thaw' your VM will get
totally unusable (frozen). So I am not sure what you want to suggest?

A correct fix would be to implement an async command queue inside qemu...

> On May 18, 2018 at 7:13 AM Alexandre DERUMIER wrote:
>
> Seems to have been introduced a long time ago, in 2012:
>
> https://git.proxmox.com/?p=qemu-server.git;a=blobdiff;f=PVE/QMPClient.pm;h=9829986ae77e82d340974e4d4128741ef85b4a0e;hp=d026f4d4c3012203d96660a311b1890e84e6aa18;hb=6d04217600f2145ee80d5d62231b8ade34f2e5ff;hpb=037a97463447b06ebf79a7f1d40c596d9955acee
>
> Previously, the connect timeout was 1s.
>
> I think we didn't have qga support at this time. Not sure why it has been
> increased for qmp commands?
>
> (with 1s, it's working fine if the qga agent is down.)
>
> ----- Original Message -----
> From: "aderumier"
> To: "pve-devel"
> Sent: Friday, May 18, 2018 00:37:30
> Subject: Re: [pve-devel] pvedaemon hanging because of qga retry
>
> in qmpclient: open_connection
>
>     for (;;) {
>         $count++;
>         $fh = IO::Socket::UNIX->new(Peer => $sname, Blocking => 0, Timeout => 1);
>         last if $fh;
>         if ($! != EINTR && $! != EAGAIN) {
>             die "unable to connect to VM $vmid $sotype socket - $!\n";
>         }
>         my $elapsed = tv_interval($starttime, [gettimeofday]);
>         if ($elapsed >= $timeout) {
>             die "unable to connect to VM $vmid $sotype socket - timeout after $count retries\n";
>         }
>         usleep(10);
>     }
>
> We use $elapsed >= $timeout.
>
> Isn't this timeout for command execution time and not connect time?
>
> I'm seeing at the end:
>
>     $self->{mux}->set_timeout($fh, $timeout);
>
> which seems to be the command execution time in the muxer.
>
> ----- Original Message -----
> From: "Alexandre Derumier"
> To: "pve-devel"
> Sent: Thursday, May 17, 2018 23:16:36
> Subject: [pve-devel] pvedaemon hanging because of qga retry
>
> Hi,
> I had a strange behaviour today,
>
> with a vm running + qga enabled, but qga service down in the vm
>
> after these attempts,
>
> May 17 21:54:01 kvm14 pvedaemon[20088]: VM 745 qmp command failed - VM 745 qmp command 'guest-fsfreeze-thaw' failed - unable to connect to VM 745 qga socket - timeout after 101 retries
> May 17 21:55:10 kvm14 pvedaemon[20088]: VM 745 qmp command failed - VM 745 qmp command 'guest-fsfreeze-thaw' failed - unable to connect to VM 745 qga socket - timeout after 101 retries
>
> some api requests give 596 errors, mainly for the 745 vm
> (/api2/json/nodes/kvm14/qemu/745/status/current),
> but also for the server kvm14 on /api2/json/nodes/kvm14/qemu
>
> restarting the pvedaemon fixed the problem
>
> 10.59.100.141 - root@pam [17/05/2018:21:53:51 +0200] "POST /api2/json/nodes/kvm14/qemu/745/agent/fsfreeze-freeze HTTP/1.1" 596 -
> 10.59.100.141 - root@pam [17/05/2018:21:55:00 +0200] "POST /api2/json/nodes/kvm14/qemu/745/agent/fsfreeze-freeze HTTP/1.1" 596 -
> 10.59.100.141 - root@pam [17/05/2018:22:01:28 +0200] "POST /api2/json/nodes/kvm14/qemu/745/agent/fsfreeze-freeze HTTP/1.1" 596 -
> 10.3.99.10 - root@pam [17/05/2018:22:01:30 +0200] "GET /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 -
> 10.59.100.141 - root@pam [17/05/2018:22:02:21 +0200] "GET /api2/json/nodes/kvm14/qemu HTTP/1.1" 596 -
> 10.3.99.10 - root@pam [17/05/2018:22:03:05 +0200] "GET /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 -
> 10.59.100.141 - root@pam [17/05/2018:22:03:32 +0200] "GET /api2/json/nodes/kvm14/qemu HTTP/1.1" 596 -
> 10.3.99.10 - root@pam [17/05/2018:22:04:40 +0200] "GET /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 -
> 10.59.100.141 - root@pam [17/05/2018:22:05:01 +0200] "GET /api2/json/nodes/kvm14/qemu HTTP/1.1" 596 -
> 10.59.100.141 - root@pam [17/05/2018:22:05:59 +0200] "GET /api2/json/nodes/kvm14/qemu HTTP/1.1" 596 -
> 10.3.99.10 - root@pam [17/05/2018:22:06:15 +0200] "GET /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 -
> 10.3.99.10 - root@pam [17/05/2018:22:07:50 +0200] "GET /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 -
> 10.3.99.10 - root@pam [17/05/2018:22:09:25 +0200] "GET /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 -
> 10.3.99.10 - root@pam [17/05/2018:22:11:00 +0200] "GET /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 -
> 10.3.99.10 - root@pam [17/0
Re: [pve-devel] pvedaemon hanging because of qga retry
> we use $elapsed >= $timeout.
>
> Isn't this timeout for command execution time and not connect time?
>
> I'm seeing at the end:
> $self->{mux}->set_timeout($fh, $timeout);
> seem to be the command execution time in the muxer

I guess both should be shorter than $timeout?

___
pve-devel mailing list
pve-devel@pve.proxmox.com
https://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
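One way to read the suggestion above: keep the per-command $timeout for the muxer, but bound the connect loop by its own small, fixed timeout. A minimal sketch in Perl, following the variable names from the quoted QMPClient.pm code ($sname, $sotype, $vmid, $starttime); the separate $connect_timeout parameter and its 1s default are assumptions, not the actual patch:

```perl
use strict;
use warnings;
use IO::Socket::UNIX;
use POSIX qw(EINTR EAGAIN);
use Time::HiRes qw(gettimeofday tv_interval usleep);

# Sketch only: retry connecting to the VM socket, but give up after a short,
# fixed connect timeout instead of the (possibly very long) command timeout.
sub open_connection_sketch {
    my ($vmid, $sname, $sotype, $connect_timeout) = @_;
    $connect_timeout //= 1;    # assumed: 1s is enough to detect a dead agent

    my $starttime = [gettimeofday];
    my ($fh, $count) = (undef, 0);
    for (;;) {
        $count++;
        $fh = IO::Socket::UNIX->new(Peer => $sname, Blocking => 0, Timeout => 1);
        last if $fh;
        if ($! != EINTR && $! != EAGAIN) {
            die "unable to connect to VM $vmid $sotype socket - $!\n";
        }
        my $elapsed = tv_interval($starttime, [gettimeofday]);
        if ($elapsed >= $connect_timeout) {
            die "unable to connect to VM $vmid $sotype socket - timeout after $count retries\n";
        }
        usleep(100000);    # assumed: 100ms between retries
    }
    return $fh;
}
```

The per-command $timeout would then only be passed to $self->{mux}->set_timeout($fh, $timeout) once the connection is established, so a long fsfreeze timeout no longer inflates the connect retry loop.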
Re: [pve-devel] pvedaemon hanging because of qga retry
Seems to have been introduced a long time ago, in 2012:
https://git.proxmox.com/?p=qemu-server.git;a=blobdiff;f=PVE/QMPClient.pm;h=9829986ae77e82d340974e4d4128741ef85b4a0e;hp=d026f4d4c3012203d96660a311b1890e84e6aa18;hb=6d04217600f2145ee80d5d62231b8ade34f2e5ff;hpb=037a97463447b06ebf79a7f1d40c596d9955acee

Previously, the connect timeout was 1s. I think we didn't have qga support at that time.
Not sure why it has been increased for qmp commands? (With 1s, it works fine when the qga agent is down.)

- Mail original -
De: "aderumier"
À: "pve-devel"
Envoyé: Vendredi 18 Mai 2018 00:37:30
Objet: Re: [pve-devel] pvedaemon hanging because of qga retry

in qmpclient, open_connection:

    for (;;) {
        $count++;
        $fh = IO::Socket::UNIX->new(Peer => $sname, Blocking => 0, Timeout => 1);
        last if $fh;
        if ($! != EINTR && $! != EAGAIN) {
            die "unable to connect to VM $vmid $sotype socket - $!\n";
        }
        my $elapsed = tv_interval($starttime, [gettimeofday]);
        if ($elapsed >= $timeout) {
            die "unable to connect to VM $vmid $sotype socket - timeout after $count retries\n";
        }
        usleep(10);
    }

we use $elapsed >= $timeout.
Isn't this timeout for command execution time and not connect time?
I'm seeing at the end:

    $self->{mux}->set_timeout($fh, $timeout);

seem to be the command execution time in the muxer

- Mail original -
De: "Alexandre Derumier"
À: "pve-devel"
Envoyé: Jeudi 17 Mai 2018 23:16:36
Objet: [pve-devel] pvedaemon hanging because of qga retry

Hi,
I had a strange behaviour today, with a vm running + qga enabled, but the qga service down in the vm.

After these attempts:

May 17 21:54:01 kvm14 pvedaemon[20088]: VM 745 qmp command failed - VM 745 qmp command 'guest-fsfreeze-thaw' failed - unable to connect to VM 745 qga socket - timeout after 101 retries
May 17 21:55:10 kvm14 pvedaemon[20088]: VM 745 qmp command failed - VM 745 qmp command 'guest-fsfreeze-thaw' failed - unable to connect to VM 745 qga socket - timeout after 101 retries

some api requests gave 596 errors, mainly for vm 745 (/api2/json/nodes/kvm14/qemu/745/status/current), but also for the server kvm14 on /api2/json/nodes/kvm14/qemu.

Restarting the pvedaemon fixed the problem.

10.59.100.141 - root@pam [17/05/2018:21:53:51 +0200] "POST /api2/json/nodes/kvm14/qemu/745/agent/fsfreeze-freeze HTTP/1.1" 596 -
10.59.100.141 - root@pam [17/05/2018:21:55:00 +0200] "POST /api2/json/nodes/kvm14/qemu/745/agent/fsfreeze-freeze HTTP/1.1" 596 -
10.59.100.141 - root@pam [17/05/2018:22:01:28 +0200] "POST /api2/json/nodes/kvm14/qemu/745/agent/fsfreeze-freeze HTTP/1.1" 596 -
10.3.99.10 - root@pam [17/05/2018:22:01:30 +0200] "GET /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 -
10.59.100.141 - root@pam [17/05/2018:22:02:21 +0200] "GET /api2/json/nodes/kvm14/qemu HTTP/1.1" 596 -
10.3.99.10 - root@pam [17/05/2018:22:03:05 +0200] "GET /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 -
10.59.100.141 - root@pam [17/05/2018:22:03:32 +0200] "GET /api2/json/nodes/kvm14/qemu HTTP/1.1" 596 -
10.3.99.10 - root@pam [17/05/2018:22:04:40 +0200] "GET /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 -
10.59.100.141 - root@pam [17/05/2018:22:05:01 +0200] "GET /api2/json/nodes/kvm14/qemu HTTP/1.1" 596 -
10.59.100.141 - root@pam [17/05/2018:22:05:59 +0200] "GET /api2/json/nodes/kvm14/qemu HTTP/1.1" 596 -
10.3.99.10 - root@pam [17/05/2018:22:06:15 +0200] "GET /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 -
10.3.99.10 - root@pam [17/05/2018:22:07:50 +0200] "GET /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 -
10.3.99.10 - root@pam [17/05/2018:22:09:25 +0200] "GET /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 -
10.3.99.10 - root@pam [17/05/2018:22:11:00 +0200] "GET /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 -
10.3.99.10 - root@pam [17/05/2018:22:12:35 +0200] "GET /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 -
10.59.100.141 - root@pam [17/05/2018:22:14:19 +0200] "GET /api2/json/nodes/kvm14/qemu HTTP/1.1" 596 -
10.3.99.10 - root@pam [17/05/2018:22:15:44 +0200] "GET /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 -
10.3.99.10 - root@pam [17/05/2018:22:17:19 +0200] "GET /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 -
10.3.99.10 - root@pam [17/05/2018:22:18:54 +0200] "GET /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 -
10.3.99.10 - root@pam [17/05/2018:22:20:29 +0200] "GET /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 -
10.3.99.10 - root@pam [17/05/2018:22:22:04 +0200] "GET /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 -
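The fix discussed earlier in the thread (probe the agent cheaply before issuing a long-timeout command) can be sketched along these lines, reusing the qga_check_running() pattern already quoted from PVE::QemuServer; the 3s ping timeout matches that helper, while the wrapper function itself is illustrative, not actual PVE code:

```perl
use strict;
use warnings;
use PVE::QemuServer;    # sketch assumes the PVE perl modules are installed

# Sketch only: guest-ping involves no guest IO, so a small timeout is safe.
# Only if the agent answers do we issue the expensive fsfreeze call, which
# legitimately needs a high timeout.
sub freeze_if_agent_alive {
    my ($vmid) = @_;

    my $alive = eval {
        PVE::QemuServer::vm_mon_cmd($vmid, "guest-ping", timeout => 3);
        1;
    };
    if (!$alive) {
        warn "Qemu Guest Agent is not running - $@";
        return 0;    # skip the freeze entirely, avoid the long retry loop
    }

    eval { PVE::QemuServer::vm_mon_cmd($vmid, "guest-fsfreeze-freeze"); };
    warn $@ if $@;
    return 1;
}
```

This mirrors what vzdump already does: a failed 3s ping fails fast, instead of every caller (including unrelated status queries behind the same pvedaemon worker) waiting out the fsfreeze timeout.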
Re: [pve-devel] pvedaemon hanging because of qga retry
in qmpclient, open_connection:

    for (;;) {
        $count++;
        $fh = IO::Socket::UNIX->new(Peer => $sname, Blocking => 0, Timeout => 1);
        last if $fh;
        if ($! != EINTR && $! != EAGAIN) {
            die "unable to connect to VM $vmid $sotype socket - $!\n";
        }
        my $elapsed = tv_interval($starttime, [gettimeofday]);
        if ($elapsed >= $timeout) {
            die "unable to connect to VM $vmid $sotype socket - timeout after $count retries\n";
        }
        usleep(10);
    }

we use $elapsed >= $timeout.
Isn't this timeout for command execution time and not connect time?

I'm seeing at the end:

    $self->{mux}->set_timeout($fh, $timeout);

seem to be the command execution time in the muxer

- Mail original -
De: "Alexandre Derumier"
À: "pve-devel"
Envoyé: Jeudi 17 Mai 2018 23:16:36
Objet: [pve-devel] pvedaemon hanging because of qga retry

Hi,
I had a strange behaviour today, with a vm running + qga enabled, but the qga service down in the vm.

After these attempts:

May 17 21:54:01 kvm14 pvedaemon[20088]: VM 745 qmp command failed - VM 745 qmp command 'guest-fsfreeze-thaw' failed - unable to connect to VM 745 qga socket - timeout after 101 retries
May 17 21:55:10 kvm14 pvedaemon[20088]: VM 745 qmp command failed - VM 745 qmp command 'guest-fsfreeze-thaw' failed - unable to connect to VM 745 qga socket - timeout after 101 retries

some api requests gave 596 errors, mainly for vm 745 (/api2/json/nodes/kvm14/qemu/745/status/current), but also for the server kvm14 on /api2/json/nodes/kvm14/qemu.

Restarting the pvedaemon fixed the problem.

10.59.100.141 - root@pam [17/05/2018:21:53:51 +0200] "POST /api2/json/nodes/kvm14/qemu/745/agent/fsfreeze-freeze HTTP/1.1" 596 -
10.59.100.141 - root@pam [17/05/2018:21:55:00 +0200] "POST /api2/json/nodes/kvm14/qemu/745/agent/fsfreeze-freeze HTTP/1.1" 596 -
10.59.100.141 - root@pam [17/05/2018:22:01:28 +0200] "POST /api2/json/nodes/kvm14/qemu/745/agent/fsfreeze-freeze HTTP/1.1" 596 -
10.3.99.10 - root@pam [17/05/2018:22:01:30 +0200] "GET /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 -
[...]
10.3.99.10 - root@pam [17/05/2018:22:23:39 +0200] "GET /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 -
10.3.99.10 - root@pam [17/05/2018:22:25:14 +0200] "GET /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 -
10.3.99.10 - root@pam [17/05/2018:22:26:49 +0200] "GET /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 -
10.3.99.10 - root@pam [17/05/2018:22:28:24 +0200] "GET /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 -
10.3.99.10 - root@pam [17/05/2018:22:29:59 +0200] "GET /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 -
10.3.99.10 - root@pam [17/05/2018:22:31:34 +0200] "GET /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 -
10.3.99.10 - root@pam [17/05/2018:22:34:44 +0200] "GET /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 -
10.3.99.18 - root@pam [17/05/2018:22:35: