Re: Problem calling 'virsh' in a script

2022-05-15 Thread Digimer

  
  
On 2022-05-15 12:13, Digimer wrote:


  
  On 2022-05-15 12:07, Laine Stump wrote:
  
  On 5/15/22 11:48 AM, Digimer wrote: 
Hi all, 
  
     I've got a series of programs that monitor various things
  on a CentOS Stream 8 VM host. All of these scripts work when
  called directly. However, when I have a parent program that
  calls all the little programs in series, I found that some
  virsh calls hang. 


Is your script being called from a libvirt "hook" script?
(https://libvirt.org/hooks.html) If so, that won't work - a libvirt hook
script is called from within libvirt, and can't call back into libvirt. 

Other than that, is there anything different about the context
the script is being run from vs. the context you're directly
running virsh from? 
  
  It's a perl script making a shell (system) call. So it's
basically:
  # Read the output of 'virsh list --all' one line at a time.
  open (my $fh, "/usr/bin/virsh list --all |") or die;
while (<$fh>)
{
    chomp;
    # Do things
}
close $fh;
  
    There's about 15 programs that are sitting in a given
directory. When the parent program runs, it looks at the scripts
in the directory and runs them (again as simple shell calls),
one after the other. This is where things fail. I'm happy to
provide more detail or add debugging if you'd like.
  

I just did a test where I reversed the order that the scripts
  were called, so that the problematic one was called first (in case
  it was a connection limit being hit or something), and I had a new
  failure mode...
When the parent program ran, it hung hard. I call the child
  scripts with 'timeout 30 /path/to/child/script', but the timeout
  never fired. In journald, I saw:
May 15 12:18:43 nr-a03n01.nray.ca libvirtd[1643714]: internal
  error: connection closed due to keepalive timeout
  May 15 12:25:31 nr-a03n01.nray.ca libvirtd[1643714]: Cannot recv
  data: Connection reset by peer
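
For readers following along, the parent/child arrangement described here (every script in a directory run in series, each wrapped in 'timeout', with the return code echoed afterwards) might look roughly like the sketch below. This is not scancore's actual code, which the thread does not show; the function name 'run_agents' and its two arguments are hypothetical.

```shell
#!/bin/sh
# Minimal sketch: run every executable file in a directory, one after the
# other, each under `timeout`. The function name and its arguments are
# hypothetical and only serve the illustration.
run_agents() {
    agent_dir=$1
    limit=${2:-30}
    for agent in "$agent_dir"/*; do
        # Skip anything that is not a regular, executable file.
        [ -f "$agent" ] && [ -x "$agent" ] || continue
        timeout "$limit" "$agent" 2>&1
        rc=$?
        # GNU timeout exits with status 124 when the limit was reached.
        echo "agent:$(basename "$agent") return_code:$rc"
    done
}
```

A call such as 'run_agents /usr/sbin/scancore-agents 30' would be one way to use it; a return code of 124 then distinguishes "killed by timeout" from an agent's own exit status.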
I had to kill the parent program with two 'ctrl + c' entries;

  time scancore --run-once 
  scancore has started.
  Running the scan agent: [scan-storcli] with a timeout of: [30]
  seconds now...
  - Scan agent: [scan-storcli] exited after: [4] seconds with the
  return code: [0].
  Running the scan agent: [scan-server] with a timeout of: [30]
  seconds now...
  ^C
  
  Process with PID: [1705068] exiting on SIGINT.
  ^C
  
  Process with PID: [1705068] exiting on SIGINT.
  
  real    13m43.899s
  user    0m5.253s
  sys    0m2.097s
  
  
  I checked 'ps aux' and found that, even after the ctrl + c, the
  processes were still running...
  
  
  # scancore --run-once 
  scancore has started.
  Running the scan agent: [scan-storcli] with a timeout of: [30]
  seconds now...
  - Scan agent: [scan-storcli] exited after: [5] seconds with the
  return code: [0].
  Running the scan agent: [scan-server] with a timeout of: [30]
  seconds now...
  ^C
  
  Process with PID: [1708093] exiting on SIGINT.
  ^C
  
  Process with PID: [1708093] exiting on SIGINT.
  [root@nr-a03n01 ~]# ps aux | grep scan
  root 1708900  0.0  0.0  12732  3132 pts/1    S    12:45   0:00
  sh -c /usr/bin/timeout 30
  /usr/sbin/scancore-agents/scan-server/scan-server 2>&1;
  /usr/bin/echo return_code:$?
  root 1708901  0.0  0.0  11592   976 pts/1    S    12:45   0:00
  /usr/bin/timeout 30
  /usr/sbin/scancore-agents/scan-server/scan-server
  root 1708902  5.8  0.0 249400 91960 pts/1    T    12:45   0:01
  /usr/bin/perl /usr/sbin/scancore-agents/scan-server/scan-server
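
One detail in the listing above may be worth checking: the STAT column shows the perl process as 'T', while 'sh' and 'timeout' are 'S'. In procps 'ps', 'T' means the process was stopped by a job-control signal, which is a different condition from sleeping in a blocked call. The thread does not establish why it was stopped. A small sketch for checking that state on a given PID, with hypothetical helper names, assuming a procps-style 'ps':

```shell
#!/bin/sh
# Print the state letters that ps reports for one PID (empty if it is gone).
# Assumes a procps-style ps that accepts `-o stat= -p PID`.
proc_state() {
    ps -o stat= -p "$1" 2>/dev/null | tr -d ' '
}

# Succeed when the state starts with "T", i.e. the process is stopped.
is_stopped() {
    case "$(proc_state "$1")" in
        T*) return 0 ;;
        *)  return 1 ;;
    esac
}
```

For example, 'is_stopped 1708902 && echo stopped' would report on the perl process from the listing while it still exists.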
  
While this is hanging, _other_ programs call 'virsh list --all'
  just fine. And as mentioned, if I call the problem script
  directly, it runs just fine (confirmed by watching the logs,
  'virsh list --all' returns and logic runs fine)...

  [root@nr-a03n01 ~]# ps aux | grep scan
  root 1709321  0.0  0.0  12144  1108 pts/1    S+   12:47   0:00
  grep --color=auto scan

[root@nr-a03n01 ~]#
  /usr/sbin/scancore-agents/scan-server/scan-server 

[root@nr-a03n01 ~]# ps aux | grep scan
  root 1709716  0.0  0.0  12144  1112 pts/1    S+   12:48   0:00
  grep --color=auto scan
  [root@nr-a03n01 ~]# 
  
I am so confused...
digimer

  




Re: Problem calling 'virsh' in a script

2022-05-15 Thread Digimer

  
  
On 2022-05-15 12:07, Laine Stump wrote:

On 5/15/22 11:48 AM, Digimer wrote:
  
  Hi all,


   I've got a series of programs that monitor various things on
a CentOS Stream 8 VM host. All of these scripts work when called
directly. However, when I have a parent program that calls all
the little programs in series, I found that some virsh calls
hang.

  
  
  Is your script being called from a libvirt "hook" script?
  (https://libvirt.org/hooks.html) If so, that won't work - a
  libvirt hook script is called from within libvirt, and can't call
  back into libvirt.
  
  
  Other than that, is there anything different about the context the
  script is being run from vs. the context you're directly running
  virsh from?
  

It's a perl script making a shell (system) call. So it's
  basically:
  # Read the output of 'virsh list --all' one line at a time.
  open (my $fh, "/usr/bin/virsh list --all |") or die;
  while (<$fh>)
  {
      chomp;
      # Do things
  }
  close $fh;

  There's about 15 programs that are sitting in a given
  directory. When the parent program runs, it looks at the scripts
  in the directory and runs them (again as simple shell calls), one
  after the other. This is where things fail. I'm happy to provide
  more detail or add debugging if you'd like.




  
  

   Initially, there were two scripts that were hanging
repeatedly. One called 'virsh net-list --all --name', so I
changed it to check for configs in
'/etc/libvirt/qemu/networks/', and that script started working.
The other script though calls 'virsh list --all', and that can't
be easily swapped out, so I really need to find the source of
these hangs.


   Whenever the hang happens, about 30~45 seconds later, I see
'libvirtd[1643714]: Cannot recv data: Connection reset by peer'.


   I think the issue is striking other scripts that run, but
this scenario is happening predictably and consistently right
now.


   I thought it might be a concurrent connect limit or a problem
with how many times virsh is called by a script, so I wrote a
test script that kept calling 'virsh list --all' each second,
but it was close to 100 calls without hanging, far more than all
the calls in my scripts combined, so I don't think that's it.


Any advice/guidance would be very much appreciated!


-- 
Digimer
  


  




Re: Problem calling 'virsh' in a script

2022-05-15 Thread Laine Stump

On 5/15/22 11:48 AM, Digimer wrote:

Hi all,

   I've got a series of programs that monitor various things on a CentOS 
Stream 8 VM host. All of these scripts work when called directly. 
However, when I have a parent program that calls all the little programs 
in series, I found that some virsh calls hang.


Is your script being called from a libvirt "hook" script? 
(https://libvirt.org/hooks.html) If so, that won't work - a libvirt hook 
script is called from within libvirt, and can't call back into libvirt.


Other than that, is there anything different about the context the 
script is being run from vs. the context you're directly running virsh from?




   Initially, there were two scripts that were hanging repeatedly. One 
called 'virsh net-list --all --name', so I changed it to check for 
configs in '/etc/libvirt/qemu/networks/', and that script started 
working. The other script though calls 'virsh list --all', and that 
can't be easily swapped out, so I really need to find the source of 
these hangs.


   Whenever the hang happens, about 30~45 seconds later, I see 
'libvirtd[1643714]: Cannot recv data: Connection reset by peer'.


   I think the issue is striking other scripts that run, but this 
scenario is happening predictably and consistently right now.


   I thought it might be a concurrent connect limit or a problem with 
how many times virsh is called by a script, so I wrote a test script 
that kept calling 'virsh list --all' each second, but it was close to 
100 calls without hanging, far more than all the calls in my scripts 
combined, so I don't think that's it.


Any advice/guidance would be very much appreciated!

--
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of Einstein’s brain 
than in the near certainty that people of equal talent have lived and died in cotton 
fields and sweatshops." - Stephen Jay Gould





Problem calling 'virsh' in a script

2022-05-15 Thread Digimer

  
  
Hi all,
  I've got a series of programs that monitor various things on a
  CentOS Stream 8 VM host. All of these scripts work when called
  directly. However, when I have a parent program that calls all the
  little programs in series, I found that some virsh calls hang.
  Initially, there were two scripts that were hanging repeatedly.
  One called 'virsh net-list --all --name', so I changed it to
  check for configs in '/etc/libvirt/qemu/networks/', and that
  script started working. The other script though calls 'virsh list
  --all', and that can't be easily swapped out, so I really need to
  find the source of these hangs.
  Whenever the hang happens, about 30~45 seconds later, I see
  'libvirtd[1643714]: Cannot recv data: Connection reset by peer'. 

  I think the issue is striking other scripts that run, but this
  scenario is happening predictably and consistently right now. 

  I thought it might be a concurrent connect limit or a problem
  with how many times virsh is called by a script, so I wrote a test
  script that kept calling 'virsh list --all' each second, but it
  was close to 100 calls without hanging, far more than all the
  calls in my scripts combined, so I don't think that's it.
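
The test script itself is not shown in the thread. A minimal sketch of that kind of repeated-call check, with a per-call time limit so that a hang shows up as a counted failure rather than stalling the loop, could look like this; the function name 'repeat_check' and its argument order are hypothetical.

```shell
#!/bin/sh
# Run a command repeatedly under a time limit and count the failures.
# Usage: repeat_check COUNT LIMIT_SECONDS PAUSE_SECONDS COMMAND [ARGS...]
# (hypothetical helper, not the script used in the thread)
repeat_check() {
    count=$1
    limit=$2
    pause=$3
    shift 3
    failures=0
    i=1
    while [ "$i" -le "$count" ]; do
        # A non-zero status covers both a failing command and a timeout.
        if ! timeout "$limit" "$@" >/dev/null 2>&1; then
            failures=$((failures + 1))
            echo "call $i failed or timed out"
        fi
        i=$((i + 1))
        sleep "$pause"
    done
    echo "calls:$count failures:$failures"
}
```

Something like 'repeat_check 100 30 1 virsh list --all' would approximate the test described above.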
Any advice/guidance would be very much appreciated!
-- 
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of Einstein’s brain than in the near certainty that people of equal talent have lived and died in cotton fields and sweatshops." - Stephen Jay Gould
  




Re: Updating domains definitions via API

2022-05-15 Thread Laine Stump

On 5/14/22 6:42 PM, Darragh Bailey wrote:

Hi,

On Sat 14 May 2022, 21:11 Laine Stump wrote:


Caveat - I'm completely unfamiliar with ruby and the libvirt-ruby API
bindings.

If there is a problem that causes the domain config to not be updated,
libvirt will return an error. So I would suspect one of the two things
is happening:


Thanks, that's what I was expecting should happen, just wanted to be 
sure that there wasn't some other behaviour in place for compatibility 
reasons.


1) there may be a problem in the libvirt-ruby bindings that causes the
error reported by the call to libvirt (in whatever C code is behind the
ruby bindings) to not be properly propagated to ruby. I would hope
this isn't the case, but "bugs happen", so it should be considered as a
possibility.


A quick look suggests that the code looks to raise an exception if the 
dom pointer returned is NULL, so I think the bindings are correct. But I 
will check that the version of ruby-libvirt I have installed matches 
the source code I looked at.


2) As I said in my earlier mail, any changes that are made will take
effect the next time the domain is destroyed and restarted. This also
means that the changes won't be reflected in the "live/status" XML of
the domain until that time. If you want to see the new configuration
after you've made changes, you should add the VIR_DOMAIN_XML_INACTIVE
flag when requesting the domain XML. Possibly you haven't included this
flag, and that's why you think that your change hasn't taken effect?


Ah, I forgot to outline where in the lifecycle the update is taking 
place. The domain isn't running when the code attempts to update the 
definition.


Does that still mean that the VIR_DOMAIN_XML_INACTIVE flag is needed? I 
was assuming when the domain is inactive the XML changes would be 
reflected immediately.


No, your thinking was correct - if the domain isn't active, then the 
change should take effect immediately, and there is no difference 
whether or not you have VIR_DOMAIN_XML_INACTIVE.


I've never done anything directly with the nvram setting (just accepted 
whatever virt-manager put in there), but from your other message, it 
sounds like you've found a bona fide libvirt bug (either that, or I just 
don't know enough about how the nvram settings work :-)). Can you file 
an issue at https://gitlab.com/groups/libvirt/-/issues ?




Oddly I thought during some experiments when the added NVRAM XML element 
was ignored, the updated number of CPUs which was in the same XML 
definition passed in was applied.


Another indication that it's a bug - updates to the domain config are 
always an all-or-nothing thing.


Will dig further tomorrow or Monday on the version of ruby-libvirt 
installed into my rvm dev env as well as checking passing in the flag.


I'm sure it'll turn out to be something obvious that I'm overlooking.

Thanks,
--
Darragh





Re: Updating domains definitions via API

2022-05-15 Thread Darragh Bailey
Hi,

On Sat, 14 May 2022 at 23:42, Darragh Bailey wrote:

> Hi,
>
> On Sat 14 May 2022, 21:11 Laine Stump wrote:
>
>> Caveat - I'm completely unfamiliar with ruby and the libvirt-ruby API
>> bindings.
>>
>> If there is a problem that causes the domain config to not be updated,
>> libvirt will return an error. So I would suspect one of the two things
>> is happening:
>>
>
Looks like I've stumbled across an edge case here regarding domain config
not being fully updated but also not returning an error.


>> 1) there may be a problem in the libvirt-ruby bindings that causes the
>> error reported by the call to libvirt (in whatever C code is behind the
>> ruby bindings) to not be properly propagated to ruby. I would hope
>> this isn't the case, but "bugs happen", so it should be considered as a
>> possibility.
>>
>
I decided to test the behaviour more directly via virsh rather than
through the ruby bindings. Having replicated the same result there, and
having reviewed the ruby binding API code, I believe the binding code is
working fine; the problem is unexpected behaviour in libvirt.

It appears that if the XML passed in contains an nvram XML tag without a
corresponding loader tag, then the nvram tag will be dropped without an
error. In fact if you change any other information such as the vcpu
definition in the same update, that will still be applied while the nvram
tag is ignored.
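
Until that is resolved, a caller could guard against the condition described above before submitting a definition. The sketch below is a crude text match, not a real XML parse, and the function name is hypothetical; it only flags a file that mentions an nvram element with no loader element.

```shell
#!/bin/sh
# Crude text check (not a real XML parse): succeed when FILE mentions an
# <nvram> element but no <loader> element. Hypothetical helper name.
nvram_without_loader() {
    grep -q '<nvram' "$1" && ! grep -q '<loader' "$1"
}
```

For example, 'nvram_without_loader test_with_nvram.xml && echo "nvram would be dropped"' flags the case before the definition is passed to libvirt.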

Put simply, the ruby code isn't raising an exception because libvirt
thinks nothing went wrong and returns a non-null pointer.

I put together a gist to make it easier to show what I'm seeing:
https://gist.github.com/electrofelix/6f66714c14a0d6e3b1078037aadae398

I'm assuming at this point that this is a bug: the domain XML in
test_with_nvram.xml should be rejected because not all of it is applied.
--
Darragh