On 2023-01-11 16:23, Vladislav Bogdanov wrote:
And, one more thing can affect that - selinux. I doubt, but that's worth checking.

selinux is permissive, and nothing written to audit.log. Side note; I checked GID and effective GID and well as UID and EUID, all 0.

I recorded environment variables, and removed from below matching ones. Here's the differences;

====
-=] Direct
Environment: [_] -> [/usr/lib/ocf/resource.d/alteeve/server]

-=] crm_resource
Environment: [HA_debug] -> [0]
Environment: [HA_logfacility] -> [none]
Environment: [OCF_EXIT_REASON_PREFIX] -> [ocf-exit-reason:]
Environment: [OCF_OUTPUT_FORMAT] -> [xml]
Environment: [OCF_RA_VERSION_MAJOR] -> [1]
Environment: [OCF_RA_VERSION_MINOR] -> [1]
Environment: [OCF_RESKEY_CRM_meta_timeout] -> [20000]
Environment: [OCF_RESKEY_crm_feature_set] -> [3.16.2]
Environment: [OCF_RESKEY_name] -> [srv04-test]
Environment: [OCF_RESOURCE_INSTANCE] -> [test]
Environment: [OCF_RESOURCE_PROVIDER] -> [alteeve]
Environment: [OCF_RESOURCE_TYPE] -> [server]
Environment: [OCF_ROOT] -> [/usr/lib/ocf]
Environment: [OCF_TRACE_FILE] -> [/dev/stderr]
Environment: [PCMK_logfacility] -> [none]
Environment: [PCMK_service] -> [crm_resource]
Environment: [_] -> [/usr/sbin/crm_resource]
====


Vladislav Bogdanov <bub...@hoster-ok.com> 11 января 2023 г. 22:21:03 написал:

Then I would suggest to log all env vars and compare them, probably something is missing in validate for virsh to be happy.

Madison Kelly <mke...@alteeve.com> 11 января 2023 г. 22:06:45 написал:

On 2023-01-11 01:13, Vladislav Bogdanov wrote:
I suspect that valudate action is run as a non-root user.

I modified the script to log the real and effective UIDs and it's
running as root in both instances.

Madison Kelly <mke...@alteeve.com> 11 января 2023 г. 07:06:55 написал:

On 2023-01-11 00:21, Madison Kelly wrote:
On 2023-01-11 00:14, Madison Kelly wrote:
Hi all,

Edit: Last message was in HTML format, sorry about that.

   I've got a hell of a weird problem, and I am absolutely stumped on
what's going on.

   The short of it is; if my RA is called from the command line, it's
fine. If a resource exists, monitor, enable, disable, all that stuff
works just fine. If I try to create a resource, it hangs on the
validate stage. Specifically, it hangs when 'pcs' calls:

crm_resource --validate --output-as xml --class ocf --agent server
--provider alteeve --option name=<resource_name>

   Specifically, it hangs when it tries to make a shell call (to
virsh, specifically, but that doesn't matter). So to debug, I started
stripping down my RA simpler and simpler until I was left with the
very most basic of programs;

https://pastebin.com/VtSpkwMr

   That is literally the simplest program I could write that made the
shell call. The 'open()' call is where it hangs.

When I call directly;

time /usr/lib/ocf/resource.d/alteeve/server --validate-all --server
srv04-test; echo rc:$?

====
real    0m0.061s
user    0m0.037s
sys    0m0.014s
rc:0
====

It's just fine. I can see in the log the output from the 'virsh' call
as well. However, when I call from crm_resource;

time crm_resource --validate --output-as xml --class ocf --agent
server --provider alteeve --option name=srv04-test; echo rc:$?

====
<pacemaker-result api-version="2.25" request="crm_resource --validate
--output-as xml --class ocf --agent server --provider alteeve --option
name=srv04-test">
   <resource-agent-action action="validate" class="ocf" type="server"
provider="alteeve">
     <overrides/>
     <agent-status code="1" message="error" execution_code="2"
execution_message="Timed Out" reason="Resource agent did not exit
within specified timeout"/>
   </resource-agent-action>
   <status code="1" message="Error occurred">
     <errors>
       <error>crm_resource: Error performing operation: Error
occurred</error>
     </errors>
   </status>
</pacemaker-result>

real    0m20.521s
user    0m0.022s
sys    0m0.010s
rc:1
====

In the log file, I see (from line 20 of the super-simple-test-script):

====
Calling: [/usr/bin/virsh dumpxml --inactive srv04-test 2>&1;
/usr/bin/echo return_code:0 |]
====

Then nothing else.

The strace output is: https://pastebin.com/raw/UCEUdBeP

Environment;

* selinux is permissive
* Pacemaker 2.1.5-4.el8
* pcs 0.10.15
* 4.18.0-408.el8.x86_64
* CentOS Stream release 8

Any help is appreciated, I am stumped. :/

After sending this, I tried having my "RA" call 'hostname', and that
worked fine. I switched back to 'virsh list --all', and that hangs. So
it seems to somehow be related to call 'virsh' specifically.


OK, so more info... Knowing now that it's a problem with the virsh call
specifically (but only when validating, existing VMs monitor, enable,
disable fine, all which repeatedly call virsh), I noticed a few things.

First, I see in the logs:

====
Jan 11 00:30:43 mk-a07n02.digimer.ca libvirtd[2937]: Cannot recv data:
Connection reset by peer
====

So with this, I further simplified my test script to this:

https://pastebin.com/Ey8FdL1t

Then when I ran my test script directly, the strace output is:

Good: https://pastebin.com/Trbq67ub

When my script is called via crm_resource, the strace is this:

Bad: https://pastebin.com/jtbzHrUM

The first difference I can see happens around line 929 in the good
paste, the line "futex(0x7f48b0001ca0, FUTEX_WAKE_PRIVATE, 1) = 0"
exists, which doesn't in the bad paste. Shortly after, I start seeing:

====
line: [write(4, "\1\0\0\0\0\0\0\0", 8)         = 8]
line: [brk(NULL)                               = 0x562b7877d000]
line: [brk(0x562b787aa000)                     = 0x562b787aa000]
line: [write(4, "\1\0\0\0\0\0\0\0", 8)         = 8]
====

Around line 959 in the bad paste. There are more brk() lines, and not
long after the output stops.

--
Madison Kelly
Alteeve's Niche!
Chief Technical Officer
c: +1-647-471-0951
https://alteeve.com/

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


--
Madison Kelly
Alteeve's Niche!
Chief Technical Officer
c: +1-647-471-0951
https://alteeve.com/

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users <https://lists.clusterlabs.org/mailman/listinfo/users>

ClusterLabs home: https://www.clusterlabs.org/ <https://www.clusterlabs.org/>



--
Madison Kelly
Alteeve's Niche!
Chief Technical Officer
c: +1-647-471-0951
https://alteeve.com/

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Reply via email to