As we've scaled up our slurm usage, we've noticed that short, moderate bursts 
of DNS lookup failures are enough to regularly stall slurmctld:

_xgetaddrinfo: getaddrinfo(runner-t35knco5d-project-54-concurrent-0:37251) failed

...this has a cascading effect where, when stalled, the controller can't
always communicate with nodes:

error: Error connecting, bad data: family = 0, port = 0

...and the controller then immediately marks the nodes as unhealthy and kills
jobs:

slurmctld: Killing JobId=3120751 on failed node slurm-0f6cacdc1

The reason for the DNS failures is not an unreliable DNS server or network, but
rather that the jobs are submitted from containers that don't have resolvable
hostnames. This traditionally hasn't disrupted functionality, but we've noticed
that if 8-10 jobs all terminate at the same time (the submitter container
SIGTERMs the srun process), the controller can easily be overloaded for several
seconds, despite having significant free system resources. gdb confirms the
process is hanging on DNS. We can also see "Socket timed out on send/recv
operation" from clients attempting to interact with the controller during the
issue.
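
(One quick way to see the per-lookup stall directly is to run something like
the following on the slurmctld host against one of the container hostnames --
the name below is just an example taken from our logs:

time getent hosts runner-t35knco5d-project-54-concurrent-0

getent hosts goes through the same NSS resolution path as getaddrinfo, so the
wall-clock time it reports is roughly how long each failed lookup holds up the
controller.)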

slurm 24.11.0
RHEL 8.10 kernel 4.18.0-553.58.1.el8_10.x86_64

We're looking into ways to get our ephemeral job submitter containers 
resolvable in DNS to prevent lookup failures (either by giving them resolvable 
hostnames, or by blackholing the records to 0.0.0.0 to allow for fast local 
failure on the slurmctld server). However, it does seem unusual for a handful 
of bad DNS lookups to cause so much disruption in slurmctld.
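
For the blackhole approach, the sketch we have in mind is nothing more than
local /etc/hosts entries on the slurmctld host, e.g.:

# /etc/hosts on the slurmctld host -- illustrative only
0.0.0.0  runner-t35knco5d-project-54-concurrent-0

The obvious catch is that /etc/hosts has no wildcards, so every ephemeral
runner name would have to be added and removed as containers come and go, or
we'd need a local resolver (dnsmasq or similar) to synthesize answers for the
whole naming pattern.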

Is this a known weak point of slurmctld? The slurmctld host is a
single-purpose 16 vCPU / 30 GB EC2 instance with minimal load. We have ~150
nodes, all of which have valid IPs in slurm.conf to remove the need for the
controller to perform lookups for the nodes, but apparently it still needs to
look up the submit host, and we can reliably reproduce these cascading
failures.
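
For reference, the node definitions already look roughly like this (the
address below is made up; the node name is taken from the log above), which is
why the controller shouldn't need DNS for the compute nodes themselves:

# slurm.conf -- illustrative sketch, real addresses differ
NodeName=slurm-0f6cacdc1 NodeAddr=10.0.12.34

As far as we can tell, it's only the submit-host lookup that still goes out to
DNS.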

Another possibility might be to extend SlurmdTimeout to something very long
and hope that the controller recovers from its stall in time to avoid marking
nodes as unhealthy and killing jobs, but it's not clear that will have any
effect, since the first occurrence of "error: Error connecting, bad data:
family = 0, port = 0" immediately drains nodes and kills jobs.

Thanks
