[slurm-users] Re: Configless Slurm Error: failed to fetch remote configs

Xaver Stiensmeier via slurm-users Fri, 30 Jan 2026 00:27:57 -0800

Hey Ole,

I apologize for the late reply.

I receive nothing from `dig +short +search +ndots=2 -t SRV -n _slurmctld._tcp`, but shouldn't setting


   cat /etc/systemd/system/slurmd.service.d/override.conf
   # Override systemd service to set conditional path
   # Type=simple
   [Service]
   ExecStart=
   ExecStart=/usr/sbin/slurmd --conf-server=master

 be enough given the documentation?

   The *--conf-server* options takes precedence over the DNS record.

But I think that you are right and somehow Slurm ignores the IP when Type is not simple. I am confused.


     nc -vz master 6817
   Connection to master (192.168.20.41) 6817 port [tcp/*] succeeded!

works, and starting it later either via the command line or via Type=simple works, too. I will try to dig deeper into the logs to see whether the parameter gets skipped somehow, but I would still appreciate any help.


Best,
Xaver

On 1/20/26 15:31, Ole Holm Nielsen via slurm-users wrote:

Hi Xaver,

Are you sure that your DNS SRV record is responding?

$ dig +short +search +ndots=2 -t SRV -n _slurmctld._tcp
See https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_configuration/#testing-configless-setup
Best regards,
Ole

On 1/20/26 15:26, Xaver Stiensmeier wrote:
Hey Ole,
we currently use Slurm 24.11. Regarding the build process, I have to get in touch with our cloud admins, as we build it and then offer it via a mirror. However, I can confirm that it all worked without error before using configless.
The network is definitely already up, as restarting slurmd does not help. However, I noticed that in these cases slurmd fails VERY quickly; it definitely does not wait for any timeout.
I primarily mentioned Ansible to support that I am pretty sure that the system is set up the same as before using configless.
Best,
Xaver

On 1/20/26 13:39, Ole Holm Nielsen wrote:
Hi Xaver,
I have no experience with Ubuntu systems, which may behave differently from our RockyLinux 8. Setting up Slurm with Ansible should be fine, and this is also how we configure our Slurm servers and login nodes (but not slurmd nodes). Once Ansible is finished the system ought to work.
Did you build your Slurm packages with the Debian build system, see
https://slurm.schedmd.com/quickstart_admin.html#debuild
Do you run a recent Slurm version (24.11 and later are currently supported)?
I wonder if the error:
error: _fetch_child: failed to fetch remote configs: Protocol authentication error
is due to the network not yet being up after reboot? Restarting slurmd manually should hopefully work.
IHTH,
Ole

On 1/19/26 14:48, Xaver Stiensmeier wrote:
Hey Ole,
thank you so much for your detailed documentation, which leaves me with both answers and questions. Apparently, the aforementioned error had nothing to do with munge but with some issues regarding the reload of slurmd which I can't really reproduce. I think I somehow had two slurmd processes running and only killed one, but this is difficult to tell, because once I redid the entire setup, half of the issue disappeared.
The remaining issue is that slurmd can't start via systemctl, as slurmd never notifies systemd that it is ready. I was able to fix this by setting:
    [Service]
    Type=simple
which allows the start. Slurm is then able to reach the node, config files are pulled as expected, and I can schedule commands on the node.
While this leaves me with a running system, I still get:

    ubuntu@worker:~$ systemctl status slurmd.service
    ○ slurmd.service - Slurm node daemon
         Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; preset: enabled)
        Drop-In: /etc/systemd/system/slurmd.service.d
                 └─override.conf
         Active: inactive (dead) since Mon 2026-01-19 13:31:28 UTC; 8min ago
       Duration: 7ms
        Process: 19712 ExecStart=/usr/sbin/slurmd --conf-server=master (code=exited, status=0/SUCCESS)
       Main PID: 19712 (code=exited, status=0/SUCCESS)
          Tasks: 11 (limit: 19147)
         Memory: 4.2M (peak: 6.4M)
            CPU: 110ms
         CGroup: /system.slice/slurmd.service
                 └─19714 /usr/sbin/slurmd --conf-server=master

    Jan 19 13:31:28 worker systemd[1]: Started slurmd.service - Slurm node daemon.
    Jan 19 13:31:28 worker systemd[1]: slurmd.service: Deactivated successfully.
    Jan 19 13:31:28 worker systemd[1]: slurmd.service: Unit process 19713 (slurmd) remains running after unit stopped.
    Jan 19 13:31:28 worker systemd[1]: slurmd.service: Unit process 19714 (slurmd) remains running after unit stopped.
    Jan 19 13:31:28 worker slurmd[19716]: error: _fetch_child: failed to fetch remote configs: Protocol authentication error
    Jan 19 13:31:28 worker slurmd[19714]: error: _establish_configuration: failed to load configs. Retrying in 10 seconds.

This leaves me with the guess that the initial failure, which then succeeds on retry, might cause systemd to abort early. Note that we set up our Slurm cluster via Ansible scripts, so there might also be a race condition I am overlooking that causes parts of the authentication not to be ready; however, this was not an issue before we tried configless.
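
One possible explanation, offered as a sketch rather than a confirmed diagnosis: recent upstream slurmd.service files use Type=notify and start the daemon with `--systemd`, which keeps slurmd in the foreground and lets it report readiness to systemd. An override that replaces ExecStart without that flag lets slurmd fork into the background, which would be consistent with PID 19712 exiting with status=0/SUCCESS above while 19714 keeps running. Assuming the packaged unit follows the upstream layout (this can be checked with `systemctl cat slurmd.service`), a drop-in along these lines keeps the flag:

```ini
# /etc/systemd/system/slurmd.service.d/override.conf
# Assumption: the packaged unit uses Type=notify and "slurmd --systemd".
# The empty ExecStart= clears the packaged command before setting a new one.
[Service]
ExecStart=
ExecStart=/usr/sbin/slurmd --systemd --conf-server=master
```

If the packaged unit expands `$SLURMD_OPTIONS` from /etc/default/slurmd, as the upstream unit does, setting `SLURMD_OPTIONS="--conf-server=master"` there may avoid overriding ExecStart at all. After either change, `systemctl daemon-reload` followed by `systemctl restart slurmd` applies it.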
Best,
Xaver

On 1/16/26 12:11, Ole Holm Nielsen via slurm-users wrote:
Hi Xaver,
We have been running Configless Slurm for a number of years, and we're very happy with this setup. I have documented all the detailed configurations we made in this Wiki page, so maybe you want to consult this page:
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_configuration/#configless-slurm-setup
IHTH,
Ole

On 1/16/26 11:11, Xaver Stiensmeier via slurm-users wrote:
Hey everyone,
in the past we set up clusters with configs on each node. Now we want to explore configless. Without changing anything else, we therefore followed https://slurm.schedmd.com/configless_slurm.html and added 'enable_configless' in the config on the master:
SlurmctldParameters=cloud_dns,idle_on_node_suspend,enable_configless,reconfig_on_restart
and start each worker's slurmd with the conf-server parameter:

    # Override systemd service to set conditional path
    [Service]
    ExecStart=
    ExecStart=/usr/sbin/slurmd --conf-server=master

However, this leads to:
    slurmd: error: _fetch_child: failed to fetch remote configs: Protocol authentication error
    slurmd: error: _establish_configuration: failed to load configs. Retrying in 10 seconds.

on the workers and on the master (/var/log/slurm/slurmctld) to:
    [2026-01-16T10:00:06.681] error: Munge decode failed: Invalid credential
    [2026-01-16T10:00:06.681] auth/munge: _print_cred: ENCODED: Thu Jan 01 00:00:00 1970
    [2026-01-16T10:00:06.681] auth/munge: _print_cred: DECODED: Thu Jan 01 00:00:00 1970
    [2026-01-16T10:00:06.681] error: slurm_unpack_received_msg: [[worker]:24295] auth_g_verify: REQUEST_CONFIG has authentication error: Unspecified error
    [2026-01-16T10:00:06.681] error: slurm_unpack_received_msg: [[worker]:24295] Protocol authentication error
The munge key setup is the same as before, so I don't think there is anything wrong with it unless something changes with configless (slurm.conf):
    AuthType=auth/munge
    CryptoType=crypto/munge
    AuthAltTypes=auth/jwt
    AuthAltParameters=jwt_key=/etc/slurm/jwt-secret.key
I found https://groups.google.com/g/slurm-users/c/Q7FVkhx-bOs but this seems unrelated, as both can talk fine with each other:
    worker:~$ nc -zv master 6817
    Connection to master (192.168.20.169) 6817 port [tcp/*] succeeded!
I tried adding more "-v" to the slurmd start, but that did not give more information. I am unsure how to debug this further. Somehow I think it must be a munge issue, but I am confused, as this part hasn't changed.
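
A common way to rule munge in or out is a credential round trip: encode a credential on one host and decode it on the other. The sketch below wraps that check in a small helper. The function name `munge_roundtrip` is hypothetical, and it assumes SSH access from the worker to the master and that `munge` and `unmunge` are installed on both hosts.

```shell
# Hypothetical helper: encode a munge credential locally and decode it
# on a remote host. Exit status 0 means the remote munged accepted the
# credential, which requires matching keys and roughly synchronized clocks.
munge_roundtrip() {
    remote_host=$1
    if ! command -v munge >/dev/null 2>&1; then
        echo "munge is not installed on this host" >&2
        return 2
    fi
    # The status of the pipeline is the status of the remote unmunge.
    munge -n | ssh "$remote_host" unmunge >/dev/null
}
```

On the worker, `munge_roundtrip master && echo "munge OK"` would then show whether the master accepts credentials created on the worker. A failure points at the key file or clock skew rather than at the configless setup itself.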
Best regards,
Xaver

-- 
slurm-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
