Hello Slurm dev,
I just set up a small test cluster on two Ubuntu 14.04 machines and installed
SLURM 17.02 from source. I started slurmctld, slurmdbd and slurmd on the
master and just slurmd on a slave. When I run a job on two nodes, it
completes instantly on the master, but never on the slave.
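Commands like the following show the node state from both sides (a sketch;
"slave1" is a placeholder for the slave's node name, and the log path depends
on the SlurmdLogFile setting):

    sinfo -Nl                   # node states as the controller sees them
    scontrol show node slave1   # a DOWN/DRAINED node carries a Reason= field
    tail /var/log/slurmd.log    # on the slave: launch or registration errors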
Here are my
Hi,
A few suggestions:
1) Try increasing the timeouts:
SlurmctldTimeout=600
SlurmdTimeout=600
ResumeTimeout=600
2) Make sure that by the time slurmd starts, the node has finished mounting
its file systems and the whole boot procedure is done (see the sketch below).
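One way to express that ordering on systemd-based nodes is a drop-in for the
slurmd unit (a sketch only; on Upstart-based systems the equivalent
dependencies would go into the init script instead):

    # /etc/systemd/system/slurmd.service.d/order.conf
    [Unit]
    After=remote-fs.target network-online.target
    Wants=network-online.target

Run "systemctl daemon-reload" afterwards so the drop-in is picked up.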
Regards,
Costin
On Tue, May 16, 2017 at
SLURM has worked this way as long as I can remember. If you don't use scontrol
reboot_nodes, nodes are "down" when they come back because SLURM wasn't
notified about the reboot. This is configurable in slurm.conf.
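The knob is ReturnToService; a minimal sketch of the relevant slurm.conf line
and the reboot command (the command spelling varies across releases):

    # slurm.conf
    ReturnToService=1   # a DOWN node returns to service when it registers
                        # with a valid configuration

    scontrol reboot_nodes <nodelist>   # "scontrol reboot" in newer releases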
From: nico.faer...@id.unibe.ch
Hi,
We want to introduce Intel Knights Landing (KNL) nodes into our cluster, and
observed the following problem: Node reboots successfully with desired NUMA and
MCDRAM modes, but remains in state down.
From slurmctld.log:
(…) [2017-05-16T15:16:10.437] _update_node_avail_features: nodes
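While this is being debugged, a down node can normally be brought back by
hand (the node name is a placeholder):

    scontrol update NodeName=knl01 State=RESUME
    scontrol show node knl01 | grep -i reason   # why it was set down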
Sun,
as the others have responded, you should make sure your userids are the
same across the cluster.
You really must put in the effort to do that.
However, SGE does have a user mapping feature:
https://linux.die.net/man/5/sge_usermapping
I do not know if there is something similar in Slurm.
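Either way, it is worth verifying that an account resolves to the same
numeric IDs on both sides; for example (user and host names are placeholders):

    getent passwd alice                 # on the client
    ssh headnode getent passwd alice    # on the cluster head node

The UID and GID fields of the two lines must match.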
On Tue, May 16, 2017 at 11:39 AM, Sun Chenggen
wrote:
> Yes, the users on my cluster are synchronized, but I want to submit jobs
> from my client machine, not on the cluster.
>
That only works if you synchronize the client as well, for instance by making
sure your UID/GID on your client matches your UID/GID
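For a small setup, something like the following creates the client-side
account with explicit, matching IDs (all names and numbers are placeholders):

    sudo groupadd -g 1234 alice
    sudo useradd -u 1234 -g 1234 -m alice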
Mark,
Changing SlurmctldPort will not help with the queue limit anyway, so I
suggest you update it (if really needed) during a maintenance window when
no jobs are running.
Cheers,
Gilles
- Original Message -
Hello,
Changing the slurmctld port should probably wait until all jobs have stopped
running. Running jobs won't fail outright in this case, but there is a good
chance they will fail to complete properly, and the compute nodes running
them might get stuck in the completing state (since the slurmstepd
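A rough outline of such a maintenance-window change (a sketch; the restart
commands depend on your init system):

    squeue -h | wc -l            # should print 0 before you proceed
    # edit SlurmctldPort in slurm.conf and distribute it to every node, then:
    service slurmctld restart    # on the controller
    service slurmd restart       # on each compute node, so it picks up the new port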
Hi everyone,
Does anyone know if changing the slurmctld settings for MaxJobCount and
SlurmctldPort will cause jobs already running/waiting to fail? My users have
hit the default 10,000 queue limit, and I'd like to increase that, but not if
it's going to kill everything that's running. I
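For reference, the limit in question is MaxJobCount in slurm.conf; the value
below is arbitrary:

    # slurm.conf
    MaxJobCount=50000

If I remember correctly this setting is only read at slurmctld startup, not
via "scontrol reconfigure", so it fits the same maintenance window as the
port change.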
GHui writes:
> The command "scontrol show jobs jobid" will show the nodes. And then
> the command "ssh nodes scontrol listpids jobid" will show the pids.
>
> But this is a little complex. Is there a simpler command, like LSF bjobs,
> that shows the nodes and the pids?
Not that I know of.
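But a small wrapper over exactly those two commands gets close (a sketch;
the job id is a placeholder):

    JOBID=12345
    for n in $(scontrol show hostnames "$(squeue -h -j "$JOBID" -o %N)"); do
        ssh "$n" scontrol listpids "$JOBID"
    done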
It is not possible, at least in a supported way.
The first requirement of the admin guide says:
1. Make sure the clocks, users and groups (UIDs and GIDs) are
synchronized across the cluster.
From: https://slurm.schedmd.com/quickstart_admin.html
--Felip Moll Marquès
Computer
Hi everyone:
Is there any way to submit a job as a different user? My Slurm cluster
doesn't have the same user config as my local slurm-client machine. If I
submit a job from my local machine, it fails with the message "srun: error:
Application launch failed: User not found on host".
Do I have to