[slurm-dev] Multinode setup trouble

2017-05-16 Thread Ben Mann
Hello Slurm dev, I just set up a small test cluster on two Ubuntu 14.04 machines, installed SLURM 17.02 from source. I started slurmctld, slurmdbd and slurmd on a master and just slurmd on a slave. When I run a job on two nodes, it completes instantly on master, but never on slave. Here are my

[slurm-dev] Re: KNL node down after reboot

2017-05-16 Thread Costin Caramarcu
Hi, A few suggestions: 1) Try increasing the timeouts: SlurmctldTimeout=600 SlurmdTimeout=600 ResumeTimeout=600 2) Make sure that when slurm starts, the node has finished mounting file systems and the whole boot procedure is done. Regards, Costin On Tue, May 16, 2017 at

[slurm-dev] Re: KNL node down after reboot

2017-05-16 Thread Ryan Novosielski
SLURM has worked this way as long as I can remember. If you don't use scontrol reboot_nodes, nodes are "down" when they come back because SLURM wasn't notified about the reboot. This is configurable in slurm.conf. From: nico.faer...@id.unibe.ch

[slurm-dev] KNL node down after reboot

2017-05-16 Thread nico.faerber
Hi, We want to introduce Intel Knights Landing (KNL) nodes into our cluster, and observed the following problem: Node reboots successfully with desired NUMA and MCDRAM modes, but remains in state down. From slurmctl.log: (…)[ 2017-05-16T15:16:10.437] _update_node_avail_features: nodes

[slurm-dev] Re: Is there anyway to commit job with different user?

2017-05-16 Thread John Hearns
Sun, as the others have responded, you should make sure your userids are the same across the cluster. You really must put in the effort to do that. However - SGE does have a usermapping feature https://linux.die.net/man/5/sge_usermapping I do not know if there is something similar in Slurm.

[slurm-dev] Re: Is there anyway to commit job with different user?

2017-05-16 Thread E.S. Rosenberg
On Tue, May 16, 2017 at 11:39 AM, Sun Chenggen wrote: > Yes, user on my cluster synchronized, but I want to submit job on my > client machine, not on the cluster. > So only if you synchronize, for instance by making sure your UID/GID on your client matches your UID/GID

[slurm-dev] Re: Adjusting MaxJobCount and SlurmctldPort settings

2017-05-16 Thread gilles
Mark, changing SlurmctldPort will not help with the queue limit anyway, so I suggest you update it (if really needed) during a maintenance when no job is running Cheers, Gilles - Original Message - Hello, Changing slurmctld port should probably wait until all jobs have

[slurm-dev] Re: Adjusting MaxJobCount and SlurmctldPort settings

2017-05-16 Thread Douglas Jacobsen
Hello, Changing slurmctld port should probably wait until all jobs have stopped running. Running jobs won't fail in this case, but there is a good chance they will fail to complete properly, and the compute node operating them might get stuck in the completing state (since the slurmstepd

[slurm-dev] Adjusting MaxJobCount and SlurmctldPort settings

2017-05-16 Thread Mark S. Holliman
Hi everyone, Does anyone know if changing the slurmctld settings for MaxJobCount and SlurmctldPort will cause jobs already running/waiting to fail? My users have hit the default 10,000 queue limit, and I'd like to increase that, but not if it's going to kill everything that's running. I

[slurm-dev] Re: How to get pids of a job

2017-05-16 Thread Bjørn-Helge Mevik
GHui writes: > The command "scontrol show jobs jobid" will show the nodes. And then > the command "ssh nodes scontrol listpids jobid" will show the pids. > > But this is a little complex. Is there more simple command, like LSF bjobs, > show the nodes and the pids. Not that I

[slurm-dev] Re: Is there anyway to commit job with different user?

2017-05-16 Thread Felip Moll
It is not possible, at least in a supported way. The first requirement of the admin guide tells: 1. Make sure the clocks, users and groups (UIDs and GIDs) are synchronized across the cluster. From: https://slurm.schedmd.com/quickstart_admin.html *--Felip Moll Marquès* Computer

[slurm-dev] Is there anyway to commit job with different user?

2017-05-16 Thread Sun Chenggen
Hi everyone: Is there any way to commit job with different user? My slurm cluster doesn’t have the same user config as my local slurm-client machine. If I commit job on my local machine, it failed with the message “srun: error: Application launch failed: User not found on host”. Do I have to