Hi Brian,

Thanks for the heads up. It turned out we had DNS issues on the compute nodes of the cluster. The suggestion to look at the slurmd log files was the right one. Error messages in the logs pointed us to the DNS problem.
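In case it is useful to anyone hitting something similar, here is a rough sketch of a per-node name-resolution check that could be launched with srun. It is only illustrative: the file name is made up, and the hostname argument is a placeholder for whatever name the compute nodes need to resolve (for example the slurmctld host).

/* resolve_check.c (illustrative sketch): each task resolves the given
 * hostname and reports success or failure along with its own node name. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <hostname-to-resolve>\n", argv[0]);
        return 1;
    }

    char node[256] = "unknown";
    gethostname(node, sizeof(node));   /* which compute node prints this line */

    struct addrinfo hints, *res = NULL;
    memset(&hints, 0, sizeof(hints));
    hints.ai_family   = AF_UNSPEC;
    hints.ai_socktype = SOCK_STREAM;

    int rc = getaddrinfo(argv[1], NULL, &hints, &res);
    if (rc != 0) {
        printf("%s: FAILED to resolve %s (%s)\n", node, argv[1], gai_strerror(rc));
        return 1;
    }
    printf("%s: resolved %s OK\n", node, argv[1]);
    freeaddrinfo(res);
    return 0;
}

Running something like srun -N 48 ./resolve_check <controller-hostname> makes it easy to see at a glance which nodes cannot resolve the name, without digging through each node's slurmd log.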
Howard

--
Howard Pritchard
HPC-DES
Los Alamos National Laboratory

From: Brian Gilmer <bfgil...@gmail.com>
Reply-To: slurm-dev <slurm-dev@schedmd.com>
Date: Thursday, January 12, 2017 at 6:59 AM
To: slurm-dev <slurm-dev@schedmd.com>
Subject: [slurm-dev] Re: srun job launch time issue

I have seen the clocks out of sync. I use an MPI program that checks the local time after syncing on a barrier. I also use:

bgilmer@jupiter:~> srun -t 1 -N90 date +%s | sort -n | uniq -c
     90 1484229062

to see if there is a spread. On a small job like this there should not be any variance. As the number of tasks increases, you should begin to see a spread:

bgilmer@jupiter:~> srun -t 1 -n 1024 date +%s | sort -n | uniq -c
      1 1484229252
    343 1484229253
    278 1484229254
    232 1484229255
    170 1484229256

This was run on a small Cray XC. Hope this was helpful.

On Wed, Jan 11, 2017 at 1:22 PM, Pritchard Jr., Howard <howa...@lanl.gov> wrote:

Hi SLURM folks,

I recently got SLURM (16.05.6) set up on a small cluster (48 nodes, x86_64 + Intel OPA) and things appear to be nominal except for one odd performance problem with srun launch times. I don't observe this on other clusters running SLURM at our site.

What I'm observing is that regardless of whether the application being launched is a command (e.g. /bin/hostname) or an MPI application, I get reasonable job launch times when using one node, but as soon as I use two or more nodes there is about a 10 second overhead to get the processes on the additional nodes started. For example:

[hpp@hi-master ~]$ srun -n 8 -N 1 date
Wed Jan 11 12:11:29 MST 2017
Wed Jan 11 12:11:29 MST 2017
Wed Jan 11 12:11:29 MST 2017
Wed Jan 11 12:11:29 MST 2017
Wed Jan 11 12:11:29 MST 2017
Wed Jan 11 12:11:29 MST 2017
Wed Jan 11 12:11:29 MST 2017
Wed Jan 11 12:11:29 MST 2017

[hpp@hi-master ~]$ srun -n 8 -N 2 date
Wed Jan 11 12:10:35 MST 2017
Wed Jan 11 12:10:35 MST 2017
Wed Jan 11 12:10:35 MST 2017
Wed Jan 11 12:10:35 MST 2017
Wed Jan 11 12:10:44 MST 2017
Wed Jan 11 12:10:44 MST 2017
Wed Jan 11 12:10:44 MST 2017
Wed Jan 11 12:10:44 MST 2017

[hpp@hi-master ~]$ srun -n 8 -N 4 date
Wed Jan 11 12:10:57 MST 2017
Wed Jan 11 12:10:57 MST 2017
Wed Jan 11 12:11:07 MST 2017
Wed Jan 11 12:11:06 MST 2017
Wed Jan 11 12:11:07 MST 2017
Wed Jan 11 12:11:06 MST 2017
Wed Jan 11 12:11:07 MST 2017
Wed Jan 11 12:11:07 MST 2017

Has anyone observed this problem before? Any suggestions on how to resolve it would be much appreciated.

Thanks,

Howard

--
Howard Pritchard
HPC-DES
Los Alamos National Laboratory

--
Speak when you are angry--and you will make the best speech you'll ever regret.
- Laurence J. Peter
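For anyone who wants to try the barrier-synchronized clock check Brian describes, a minimal sketch along these lines could work; the file name and output format here are made up for illustration, not taken from his program.

/* clock_skew.c (illustrative sketch): all ranks block on a barrier, read the
 * local clock as soon as they leave it, and rank 0 reports the spread between
 * the earliest and latest timestamps. */
#include <mpi.h>
#include <stdio.h>
#include <sys/time.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Sync on a barrier, then immediately sample the local wall clock. */
    MPI_Barrier(MPI_COMM_WORLD);
    struct timeval tv;
    gettimeofday(&tv, NULL);
    double local = tv.tv_sec + tv.tv_usec / 1.0e6;

    /* Gather min/max across ranks; the difference approximates clock skew
     * plus any per-rank delay in leaving the barrier. */
    double tmin, tmax;
    MPI_Reduce(&local, &tmin, 1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);
    MPI_Reduce(&local, &tmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("ranks=%d  clock spread after barrier: %.3f s\n", size, tmax - tmin);

    MPI_Finalize();
    return 0;
}

Run under srun with increasing task counts, much like the date examples above, it gives a quick upper bound on how far apart the node clocks appear.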