Hi Brian,

Thanks for the heads up.  It turned out we had DNS issues on the compute nodes
of the cluster.  The suggestion to look at the slurmd log files was the right
one; error messages in the logs pointed us to the DNS problem.
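
For reference, the checks that surfaced it were along these lines (a rough
sketch only; the log path, node name, and address below are placeholders,
since SlurmdLogFile is site-configurable):

# Find where slurmd logs on this cluster, then scan a compute node's log for errors:
scontrol show config | grep SlurmdLogFile
grep -i error /var/log/slurmd.log
# Verify forward and reverse DNS resolution for a peer compute node:
host cn001
host 10.0.0.1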

Howard

--
Howard Pritchard
HPC-DES
Los Alamos National Laboratory


From: Brian Gilmer <bfgil...@gmail.com>
Reply-To: slurm-dev <slurm-dev@schedmd.com>
Date: Thursday, January 12, 2017 at 6:59 AM
To: slurm-dev <slurm-dev@schedmd.com>
Subject: [slurm-dev] Re: srun job launch time issue

I have seen the clocks out of sync.  I use an MPI program that checks the local
time after syncing on a barrier.
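
A quicker daemon-level check is to ask each node how far off its clock is, e.g.
the sketch below, which assumes chrony is the time daemon on the compute nodes
(use ntpq -p instead on ntpd-based systems):

srun -l -t 1 -N 90 chronyc tracking | grep "System time"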

I also use:
bgilmer@jupiter:~> srun -t 1 -N90 date +%s | sort -n | uniq -c
     90 1484229062

to see if there is a spread.  On a small job like this there should not be any
variance.  As the number of tasks increases, you should begin to see a spread.

bgilmer@jupiter:~> srun -t 1 -n 1024 date +%s | sort -n | uniq -c
      1 1484229252
    343 1484229253
    278 1484229254
    232 1484229255
    170 1484229256

This was run on a small Cray XC.

Hope this was helpful.


On Wed, Jan 11, 2017 at 1:22 PM, Pritchard Jr., Howard
<howa...@lanl.gov> wrote:
Hi SLURM folks,

I recently got SLURM (16.05.6) set up on a small cluster (48 x86_64 nodes +
Intel OPA), and things appear to be nominal except for one odd performance
problem with srun launch times.  I don’t observe this on other clusters running
SLURM at our site.

What I’m observing is that regardless of whether the application being launched
is a simple command (e.g. /bin/hostname) or an MPI application, I get reasonable
job launch times when using one node, but as soon as I use two or more nodes,
there is about a 10 second overhead to get the processes on the additional nodes
started:

For example:


[hpp@hi-master ~]$ srun -n 8 -N 1 date
Wed Jan 11 12:11:29 MST 2017
Wed Jan 11 12:11:29 MST 2017
Wed Jan 11 12:11:29 MST 2017
Wed Jan 11 12:11:29 MST 2017
Wed Jan 11 12:11:29 MST 2017
Wed Jan 11 12:11:29 MST 2017
Wed Jan 11 12:11:29 MST 2017
Wed Jan 11 12:11:29 MST 2017

[hpp@hi-master ~]$ srun -n 8 -N 2 date
Wed Jan 11 12:10:35 MST 2017
Wed Jan 11 12:10:35 MST 2017
Wed Jan 11 12:10:35 MST 2017
Wed Jan 11 12:10:35 MST 2017
Wed Jan 11 12:10:44 MST 2017
Wed Jan 11 12:10:44 MST 2017
Wed Jan 11 12:10:44 MST 2017
Wed Jan 11 12:10:44 MST 2017

[hpp@hi-master ~]$ srun -n 8 -N 4 date
Wed Jan 11 12:10:57 MST 2017
Wed Jan 11 12:10:57 MST 2017
Wed Jan 11 12:11:07 MST 2017
Wed Jan 11 12:11:06 MST 2017
Wed Jan 11 12:11:07 MST 2017
Wed Jan 11 12:11:06 MST 2017
Wed Jan 11 12:11:07 MST 2017
Wed Jan 11 12:11:07 MST 2017


Has anyone observed this problem before?


Any suggestions on how to resolve this problem would be much appreciated.


Thanks,


Howard


--
Howard Pritchard
HPC-DES
Los Alamos National Laboratory




--
Speak when you are angry--and you will make the best speech you'll ever regret.
  - Laurence J. Peter
