This may help: https://computing.llnl.gov/linux/slurm/troubleshoot.html#network
All of the nodes talking with each other is how SLURM's hierarchical communications work.

________________________________________
From: [email protected] [[email protected]] On Behalf Of Paul Thirumalai [[email protected]]
Sent: Monday, February 14, 2011 10:41 AM
To: [email protected]
Subject: Re: [slurm-dev] sbatch seems to have stopped working

Thanks again. This time I stopped all the nodes and restarted them. When I look at the slurm log file for the nodes, I see the following messages:

[2011-02-14T10:20:48] Procs=2 Sockets=2 Cores=1 Threads=1 Memory=4048 TmpDisk=26074 Uptime=7566205
[2011-02-14T10:20:48] debug2: _slurm_connect failed: Connection refused
[2011-02-14T10:20:48] debug2: Error connecting slurm stream socket at 192.168.8.119:6817: Connection refused

Any idea what may be causing this? Also, should all the compute nodes be able to talk to each other?

On Mon, Feb 14, 2011 at 9:26 AM, Jette, Moe <[email protected]> wrote:

SLURM uses hierarchical communications between the compute nodes. I'd _guess_ that you don't have a consistent slurm.conf file across all nodes, or you failed to restart all of the slurmd daemons on the compute nodes after adding the new nodes, or your networking doesn't support communications between all of the compute nodes.

Also take a look in the SlurmctldLogFile plus the SlurmdLogFile on abc001 for more information about that job.

________________________________________
From: [email protected] [[email protected]] On Behalf Of Paul Thirumalai [[email protected]]
Sent: Monday, February 14, 2011 9:22 AM
To: [email protected]
Subject: Re: [slurm-dev] sbatch seems to have stopped working

Thanks for the input.
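Moe's checklist above (a consistent slurm.conf everywhere, slurmd restarted on every node, open network paths between nodes) can be sanity-checked mechanically. A minimal sketch of the config-consistency part, using two local files to stand in for configs fetched from two compute nodes (on a real cluster you would pull each node's /etc/slurm/slurm.conf over ssh first; all paths and node names here are illustrative):

```shell
# Stand-ins for /etc/slurm/slurm.conf as fetched from two compute nodes;
# in practice, copy them from the nodes with scp/ssh first.
printf 'SelectType=select/cons_res\n' > /tmp/slurm.conf.abc001
printf 'SelectType=select/linear\n'   > /tmp/slurm.conf.abc002

# One unique checksum means the configs match; more than one means drift.
md5sum /tmp/slurm.conf.abc001 /tmp/slurm.conf.abc002 \
  | awk '{print $1}' | sort -u | wc -l
```

The "Connection refused" lines in the slurmd log above point at the network leg of the same checklist: from each compute node, port 6817 on 192.168.8.119 (the controller's SlurmctldPort) must be reachable, which can be probed with something like `nc -z 192.168.8.119 6817`.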
Now when I do an squeue -tall I get the following output:

JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
 2774 all_part mysrun.s kdsd03 CD 0:00 1 abc001

This would indicate that the job completed. However, the output file was not created, which tells me that the job did not run. The srun command in mysrun.sh is:

srun -N1 -o /home/kdsd03/oas/klurm/test.out /home/kdsd03/oas/klurm/test.sh input

The test.sh script basically echoes the input, so test.out should contain the input. Now when I remove these abc nodes, everything seems to work fine.

On Fri, Feb 11, 2011 at 11:25 AM, Jette, Moe <[email protected]> wrote:

squeue by default only shows running or pending jobs. Your job either completed or failed (error state). Try "squeue -tall" or "squeue --state=all" or "scontrol show job <jobid>".

________________________________________
From: [email protected] [[email protected]] On Behalf Of Paul Thirumalai [[email protected]]
Sent: Friday, February 11, 2011 10:45 AM
To: [email protected]
Subject: [slurm-dev] sbatch seems to have stopped working

So I had a SLURM setup that was working fine. I made the following configuration changes:

1. Added about 150 more nodes to the SLURM setup
2. Added a new partition for these nodes
3. Added a 3rd logical partition that contains all the nodes
4. Changed SelectType to select/cons_res
5. Changed SelectTypeParameters to CR_Core_Memory

Now after I make the changes, it seems as though sbatch does not work. When I submit a job using sbatch at the command line, it says "Submitted batch job <jobid>", but when I do an squeue I don't see that job running.
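For concreteness, the test script described above can be sketched as follows. This runs outside SLURM (with paths moved to /tmp so it is self-contained); the actual srun invocation from mysrun.sh is kept as a comment, and running test.sh directly shows what test.out should end up containing:

```shell
# test.sh as described in the thread: it just echoes its input argument.
cat > /tmp/test.sh <<'EOF'
#!/bin/sh
echo "$1"
EOF
chmod +x /tmp/test.sh

# Inside mysrun.sh the job step is launched as:
#   srun -N1 -o /home/kdsd03/oas/klurm/test.out /home/kdsd03/oas/klurm/test.sh input
# Running the script directly shows what test.out should contain:
/tmp/test.sh input > /tmp/test.out
cat /tmp/test.out
# prints "input"
```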
If I submit the same job using srun, it works fine. Any help would be appreciated.

Thanks,
Paul
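The five configuration changes listed in the original message might look like the following slurm.conf excerpt. This is a hedged reconstruction, not the poster's actual file: the node range, the partition name abc_part, and the hardware values (taken from the "Sockets=2 Cores=1 Threads=1 Memory=4048" line slurmd reports elsewhere in the thread) are illustrative.

```
# slurm.conf excerpt (illustrative names and values, not the poster's file)
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory

# The ~150 new nodes; hardware matches the slurmd report in the thread.
NodeName=abc[001-150] Sockets=2 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=4048

# New partition for those nodes, plus the third partition spanning everything
# (all_part is the partition visible in the squeue output above).
PartitionName=abc_part Nodes=abc[001-150]
PartitionName=all_part Nodes=ALL Default=YES
```

One common pitfall with a switch to CR_Core_Memory is that memory becomes a consumable resource, so every node definition needs an accurate RealMemory; nodes left at the default (very small) value can leave jobs unable to schedule after this change.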
