Thanks again. This time I stopped all the nodes and restarted them. When I look at the slurmd log file on the nodes, I see the following messages.
[2011-02-14T10:20:48] Procs=2 Sockets=2 Cores=1 Threads=1 Memory=4048 TmpDisk=26074 Uptime=7566205
[2011-02-14T10:20:48] debug2: _slurm_connect failed: Connection refused
[2011-02-14T10:20:48] debug2: Error connecting slurm stream socket at 192.168.8.119:6817: Connection refused

Any idea what may be causing this? Also, should all the compute nodes be able to talk to each other?

On Mon, Feb 14, 2011 at 9:26 AM, Jette, Moe <[email protected]> wrote:
> SLURM uses hierarchical communications between the compute nodes.
> I'd _guess_ that you don't have a consistent slurm.conf file across all nodes,
> or you failed to restart all of the slurmd daemons on the compute nodes
> after adding the new nodes,
> or your networking doesn't support communications between all of the
> compute nodes.
> Also take a look in the SlurmctldLogFile plus SlurmdLogFile on abc001 for
> more information about that job.
> ________________________________________
> From: [email protected] [[email protected]] On
> Behalf Of Paul Thirumalai [[email protected]]
> Sent: Monday, February 14, 2011 9:22 AM
> To: [email protected]
> Subject: Re: [slurm-dev] sbatch seems to have stopped working
>
> Thanks for the input. Now when I do an squeue -tall I get the following
> output:
>
> JOBID PARTITION     NAME   USER ST  TIME NODES NODELIST(REASON)
>  2774  all_part mysrun.s kdsd03 CD  0:00     1 abc001
>
> This would indicate that the job completed. However, the output file was
> not created, which tells me that the job did not run.
>
> The srun command in mysrun.sh is
>
> srun -N1 -o /home/kdsd03/oas/klurm/test.out /home/kdsd03/oas/klurm/test.sh input
>
> The test.sh script basically echoes the input, so test.out should contain
> the input.
>
> Now when I remove these abc nodes, everything seems to work fine.
>
> On Fri, Feb 11, 2011 at 11:25 AM, Jette, Moe <[email protected]> wrote:
> squeue by default only shows running or pending jobs.
> Your job either completed or failed (error state).
> Try "squeue -tall" or "squeue --state=all" or "scontrol show job <jobid>".
>
> ________________________________________
> From: [email protected] [[email protected]] On
> Behalf Of Paul Thirumalai [[email protected]]
> Sent: Friday, February 11, 2011 10:45 AM
> To: [email protected]
> Subject: [slurm-dev] sbatch seems to have stopped working
>
> So I had a slurm setup that was working fine.
> I made the following configuration changes:
> 1. Added about 150 more nodes to the slurm setup
> 2. Added a new partition for these nodes
> 3. Added a 3rd logical partition that contains all the nodes
> 4. Changed SelectType to select/cons_res
> 5. Changed SelectTypeParameters to CR_Core_Memory
>
> Now after I make the changes it seems as though sbatch does not work.
>
> When I submit a job using sbatch at the command line it says "Submitted
> batch job <jobid>",
> but when I do an squeue I don't see that job running.
>
> If I submit the same job using srun, it works fine.
>
> Any help would be appreciated.
>
> Thanks
> Paul
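
One quick way to rule out Moe's first guess (an inconsistent slurm.conf across nodes) is to compare checksums of each node's copy against the controller's. The sketch below is a minimal, hypothetical example: the node names (abc001, abc002), the /etc/slurm path, and the controller address are placeholders for whatever your cluster actually uses, not values confirmed in this thread.

```shell
#!/bin/sh
# Sketch: detect a slurm.conf mismatch by comparing checksums.
# same_conf prints "match" if two copies are byte-identical, "differ" otherwise.
same_conf() {
    a=$(md5sum "$1" | awk '{print $1}')
    b=$(md5sum "$2" | awk '{print $1}')
    if [ "$a" = "$b" ]; then
        echo match
    else
        echo differ
    fi
}

# Across a cluster you would fetch each node's copy first, e.g.
# (hypothetical node names and paths):
#   for n in abc001 abc002; do
#       scp "$n:/etc/slurm/slurm.conf" "/tmp/slurm.conf.$n"
#       echo "$n: $(same_conf /etc/slurm/slurm.conf "/tmp/slurm.conf.$n")"
#   done
#
# The "Connection refused" to 192.168.8.119:6817 in the slurmd log can be
# probed directly from a compute node, e.g.:
#   nc -z 192.168.8.119 6817 && echo open || echo refused
```

If any node reports "differ", push out the controller's slurm.conf and restart slurmd on that node before retesting sbatch.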
