Thanks again.
This time I stopped all the nodes and restarted them.
When I look at the slurmd log file on the nodes, I see the following
messages:

[2011-02-14T10:20:48] Procs=2 Sockets=2 Cores=1 Threads=1 Memory=4048
TmpDisk=26074 Uptime=7566205
[2011-02-14T10:20:48] debug2: _slurm_connect failed: Connection refused
[2011-02-14T10:20:48] debug2: Error connecting slurm stream socket at
192.168.8.119:6817: Connection refused


Any idea what may be causing this? Also, should all the compute nodes be
able to talk to each other?
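If it helps, 6817 is SLURM's default SlurmctldPort, so the refused connection looks like slurmd failing to reach the controller at 192.168.8.119 rather than node-to-node traffic. A rough sanity check I could run from a compute node (check_ctld is just a throwaway helper name, not a SLURM command):

```shell
# check_ctld: test whether a TCP port answers from this node.
# A "refused" result here matches the "Connection refused" lines in the
# slurmd log. Uses bash's /dev/tcp redirection with a short timeout.
check_ctld() {
    host=$1; port=$2
    if timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
        echo "reachable"
    else
        echo "refused or filtered"
    fi
}

# e.g. check_ctld 192.168.8.119 6817
```

If that reports "refused or filtered", the slurmctld daemon is probably not running on the controller (or a firewall is in the way), which would also explain the log lines above.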

On Mon, Feb 14, 2011 at 9:26 AM, Jette, Moe <[email protected]> wrote:

> SLURM uses hierarchical communications between the compute nodes.
> I'd _guess_ that you don't have a consistent slurm.conf file across all
> nodes
> or you failed to restart all of the slurmd daemons on the compute nodes
> after adding the new nodes
> or your networking doesn't support communications between all of the
> compute nodes.
> Also take a look in the SlurmctldLogFile plus SlurmdLogFile on abc001 for
> more information about that job.
> ________________________________________
> From: [email protected] [[email protected]] On
> Behalf Of Paul Thirumalai [[email protected]]
> Sent: Monday, February 14, 2011 9:22 AM
> To: [email protected]
> Subject: Re: [slurm-dev] sbatch seems to have stopped working
>
> Thanks for the input. Now when I do an "squeue -tall" I get the following
> output:
>
> JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
>   2774  all_part mysrun.s   kdsd03  CD       0:00      1 abc001
>
> This would indicate that the job completed. However, the output file was
> not created, which tells me that the job did not run.
>
> The srun command in mysrun.sh is
>
> srun -N1 -o /home/kdsd03/oas/klurm/test.out /home/kdsd03/oas/klurm/test.sh
> input.
>
> The test.sh script basically echoes the input, so test.out should contain
> the input.
>
> Now when I remove these abc nodes, everything seems to work fine.
>
>
> On Fri, Feb 11, 2011 at 11:25 AM, Jette, Moe <[email protected]> wrote:
> squeue by default only shows running or pending jobs.
> Your job either completed or failed (error state).
> Try "squeue -tall" or "squeue --state=all" or "scontrol show job <jobid>"
>
> ________________________________________
> From: [email protected]<mailto:[email protected]>
> [[email protected]<mailto:[email protected]>] On
> Behalf Of Paul Thirumalai [[email protected]<mailto:
> [email protected]>]
> Sent: Friday, February 11, 2011 10:45 AM
> To: [email protected]<mailto:[email protected]>
> Subject: [slurm-dev] sbatch seems to have stopped working
>
> So I had a slurm setup that was working fine.
> I made the following configuration changes.
> 1. Added about 150 more nodes to the slurm setup
> 2. Added a new partition for these nodes
> 3. Added a 3rd logical partition that contains all the nodes
> 4. Changed SelectType to select/cons_res
> 5. Changed SelectTypeParameters to CR_Core_Memory.
>
> Now, after making these changes, it seems as though sbatch does not work.
>
> When I submit a job using sbatch at the command line, it says "Submitted
> batch job <jobid>".
> But when I do an squeue, I don't see that job running.
>
> If I submit the same job using srun, it works fine.
>
> Any help would be appreciated.
>
> Thanks
> Paul
>
>
>
