My best guess is that the network configuration in your Slurm setup is wrong,
so that slurmctld can talk to the slurmd daemons on the compute nodes, but
messages are not getting through in the other direction. There's a SLURM
troubleshooting guide here that may help:
https://computing.llnl.gov/linux/slurm/troubleshoot.html
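
One rough way to test the "one direction only" theory is to try a plain TCP connection in each direction. The sketch below is only an illustration: it assumes the default ports (SlurmctldPort=6817, SlurmdPort=6818), which your slurm.conf may override, and a successful connect shows only that the port is reachable, not that Slurm's own messaging works.

```python
import socket

# Assumed defaults; check SlurmctldPort and SlurmdPort in your slurm.conf.
SLURMCTLD_PORT = 6817
SLURMD_PORT = 6818


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

Run `can_connect("w51", SLURMD_PORT)` on the controller, and `can_connect(controller_host, SLURMCTLD_PORT)` on the compute node, where `controller_host` is a placeholder for whatever name the node uses to reach slurmctld. If only the second call fails, that would be consistent with the guess above.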
________________________________________
From: [email protected] [[email protected]] On Behalf 
Of Paul Thirumalai [[email protected]]
Sent: Tuesday, February 22, 2011 3:19 PM
To: [email protected]
Subject: Re: [slurm-dev] squeue display stale output

I see the following line in slurmctld.log

 Node w51 appears to have a different slurm.conf than the slurmctld.  This 
could cause issues with communication and functionality.  Please review both 
files and make sure they are the same.  If this is expected ignore, and set 
DebugFlags=NO_CONF_HASH in your slurm.conf.

I will sync the conf files and retry
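
A quick way to confirm whether two copies of slurm.conf really differ is to compare their digests. This is a minimal sketch, not the hash Slurm computes internally: it only reports whether the two files are byte-for-byte identical, and the paths in the usage note are hypothetical.

```python
import hashlib


def file_digest(path):
    """Return the SHA-256 hex digest of a file's raw bytes."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


def same_config(path_a, path_b):
    """Return True if the two files are byte-for-byte identical."""
    return file_digest(path_a) == file_digest(path_b)
```

For example, after copying the node's file to the controller, `same_config("/etc/slurm/slurm.conf", "/tmp/slurm.conf.w51")` (both paths are assumptions) should return True once the files have been synced.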

On Tue, Feb 22, 2011 at 3:15 PM, Danny Auble <[email protected]> wrote:
Is there anything of interest in the slurmctld log?  How about the slurmd log 
on the node running the job?


On 02/22/11 15:14, Paul Thirumalai wrote:
But new jobs are still getting stuck in CG state.

On Tue, Feb 22, 2011 at 3:13 PM, Paul Thirumalai <[email protected]> wrote:
I am using Slurm 2.2. I restarted slurmctld with the -c option, and squeue now
does not show any new jobs.


On Tue, Feb 22, 2011 at 3:11 PM, Jerry Smith <[email protected]> wrote:
Have you tried restarting slurmctld?  We had this issue a few revisions back,
but it seemed to go away with a newer rev (though I couldn't tell you which
one).

Have you tried setting the state to RESUME for the nodes?
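
If you want to script the RESUME step, something like the following could work. It is a sketch only: it assumes scontrol is on the PATH and that you run it with administrator rights, and the helper names are made up for illustration.

```python
import subprocess


def resume_command(node_names):
    """Build the scontrol argv that sets the given nodes to RESUME."""
    return [
        "scontrol",
        "update",
        "NodeName=" + ",".join(node_names),
        "State=RESUME",
    ]


def resume_nodes(node_names):
    """Run the command; raises CalledProcessError if scontrol fails."""
    subprocess.run(resume_command(node_names), check=True)
```

For the node mentioned in this thread, `resume_command(["w51"])` builds the equivalent of `scontrol update NodeName=w51 State=RESUME`.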

What version of Slurm are you running?

Jerry

Paul Thirumalai wrote:
I am not using an epilog when launching these jobs. The jobs are simple Python
scripts that run the hostname command and put the output in a file that is
provided on the command line.

I can see that the file was written. This tells me that the job completed.
When I log in to the node I don't see the process running. However, squeue
still tells me that the job is in CG state.

I stopped all the Slurm daemons and restarted them, but the state of the job is
still CG and it still shows up in squeue.
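
To see at a glance which jobs are stuck and on which nodes, the squeue output can be filtered for the CG state. The sketch below assumes the output was produced with `squeue -h -o "%i %t %N"` (job ID, compact state, node list); the function name is hypothetical.

```python
def completing_jobs(squeue_output):
    """Return (job_id, nodelist) pairs for jobs whose compact state is CG.

    Expects one job per line in the form "jobid state nodelist", as
    produced by the assumed format string "%i %t %N".
    """
    stuck = []
    for line in squeue_output.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[1] == "CG":
            # The node list can be empty for some jobs.
            nodes = fields[2] if len(fields) > 2 else ""
            stuck.append((fields[0], nodes))
    return stuck
```

If every stuck job reports the same node, that node's slurmd and its connection back to slurmctld would be the first things to check.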

On Tue, Feb 22, 2011 at 2:45 PM, Paul Thirumalai <[email protected]> wrote:
Actually, I just figured out that all jobs seem to be stuck in the COMPLETING
state. I am now reading
https://computing.llnl.gov/linux/slurm/faq.html#comp


I will continue to troubleshoot. If I run into issues, I will repost to this
thread.




