[slurm-dev] Re: Some nodes are not responding..

2013-02-19 Thread Sefa Arslan
I am able to ssh, telnet (to the slurmd port 6818) and ping lufer121 from the controller, although the node seems to be down according to sview and the slurmctld log. Also, I can send and run jobs on the node via srun or sbatch using --nodelist before the node is marked as down. The "

[slurm-dev] Re: GrpCPUMins vs GrpCPURunMins

2013-02-19 Thread Mark Nelson
Hi Lloyd, The GrpCPUMins limit is a limit on the number of CPU minutes that an account can consume before they're stopped and unable to launch any more jobs. Once this happens their usage is decayed (at a rate of PriorityDecayHalfLife) or their usage is reset (according to PriorityUsageResetP

[slurm-dev] Re: X11 forwarding for interactive jobs?

2013-02-19 Thread Christopher Samuel
On 15/02/13 19:48, Matthieu Hautreux wrote: > I have just pushed the latest version on github which correct some > bugs. You should grab that one if you plan to do some tests. Thanks for that, I've been distracted by moving house and the resulting b

[slurm-dev] Re: task/affinity, --cpu_bind=socket and -c > 1

2013-02-19 Thread Martin . Perry
Magnus, I'm not sure exactly what pattern of allocation and binding you're trying to achieve. If you can describe clearly the layout you want I might be able to suggest the combination of options you need to use. If you want each task to be confined to a single socket, you should be able to do

[slurm-dev] Re: Some nodes are not responding..

2013-02-19 Thread David Bigagli
> [2013-02-19T17:38:59] agent/is_node_resp: node:lufer121 rpc:1008 : Can't find an address, check slurm.conf This is the error. Either the controller does not know the machine, or there is a problem with its address or with communication (name resolution?). Is lufer121 talking to the controller using

[slurm-dev] Re: batch submit failure -Slurmd could not create a batch directory or file

2013-02-19 Thread Kevin Abbey
Dear Carles and slurm-dev list, After further review of more detailed logs I am convinced that this issue is due to incorrect nfs4 settings or an nfs4 bug (such bugs are reported on the RH, CentOS and Fedora lists). I'm still working to resolve this. Below are the additional notes if anyone has overc

[slurm-dev] Re: Cannot submit jobs to slurm

2013-02-19 Thread Moe Jette
Look at your slurmd log file on bengal30. I would also recommend upgrading. Slurm v2.2 is two years old. Quoting Mads Boye: > Hi. > I've got slurm-2.2.0 running, and until yesterday it worked fine. > When I try to run my test script "test.sh": > > #!/bin/sh > > ### Job name > #SBATCH --job-nam

[slurm-dev] Re: Some nodes are not responding..

2013-02-19 Thread Sefa Arslan
There is not much information in the worker's log: [2013-02-19T17:34:11] slurmd version 2.4.1 started [2013-02-19T17:34:11] debug3: finished daemonize [2013-02-19T17:34:11] debug3: Trying to load plugin /usr/lib64/slurm/switch_none.so [2013-02-19T17:34:11] switch NONE plugin loaded [2013-02-19T17:

[slurm-dev] Re: Some nodes are not responding..

2013-02-19 Thread David Bigagli
Hi, have a look at the slurmd.log on the non-responding hosts. Does it show any relevant information? /David On Tue, Feb 19, 2013 at 2:46 PM, Sefa Arslan wrote: > > Hello, > > After a few minutes I restart the SLURM service on the worker node, the > controller says the worker node is not respo