I am able to ssh, telnet (to the slurmd port 6818), and ping
lufer121 from the controller, although the node seems to be down according
to sview and the slurmctld log.
I can also submit and run jobs on the node via srun or sbatch using
--nodelist before the node is marked as down.
The "
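As a general aid for this kind of diagnosis, the node's recorded state and down reason can be inspected and cleared with scontrol. A sketch; only the node name lufer121 comes from the message above:

```shell
# Show the node's state, including the Reason= field set when it was marked down
scontrol show node lufer121

# Once the underlying problem is fixed, return the node to service
scontrol update NodeName=lufer121 State=RESUME
```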
Hi Lloyd,
The GrpCPUMins limit caps the number of CPU minutes that an
account can consume before its users are stopped and unable to launch any more
jobs. Once this happens, their usage is decayed (at a rate of
PriorityDecayHalfLife) or their usage is reset (according to
PriorityUsageResetP
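For illustration, such a per-account limit is normally set through sacctmgr. A sketch only; the account name and the value here are made up:

```shell
# Cap the account at 100000 accumulated CPU-minutes
sacctmgr modify account myaccount set GrpCPUMins=100000

# Check the limit that is now in place
sacctmgr show association account=myaccount format=Account,GrpCPUMins
```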
On 15/02/13 19:48, Matthieu Hautreux wrote:
> I have just pushed the latest version on GitHub, which corrects some
> bugs. You should grab that one if you plan to do some tests.
Thanks for that, I've been distracted by moving house and the
resulting b
Magnus,
I'm not sure exactly what pattern of allocation and binding you're trying
to achieve. If you can describe clearly the layout you want I might be
able to suggest the combination of options you need to use. If you want
each task to be confined to a single socket, you should be able to do
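As one possible illustration of per-socket confinement (a sketch only, not necessarily the combination the reply went on to recommend; the task count and binary name are made up, and the underscore spelling --cpu_bind is the one used in the Slurm 2.x series of that era):

```shell
# Run 4 tasks, at most one per socket, each bound to its socket's CPUs
srun --ntasks=4 --ntasks-per-socket=1 --cpu_bind=sockets ./my_app
```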
> [2013-02-19T17:38:59] agent/is_node_resp: node:lufer121 rpc:1008 : Can't
> find an address, check slurm.conf
This is the error. Either the controller does not know the machine, or
there is a problem with its address or communication (name resolution?).
Is lufer121 talking to the controller using
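The address in question comes from the node's entry in slurm.conf. A minimal sketch of such an entry (the address and CPU count are made up; only lufer121 comes from the thread):

```
# Explicit NodeAddr avoids depending on name resolution on the controller
NodeName=lufer121 NodeAddr=192.168.1.121 CPUs=8 State=UNKNOWN
```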
Dear Carles, and slurm-dev-list,
After further review of more detailed logs, I am convinced that this
issue is due to incorrect nfs4 settings or an nfs4 bug (such bugs are
reported on the RH, CentOS, and Fedora lists). I'm still working to resolve
this. Below are the additional notes if anyone has overc
Look at your slurmd log file on bengal30.
I would also recommend upgrading; Slurm v2.2 is two years old.
Quoting Mads Boye:
> Hi.
> I've got slurm-2.2.0 running, and until yesterday it worked fine.
> When it try to run my test script "test.sh":
>
> #!/bin/sh
>
> ### Job name
> #SBATCH --job-nam
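For comparison, a complete minimal script of the same shape would look like this (a sketch; the job name and body are made up, not recovered from the truncated quote above):

```shell
#!/bin/sh

### Job name
#SBATCH --job-name=test

# Something trivial, so success is easy to verify in the output file
echo "job started on $(hostname)"
```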
There is not much information in the worker's log:
[2013-02-19T17:34:11] slurmd version 2.4.1 started
[2013-02-19T17:34:11] debug3: finished daemonize
[2013-02-19T17:34:11] debug3: Trying to load plugin
/usr/lib64/slurm/switch_none.so
[2013-02-19T17:34:11] switch NONE plugin loaded
[2013-02-19T17:
Hi, have a look at the slurmd.log on the non-responding hosts. Does it
show any relevant information?
/David
On Tue, Feb 19, 2013 at 2:46 PM, Sefa Arslan wrote:
>
> Hello,
>
> A few minutes after I restart the SLURM service on the worker node, the
> controller says the worker node is not respo