Also make sure the script exits with status zero so the job can continue. Any
non-zero exit code is treated as an error condition.
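To illustrate the exit-code rule, here is a minimal sketch of a PrologSlurmctld script (the path, log location, and contents are assumptions for illustration, not the poster's actual script); it records the job variables somewhere the SlurmUser can write and then returns success:

```shell
# Write a hypothetical prolog script to a temp path and mark it executable.
cat > /tmp/prologslurmctld <<'EOF'
#!/bin/sh
# slurmctld exports SLURM_JOB_ID, SLURM_JOB_PARTITION, etc. for the job.
# Log them to a file the SlurmUser can write (assumed location).
echo "job ${SLURM_JOB_ID:-unset} partition ${SLURM_JOB_PARTITION:-unset}" \
    >> /tmp/prologslurmctld.log
# Exit 0 so the job is allowed to start; any non-zero status makes
# slurmctld log "prolog exit status N:0" and requeue the job.
exit 0
EOF
chmod +x /tmp/prologslurmctld

# Simulate the controller invoking it for a job:
SLURM_JOB_ID=362 SLURM_JOB_PARTITION=sec4000 /tmp/prologslurmctld
echo "prolog exit status: $?"
# → prolog exit status: 0
```

Running the script by hand like this (as the SlurmUser, on the controller node) is a quick way to see whether it is the source of the non-zero status in the log.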
Quoting Carles Fenoy <mini...@gmail.com>:
Hola Luis,
Have you tried executing the PrologSlurmctld command as the SlurmUser
on the controller node? Does it work?
Carles Fenoy
On Wed, Jan 25, 2012 at 2:10 PM, luis <luis.r...@uam.es> wrote:
Dear All:
I am trying to use PrologSlurmctld in Slurm. To do this, I have set the
following option in the configuration file:
PrologSlurmctld=/usr/local/etc/bin/prologoslurmctld
The problem is that when I submit a job to the queue system, the job
remains pending and never launches.
-bash-3.2$ /usr/local/slurm-2.3.2/bin/sbatch -p sec4000 --qos=sec4000
lanza09-2-b agua
Submitted batch job 364
-bash-3.2$ /usr/local/slurm-2.3.2/bin/squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
364 sec4000 lanza09- lfelipe PD 0:00 1 (None)
-bash-3.2$ /usr/local/slurm-2.3.2/bin/scontrol show job 364
JobId=364 Name=lanza09-2-b
UserId=lfelipe(907) GroupId=root(0)
Priority=10016 Account=cccuam QOS=sec4000
JobState=PENDING Reason=None Dependency=(null)
Requeue=1 Restarts=1 BatchFlag=1 ExitCode=0:0
RunTime=00:00:00 TimeLimit=5-00:00:00 TimeMin=N/A
SubmitTime=2012-01-25T10:17:26 EligibleTime=2012-01-25T10:17:36
StartTime=Unknown EndTime=Unknown
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=sec4000 AllocNode:Sid=terpsichore:2435
ReqNodeList=(null) ExcNodeList=(null)
NodeList=(null)
BatchHost=asterix2
NumNodes=1 NumCPUs=1-1 CPUs/Task=1 ReqS:C:T=*:*:*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) Gres=(null) Reservation=(null)
Shared=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/home/lfelipe/pruebas/slurm/gaussian/lanza09-2-b agua
WorkDir=/home/lfelipe/pruebas/slurm/gaussian
As you can see in slurmctld.log on the controller node:
.
.
.
[2012-01-25T09:09:52] debug2: node_did_resp calc1
[2012-01-25T09:09:52] debug2: node_did_resp calc2
[2012-01-25T09:09:52] debug2: node_did_resp calc3
[2012-01-25T09:09:52] debug2: node_did_resp calc4
[2012-01-25T09:09:52] debug2: node_did_resp calc5
[2012-01-25T09:09:52] debug2: node_did_resp calc6
[2012-01-25T09:09:52] debug2: node_did_resp calc7
[2012-01-25T09:09:52] debug2: node_did_resp calc8
[2012-01-25T09:09:52] debug2: node_did_resp calc9
[2012-01-25T09:09:52] debug2: node_did_resp calc10
[2012-01-25T09:09:52] debug2: node_did_resp calc11
[2012-01-25T09:09:52] debug2: node_did_resp calc15
[2012-01-25T09:09:52] debug2: node_did_resp calc16
[2012-01-25T09:09:52] debug2: node_did_resp calc17
[2012-01-25T09:09:52] debug2: node_did_resp calc18
[2012-01-25T09:09:52] debug2: node_did_resp calc19
[2012-01-25T09:09:52] debug2: node_did_resp calc20
[2012-01-25T09:09:52] debug2: node_did_resp calc21
[2012-01-25T09:09:52] debug2: node_did_resp calc22
[2012-01-25T09:09:52] debug2: node_did_resp calc23
[2012-01-25T09:09:52] debug2: node_did_resp calc24
[2012-01-25T09:09:54] debug2: Testing job time limits and checkpoints
[2012-01-25T09:10:24] debug2: Testing job time limits and checkpoints
[2012-01-25T09:10:24] debug2: Performing purge of old job records
[2012-01-25T09:10:24] debug2: purge_old_job: purged 1 old job records
[2012-01-25T09:10:24] debug: sched: Running job scheduler
[2012-01-25T09:10:49] debug: backfill: no jobs to backfill
[2012-01-25T09:10:54] debug2: Testing job time limits and checkpoints
[2012-01-25T09:11:05] debug2: Processing RPC: REQUEST_SUBMIT_BATCH_JOB from
uid=907
[2012-01-25T09:11:05] debug2: initial priority for job 362 is 10016
[2012-01-25T09:11:05] debug2: found 2 usable nodes from config containing
asterix[2-3]
[2012-01-25T09:11:05] debug2: sched: JobId=362 allocated resources:
NodeList=(null)
[2012-01-25T09:11:05] _slurm_rpc_submit_batch_job JobId=362 usec=538
[2012-01-25T09:11:05] debug: sched: Running job scheduler
[2012-01-25T09:11:05] debug2: found 2 usable nodes from config containing
asterix[2-3]
[2012-01-25T09:11:05] sched: Allocate JobId=362 NodeList=asterix2 #CPUs=1
[2012-01-25T09:11:05] error: prolog_slurmctld job 362 prolog exit status 1:0
[2012-01-25T09:11:05] debug2: Spawning RPC agent for msg_type 6011
[2012-01-25T09:11:05] error: slurm_jobcomp plugin context not initialized
[2012-01-25T09:11:05] debug2: got 1 threads to send out
[2012-01-25T09:11:05] debug2: Tree head got back 0 looking for 1
[2012-01-25T09:11:05] debug2: Tree head got back 1
[2012-01-25T09:11:05] debug2: Tree head got them all
[2012-01-25T09:11:05] requeue batch job 362
[2012-01-25T09:11:05] debug2: node_did_resp asterix2
[2012-01-25T09:11:05] debug: sched: Running job scheduler
[2012-01-25T09:11:19] debug: backfill: no jobs to backfill
[2012-01-25T09:11:24] debug2: Testing job time limits and checkpoints
[2012-01-25T09:11:24] debug2: Performing purge of old job records
[2012-01-25T09:11:24] debug: sched: Running job scheduler
[2012-01-25T09:11:24] debug2: found 2 usable nodes from config containing
asterix[2-3]
[2012-01-25T09:11:24] sched: Allocate JobId=362 NodeList=asterix2 #CPUs=1
[2012-01-25T09:11:24] debug2: Performing full system state save
[2012-01-25T09:11:24] error: prolog_slurmctld job 362 prolog exit status 1:0
[2012-01-25T09:11:24] prolog_slurmctld failed again for job 362
[2012-01-25T09:11:24] debug2: Spawning RPC agent for msg_type 4022
[2012-01-25T09:11:24] debug2: got 1 threads to send out
[2012-01-25T09:11:24] debug2: Spawning RPC agent for msg_type 6011
[2012-01-25T09:11:24] error: slurm_jobcomp plugin context not initialized
[2012-01-25T09:11:24] job_signal 9 of running job 362 successful
[2012-01-25T09:11:24] debug2: got 1 threads to send out
[2012-01-25T09:11:24] debug2: Tree head got back 0 looking for 1
[2012-01-25T09:11:24] debug2: Tree head got back 1
[2012-01-25T09:11:24] debug2: Tree head got them all
[2012-01-25T09:11:24] debug2: node_did_resp asterix2
[2012-01-25T09:11:24] debug: sched: Running job scheduler
[2012-01-25T09:11:24] debug2: node_did_resp asterix2
[2012-01-25T09:11:49] debug: backfill: no jobs to backfill
[2012-01-25T09:11:54] debug2: Testing job time limits and checkpoints
As you can see, the relevant errors are:
[2012-01-25T09:11:24] error: prolog_slurmctld job 362 prolog exit status 1:0
[2012-01-25T09:11:24] prolog_slurmctld failed again for job 362
On the client node, slurmd.log shows:
.
.
.
[2012-01-25T09:09:48] got reconfigure request
[2012-01-25T09:11:05] debug: _rpc_terminate_job, uid = 106
[2012-01-25T09:11:05] debug: task_slurmd_release_resources: 362
[2012-01-25T09:11:05] debug: credential for job 362 revoked
[2012-01-25T09:11:24] debug: _rpc_job_notify, uid = 106, jobid = 362
[2012-01-25T09:11:24] debug: _rpc_terminate_job, uid = 106
[2012-01-25T09:11:24] debug: task_slurmd_release_resources: 362
[2012-01-25T09:11:24] debug: job 362 requeued, but started no tasks
[2012-01-25T09:11:24] debug: credential for job 362 revoked
Note: the script I am using prints the SLURM_JOB_PARTITION and
SLURM_JOB_ID variables. I have also set the "Prolog" option, and that
one works fine.
Sincerely,
Luis Felipe Ruiz Nieto
--
Carles Fenoy