Dear All:
I am trying to use PrologSlurmctld with SLURM. To do this, I have set the following variable in the configuration file:

PrologSlurmctld=/usr/local/etc/bin/prologoslurmctld

The problem is that when I submit a job to the queue system, the job remains pending and is never launched.

-bash-3.2$ /usr/local/slurm-2.3.2/bin/sbatch -p sec4000 --qos=sec4000 lanza09-2-b agua
Submitted batch job 364

-bash-3.2$ /usr/local/slurm-2.3.2/bin/squeue
JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
  364   sec4000 lanza09-  lfelipe PD  0:00     1 (None)

-bash-3.2$ /usr/local/slurm-2.3.2/bin/scontrol show job 364
JobId=364 Name=lanza09-2-b
   UserId=lfelipe(907) GroupId=root(0)
   Priority=10016 Account=cccuam QOS=sec4000
   JobState=PENDING Reason=None Dependency=(null)
   Requeue=1 Restarts=1 BatchFlag=1 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=5-00:00:00 TimeMin=N/A
   SubmitTime=2012-01-25T10:17:26 EligibleTime=2012-01-25T10:17:36
   StartTime=Unknown EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=sec4000 AllocNode:Sid=terpsichore:2435
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   BatchHost=asterix2
   NumNodes=1 NumCPUs=1-1 CPUs/Task=1 ReqS:C:T=*:*:*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/lfelipe/pruebas/slurm/gaussian/lanza09-2-b agua
   WorkDir=/home/lfelipe/pruebas/slurm/gaussian

This is what the controller log shows.

Slurmctld.log:
. . .
[2012-01-25T09:09:52] debug2: node_did_resp calc1
[2012-01-25T09:09:52] debug2: node_did_resp calc2
[2012-01-25T09:09:52] debug2: node_did_resp calc3
[2012-01-25T09:09:52] debug2: node_did_resp calc4
[2012-01-25T09:09:52] debug2: node_did_resp calc5
[2012-01-25T09:09:52] debug2: node_did_resp calc6
[2012-01-25T09:09:52] debug2: node_did_resp calc7
[2012-01-25T09:09:52] debug2: node_did_resp calc8
[2012-01-25T09:09:52] debug2: node_did_resp calc9
[2012-01-25T09:09:52] debug2: node_did_resp calc10
[2012-01-25T09:09:52] debug2: node_did_resp calc11
[2012-01-25T09:09:52] debug2: node_did_resp calc15
[2012-01-25T09:09:52] debug2: node_did_resp calc16
[2012-01-25T09:09:52] debug2: node_did_resp calc17
[2012-01-25T09:09:52] debug2: node_did_resp calc18
[2012-01-25T09:09:52] debug2: node_did_resp calc19
[2012-01-25T09:09:52] debug2: node_did_resp calc20
[2012-01-25T09:09:52] debug2: node_did_resp calc21
[2012-01-25T09:09:52] debug2: node_did_resp calc22
[2012-01-25T09:09:52] debug2: node_did_resp calc23
[2012-01-25T09:09:52] debug2: node_did_resp calc24
[2012-01-25T09:09:54] debug2: Testing job time limits and checkpoints
[2012-01-25T09:10:24] debug2: Testing job time limits and checkpoints
[2012-01-25T09:10:24] debug2: Performing purge of old job records
[2012-01-25T09:10:24] debug2: purge_old_job: purged 1 old job records
[2012-01-25T09:10:24] debug: sched: Running job scheduler
[2012-01-25T09:10:49] debug: backfill: no jobs to backfill
[2012-01-25T09:10:54] debug2: Testing job time limits and checkpoints
[2012-01-25T09:11:05] debug2: Processing RPC: REQUEST_SUBMIT_BATCH_JOB from uid=907
[2012-01-25T09:11:05] debug2: initial priority for job 362 is 10016
[2012-01-25T09:11:05] debug2: found 2 usable nodes from config containing asterix[2-3]
[2012-01-25T09:11:05] debug2: sched: JobId=362 allocated resources: NodeList=(null)
[2012-01-25T09:11:05] _slurm_rpc_submit_batch_job JobId=362 usec=538
[2012-01-25T09:11:05] debug: sched: Running job scheduler
[2012-01-25T09:11:05] debug2: found 2 usable nodes from config containing asterix[2-3]
[2012-01-25T09:11:05] sched: Allocate JobId=362 NodeList=asterix2 #CPUs=1
[2012-01-25T09:11:05] error: prolog_slurmctld job 362 prolog exit status 1:0
[2012-01-25T09:11:05] debug2: Spawning RPC agent for msg_type 6011
[2012-01-25T09:11:05] error: slurm_jobcomp plugin context not initialized
[2012-01-25T09:11:05] debug2: got 1 threads to send out
[2012-01-25T09:11:05] debug2: Tree head got back 0 looking for 1
[2012-01-25T09:11:05] debug2: Tree head got back 1
[2012-01-25T09:11:05] debug2: Tree head got them all
[2012-01-25T09:11:05] requeue batch job 362
[2012-01-25T09:11:05] debug2: node_did_resp asterix2
[2012-01-25T09:11:05] debug: sched: Running job scheduler
[2012-01-25T09:11:19] debug: backfill: no jobs to backfill
[2012-01-25T09:11:24] debug2: Testing job time limits and checkpoints
[2012-01-25T09:11:24] debug2: Performing purge of old job records
[2012-01-25T09:11:24] debug: sched: Running job scheduler
[2012-01-25T09:11:24] debug2: found 2 usable nodes from config containing asterix[2-3]
[2012-01-25T09:11:24] sched: Allocate JobId=362 NodeList=asterix2 #CPUs=1
[2012-01-25T09:11:24] debug2: Performing full system state save
[2012-01-25T09:11:24] error: prolog_slurmctld job 362 prolog exit status 1:0
[2012-01-25T09:11:24] prolog_slurmctld failed again for job 362
[2012-01-25T09:11:24] debug2: Spawning RPC agent for msg_type 4022
[2012-01-25T09:11:24] debug2: got 1 threads to send out
[2012-01-25T09:11:24] debug2: Spawning RPC agent for msg_type 6011
[2012-01-25T09:11:24] error: slurm_jobcomp plugin context not initialized
[2012-01-25T09:11:24] job_signal 9 of running job 362 successful
[2012-01-25T09:11:24] debug2: got 1 threads to send out
[2012-01-25T09:11:24] debug2: Tree head got back 0 looking for 1
[2012-01-25T09:11:24] debug2: Tree head got back 1
[2012-01-25T09:11:24] debug2: Tree head got them all
[2012-01-25T09:11:24] debug2: node_did_resp asterix2
[2012-01-25T09:11:24] debug: sched: Running job scheduler
[2012-01-25T09:11:24] debug2: node_did_resp asterix2
[2012-01-25T09:11:49] debug: backfill: no jobs to backfill
[2012-01-25T09:11:54] debug2: Testing job time limits and checkpoints

As you can see, the relevant lines are:

[2012-01-25T09:11:24] error: prolog_slurmctld job 362 prolog exit status 1:0
[2012-01-25T09:11:24] prolog_slurmctld failed again for job 362

On the client node:

slurmd.log:
. . .
[2012-01-25T09:09:48] got reconfigure request
[2012-01-25T09:11:05] debug: _rpc_terminate_job, uid = 106
[2012-01-25T09:11:05] debug: task_slurmd_release_resources: 362
[2012-01-25T09:11:05] debug: credential for job 362 revoked
[2012-01-25T09:11:24] debug: _rpc_job_notify, uid = 106, jobid = 362
[2012-01-25T09:11:24] debug: _rpc_terminate_job, uid = 106
[2012-01-25T09:11:24] debug: task_slurmd_release_resources: 362
[2012-01-25T09:11:24] debug: job 362 requeued, but started no tasks
[2012-01-25T09:11:24] debug: credential for job 362 revoked

Note: the script I am using is set up to print the SLURM_JOB_PARTITION and SLURM_JOB_ID variables. I have also set the "Prolog" option, and that one worked correctly.

Sincerely,
Luis Felipe Ruiz Nieto
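
For reference, a prolog script of the kind described in the note above can be sketched as follows. This is a minimal sketch under stated assumptions, not the script actually in use: the function name prolog_main, the PROLOG_LOG variable and the default log path /tmp/prologslurmctld.log are all hypothetical. The one point it illustrates is that the script ends with exit status 0, since the "prolog exit status 1:0" lines in the log show the controller treating a non-zero status as a failure and requeueing the job.

```shell
#!/bin/sh
# Hypothetical PrologSlurmctld sketch: record the job id and partition,
# then finish with exit status 0 so slurmctld does not see a failure.

prolog_main() {
    # PROLOG_LOG and the default path below are assumptions for illustration.
    logfile="${PROLOG_LOG:-/tmp/prologslurmctld.log}"

    # ${VAR:-unset} keeps the line well-formed even if slurmctld did not
    # export the variable; "|| true" ignores a failed write to the log.
    printf '%s job=%s partition=%s\n' \
        "$(date '+%Y-%m-%dT%H:%M:%S')" \
        "${SLURM_JOB_ID:-unset}" \
        "${SLURM_JOB_PARTITION:-unset}" >> "$logfile" 2>/dev/null || true

    # Always report success to the controller.
    return 0
}

# The script's exit status is the status of this last command.
prolog_main
```

Running the script by hand as the slurmctld user and then checking "echo $?" is a quick way to see which status the controller would receive.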