Dear All:

I am trying to use PrologSlurmctld in SLURM. To do this, I have set the 
following option in the configuration file:



PrologSlurmctld=/usr/local/etc/bin/prologoslurmctld



The problem is that when I submit a job to the queue, the job remains 
pending and never launches.



-bash-3.2$ /usr/local/slurm-2.3.2/bin/sbatch -p sec4000 --qos=sec4000 lanza09-2-b agua
Submitted batch job 364


-bash-3.2$ /usr/local/slurm-2.3.2/bin/squeue 
  JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
    364   sec4000 lanza09-  lfelipe  PD       0:00      1 (None)


-bash-3.2$ /usr/local/slurm-2.3.2/bin/scontrol show job 364
JobId=364 Name=lanza09-2-b
   UserId=lfelipe(907) GroupId=root(0)
   Priority=10016 Account=cccuam QOS=sec4000
   JobState=PENDING Reason=None Dependency=(null)
   Requeue=1 Restarts=1 BatchFlag=1 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=5-00:00:00 TimeMin=N/A
   SubmitTime=2012-01-25T10:17:26 EligibleTime=2012-01-25T10:17:36
   StartTime=Unknown EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=sec4000 AllocNode:Sid=terpsichore:2435
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   BatchHost=asterix2
   NumNodes=1 NumCPUs=1-1 CPUs/Task=1 ReqS:C:T=*:*:*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/lfelipe/pruebas/slurm/gaussian/lanza09-2-b agua
   WorkDir=/home/lfelipe/pruebas/slurm/gaussian



This is what the logs show.

slurmctld.log:
.
.
.
[2012-01-25T09:09:52] debug2: node_did_resp calc1
[2012-01-25T09:09:52] debug2: node_did_resp calc2
[2012-01-25T09:09:52] debug2: node_did_resp calc3
[2012-01-25T09:09:52] debug2: node_did_resp calc4
[2012-01-25T09:09:52] debug2: node_did_resp calc5
[2012-01-25T09:09:52] debug2: node_did_resp calc6
[2012-01-25T09:09:52] debug2: node_did_resp calc7
[2012-01-25T09:09:52] debug2: node_did_resp calc8
[2012-01-25T09:09:52] debug2: node_did_resp calc9
[2012-01-25T09:09:52] debug2: node_did_resp calc10
[2012-01-25T09:09:52] debug2: node_did_resp calc11
[2012-01-25T09:09:52] debug2: node_did_resp calc15
[2012-01-25T09:09:52] debug2: node_did_resp calc16
[2012-01-25T09:09:52] debug2: node_did_resp calc17
[2012-01-25T09:09:52] debug2: node_did_resp calc18
[2012-01-25T09:09:52] debug2: node_did_resp calc19
[2012-01-25T09:09:52] debug2: node_did_resp calc20
[2012-01-25T09:09:52] debug2: node_did_resp calc21
[2012-01-25T09:09:52] debug2: node_did_resp calc22
[2012-01-25T09:09:52] debug2: node_did_resp calc23
[2012-01-25T09:09:52] debug2: node_did_resp calc24
[2012-01-25T09:09:54] debug2: Testing job time limits and checkpoints
[2012-01-25T09:10:24] debug2: Testing job time limits and checkpoints
[2012-01-25T09:10:24] debug2: Performing purge of old job records
[2012-01-25T09:10:24] debug2: purge_old_job: purged 1 old job records
[2012-01-25T09:10:24] debug:  sched: Running job scheduler
[2012-01-25T09:10:49] debug:  backfill: no jobs to backfill
[2012-01-25T09:10:54] debug2: Testing job time limits and checkpoints
[2012-01-25T09:11:05] debug2: Processing RPC: REQUEST_SUBMIT_BATCH_JOB from uid=907
[2012-01-25T09:11:05] debug2: initial priority for job 362 is 10016
[2012-01-25T09:11:05] debug2: found 2 usable nodes from config containing asterix[2-3]
[2012-01-25T09:11:05] debug2: sched: JobId=362 allocated resources: NodeList=(null)
[2012-01-25T09:11:05] _slurm_rpc_submit_batch_job JobId=362 usec=538
[2012-01-25T09:11:05] debug:  sched: Running job scheduler
[2012-01-25T09:11:05] debug2: found 2 usable nodes from config containing asterix[2-3]
[2012-01-25T09:11:05] sched: Allocate JobId=362 NodeList=asterix2 #CPUs=1
[2012-01-25T09:11:05] error: prolog_slurmctld job 362 prolog exit status 1:0
[2012-01-25T09:11:05] debug2: Spawning RPC agent for msg_type 6011
[2012-01-25T09:11:05] error: slurm_jobcomp plugin context not initialized
[2012-01-25T09:11:05] debug2: got 1 threads to send out
[2012-01-25T09:11:05] debug2: Tree head got back 0 looking for 1
[2012-01-25T09:11:05] debug2: Tree head got back 1
[2012-01-25T09:11:05] debug2: Tree head got them all
[2012-01-25T09:11:05] requeue batch job 362
[2012-01-25T09:11:05] debug2: node_did_resp asterix2
[2012-01-25T09:11:05] debug:  sched: Running job scheduler
[2012-01-25T09:11:19] debug:  backfill: no jobs to backfill
[2012-01-25T09:11:24] debug2: Testing job time limits and checkpoints
[2012-01-25T09:11:24] debug2: Performing purge of old job records
[2012-01-25T09:11:24] debug:  sched: Running job scheduler
[2012-01-25T09:11:24] debug2: found 2 usable nodes from config containing asterix[2-3]
[2012-01-25T09:11:24] sched: Allocate JobId=362 NodeList=asterix2 #CPUs=1
[2012-01-25T09:11:24] debug2: Performing full system state save
[2012-01-25T09:11:24] error: prolog_slurmctld job 362 prolog exit status 1:0
[2012-01-25T09:11:24] prolog_slurmctld failed again for job 362
[2012-01-25T09:11:24] debug2: Spawning RPC agent for msg_type 4022
[2012-01-25T09:11:24] debug2: got 1 threads to send out
[2012-01-25T09:11:24] debug2: Spawning RPC agent for msg_type 6011
[2012-01-25T09:11:24] error: slurm_jobcomp plugin context not initialized
[2012-01-25T09:11:24] job_signal 9 of running job 362 successful
[2012-01-25T09:11:24] debug2: got 1 threads to send out
[2012-01-25T09:11:24] debug2: Tree head got back 0 looking for 1
[2012-01-25T09:11:24] debug2: Tree head got back 1
[2012-01-25T09:11:24] debug2: Tree head got them all
[2012-01-25T09:11:24] debug2: node_did_resp asterix2
[2012-01-25T09:11:24] debug:  sched: Running job scheduler
[2012-01-25T09:11:24] debug2: node_did_resp asterix2
[2012-01-25T09:11:49] debug:  backfill: no jobs to backfill
[2012-01-25T09:11:54] debug2: Testing job time limits and checkpoints

The relevant lines are:

[2012-01-25T09:11:24] error: prolog_slurmctld job 362 prolog exit status 1:0
[2012-01-25T09:11:24] prolog_slurmctld failed again for job 362
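
The "exit status 1:0" above means the script returned exit code 1. A minimal sketch of how that could be reproduced by hand, outside slurmctld, follows. The function name check_prolog is hypothetical, and the job values are simply copied from the squeue output above. Since slurmctld runs the script as SlurmUser, the check only makes sense when run as that user.

```shell
#!/bin/bash
# Hypothetical helper: run a command with the two job variables set,
# then report the exit status it returned.
check_prolog() {
    local status=0
    # Values copied from the job shown earlier; they are only placeholders.
    env SLURM_JOB_ID=364 SLURM_JOB_PARTITION=sec4000 "$@" || status=$?
    echo "exit status: $status"
    return "$status"
}
```

It would be called as `check_prolog /usr/local/etc/bin/prologoslurmctld`; any status other than 0 is what slurmctld treats as a prolog failure.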



On the client node:



slurmd.log:

.
.
.
[2012-01-25T09:09:48] got reconfigure request
[2012-01-25T09:11:05] debug:  _rpc_terminate_job, uid = 106
[2012-01-25T09:11:05] debug:  task_slurmd_release_resources: 362
[2012-01-25T09:11:05] debug:  credential for job 362 revoked
[2012-01-25T09:11:24] debug:  _rpc_job_notify, uid = 106, jobid = 362
[2012-01-25T09:11:24] debug:  _rpc_terminate_job, uid = 106
[2012-01-25T09:11:24] debug:  task_slurmd_release_resources: 362
[2012-01-25T09:11:24] debug:  job 362 requeued, but started no tasks
[2012-01-25T09:11:24] debug:  credential for job 362 revoked

Note: The script I am using only prints the SLURM_JOB_PARTITION and 
SLURM_JOB_ID variables. I have also set the "Prolog" option, and that one 
works fine.
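
For reference, here is a minimal sketch of a script of this kind. It is not the exact file in use: the LOGFILE path and the function name log_job_info are assumptions made for illustration.

```shell
#!/bin/bash
# Sketch of a PrologSlurmctld-style script that prints two job variables.

# LOGFILE is an assumed location, not taken from the real setup.
LOGFILE="${LOGFILE:-/tmp/prologslurmctld.log}"

log_job_info() {
    # Fall back to "unset" so a line is still printed if a variable is missing.
    echo "partition=${SLURM_JOB_PARTITION:-unset} jobid=${SLURM_JOB_ID:-unset}"
}

# "|| true" keeps the exit status at 0 even if the log file cannot be written,
# so a logging problem alone would not be reported as a prolog failure.
log_job_info >> "$LOGFILE" 2>&1 || true
```

With the variables set as in the job above, log_job_info prints "partition=sec4000 jobid=364".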
 



Sincerely,

Luis Felipe Ruiz Nieto

