Hola Luis,

Have you tried running your PrologSlurmctld script by hand as the SlurmUser on the controller node? Does it work?
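For example, something like this (only a sketch: I am assuming your SlurmUser is "slurm", and the path is the one from your slurm.conf; slurmctld normally exports variables such as SLURM_JOB_ID and SLURM_JOB_PARTITION for this script, so fake them for the test):

    sudo -u slurm env SLURM_JOB_ID=999 SLURM_JOB_PARTITION=sec4000 \
        /usr/local/etc/bin/prologoslurmctld
    echo "exit status: $?"

If it prints anything other than 0, that matches the "prolog exit status 1:0" errors in your slurmctld.log (see also the minimal example at the end of this mail).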
Carles Fenoy

On Wed, Jan 25, 2012 at 2:10 PM, luis <luis.r...@uam.es> wrote:
> Dear All,
>
> I am trying to use PrologSlurmctld in Slurm. To do this I have set the
> following option in the configuration file:
>
> PrologSlurmctld=/usr/local/etc/bin/prologoslurmctld
>
> The problem is that when I submit a job, it stays pending and is never
> launched:
>
> -bash-3.2$ /usr/local/slurm-2.3.2/bin/sbatch -p sec4000 --qos=sec4000 lanza09-2-b agua
> Submitted batch job 364
>
> -bash-3.2$ /usr/local/slurm-2.3.2/bin/squeue
>  JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
>    364   sec4000 lanza09-  lfelipe PD  0:00     1 (None)
>
> -bash-3.2$ /usr/local/slurm-2.3.2/bin/scontrol show job 364
> JobId=364 Name=lanza09-2-b
> UserId=lfelipe(907) GroupId=root(0)
> Priority=10016 Account=cccuam QOS=sec4000
> JobState=PENDING Reason=None Dependency=(null)
> Requeue=1 Restarts=1 BatchFlag=1 ExitCode=0:0
> RunTime=00:00:00 TimeLimit=5-00:00:00 TimeMin=N/A
> SubmitTime=2012-01-25T10:17:26 EligibleTime=2012-01-25T10:17:36
> StartTime=Unknown EndTime=Unknown
> PreemptTime=None SuspendTime=None SecsPreSuspend=0
> Partition=sec4000 AllocNode:Sid=terpsichore:2435
> ReqNodeList=(null) ExcNodeList=(null)
> NodeList=(null)
> BatchHost=asterix2
> NumNodes=1 NumCPUs=1-1 CPUs/Task=1 ReqS:C:T=*:*:*
> MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
> Features=(null) Gres=(null) Reservation=(null)
> Shared=OK Contiguous=0 Licenses=(null) Network=(null)
> Command=/home/lfelipe/pruebas/slurm/gaussian/lanza09-2-b agua
> WorkDir=/home/lfelipe/pruebas/slurm/gaussian
>
> In slurmctld.log you can see:
>
> [...]
> [2012-01-25T09:09:52] debug2: node_did_resp calc1
> [2012-01-25T09:09:52] debug2: node_did_resp calc2
> [2012-01-25T09:09:52] debug2: node_did_resp calc3
> [2012-01-25T09:09:52] debug2: node_did_resp calc4
> [2012-01-25T09:09:52] debug2: node_did_resp calc5
> [2012-01-25T09:09:52] debug2: node_did_resp calc6
> [2012-01-25T09:09:52] debug2: node_did_resp calc7
> [2012-01-25T09:09:52] debug2: node_did_resp calc8
> [2012-01-25T09:09:52] debug2: node_did_resp calc9
> [2012-01-25T09:09:52] debug2: node_did_resp calc10
> [2012-01-25T09:09:52] debug2: node_did_resp calc11
> [2012-01-25T09:09:52] debug2: node_did_resp calc15
> [2012-01-25T09:09:52] debug2: node_did_resp calc16
> [2012-01-25T09:09:52] debug2: node_did_resp calc17
> [2012-01-25T09:09:52] debug2: node_did_resp calc18
> [2012-01-25T09:09:52] debug2: node_did_resp calc19
> [2012-01-25T09:09:52] debug2: node_did_resp calc20
> [2012-01-25T09:09:52] debug2: node_did_resp calc21
> [2012-01-25T09:09:52] debug2: node_did_resp calc22
> [2012-01-25T09:09:52] debug2: node_did_resp calc23
> [2012-01-25T09:09:52] debug2: node_did_resp calc24
> [2012-01-25T09:09:54] debug2: Testing job time limits and checkpoints
> [2012-01-25T09:10:24] debug2: Testing job time limits and checkpoints
> [2012-01-25T09:10:24] debug2: Performing purge of old job records
> [2012-01-25T09:10:24] debug2: purge_old_job: purged 1 old job records
> [2012-01-25T09:10:24] debug: sched: Running job scheduler
> [2012-01-25T09:10:49] debug: backfill: no jobs to backfill
> [2012-01-25T09:10:54] debug2: Testing job time limits and checkpoints
> [2012-01-25T09:11:05] debug2: Processing RPC: REQUEST_SUBMIT_BATCH_JOB from uid=907
> [2012-01-25T09:11:05] debug2: initial priority for job 362 is 10016
> [2012-01-25T09:11:05] debug2: found 2 usable nodes from config containing asterix[2-3]
> [2012-01-25T09:11:05] debug2: sched: JobId=362 allocated resources: NodeList=(null)
> [2012-01-25T09:11:05] _slurm_rpc_submit_batch_job JobId=362 usec=538
> [2012-01-25T09:11:05] debug: sched: Running job scheduler
> [2012-01-25T09:11:05] debug2: found 2 usable nodes from config containing asterix[2-3]
> [2012-01-25T09:11:05] sched: Allocate JobId=362 NodeList=asterix2 #CPUs=1
> [2012-01-25T09:11:05] error: prolog_slurmctld job 362 prolog exit status 1:0
> [2012-01-25T09:11:05] debug2: Spawning RPC agent for msg_type 6011
> [2012-01-25T09:11:05] error: slurm_jobcomp plugin context not initialized
> [2012-01-25T09:11:05] debug2: got 1 threads to send out
> [2012-01-25T09:11:05] debug2: Tree head got back 0 looking for 1
> [2012-01-25T09:11:05] debug2: Tree head got back 1
> [2012-01-25T09:11:05] debug2: Tree head got them all
> [2012-01-25T09:11:05] requeue batch job 362
> [2012-01-25T09:11:05] debug2: node_did_resp asterix2
> [2012-01-25T09:11:05] debug: sched: Running job scheduler
> [2012-01-25T09:11:19] debug: backfill: no jobs to backfill
> [2012-01-25T09:11:24] debug2: Testing job time limits and checkpoints
> [2012-01-25T09:11:24] debug2: Performing purge of old job records
> [2012-01-25T09:11:24] debug: sched: Running job scheduler
> [2012-01-25T09:11:24] debug2: found 2 usable nodes from config containing asterix[2-3]
> [2012-01-25T09:11:24] sched: Allocate JobId=362 NodeList=asterix2 #CPUs=1
> [2012-01-25T09:11:24] debug2: Performing full system state save
> [2012-01-25T09:11:24] error: prolog_slurmctld job 362 prolog exit status 1:0
> [2012-01-25T09:11:24] prolog_slurmctld failed again for job 362
> [2012-01-25T09:11:24] debug2: Spawning RPC agent for msg_type 4022
> [2012-01-25T09:11:24] debug2: got 1 threads to send out
> [2012-01-25T09:11:24] debug2: Spawning RPC agent for msg_type 6011
> [2012-01-25T09:11:24] error: slurm_jobcomp plugin context not initialized
> [2012-01-25T09:11:24] job_signal 9 of running job 362 successful
> [2012-01-25T09:11:24] debug2: got 1 threads to send out
> [2012-01-25T09:11:24] debug2: Tree head got back 0 looking for 1
> [2012-01-25T09:11:24] debug2: Tree head got back 1
> [2012-01-25T09:11:24] debug2: Tree head got them all
> [2012-01-25T09:11:24] debug2: node_did_resp asterix2
> [2012-01-25T09:11:24] debug: sched: Running job scheduler
> [2012-01-25T09:11:24] debug2: node_did_resp asterix2
> [2012-01-25T09:11:49] debug: backfill: no jobs to backfill
> [2012-01-25T09:11:54] debug2: Testing job time limits and checkpoints
>
> As you can see:
>
> [2012-01-25T09:11:24] error: prolog_slurmctld job 362 prolog exit status 1:0
> [2012-01-25T09:11:24] prolog_slurmctld failed again for job 362
>
> On the compute node, in slurmd.log:
>
> [...]
> [2012-01-25T09:09:48] got reconfigure request
> [2012-01-25T09:11:05] debug: _rpc_terminate_job, uid = 106
> [2012-01-25T09:11:05] debug: task_slurmd_release_resources: 362
> [2012-01-25T09:11:05] debug: credential for job 362 revoked
> [2012-01-25T09:11:24] debug: _rpc_job_notify, uid = 106, jobid = 362
> [2012-01-25T09:11:24] debug: _rpc_terminate_job, uid = 106
> [2012-01-25T09:11:24] debug: task_slurmd_release_resources: 362
> [2012-01-25T09:11:24] debug: job 362 requeued, but started no tasks
> [2012-01-25T09:11:24] debug: credential for job 362 revoked
>
> Note: the script I am using prints the SLURM_JOB_PARTITION and
> SLURM_JOB_ID variables.
> I have also set the "Prolog" option, and that one works fine.
>
> Sincerely,
>
> Luis Felipe Ruiz Nieto

--
Carles Fenoy
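PS: for reference, a minimal PrologSlurmctld script that prints those variables and still lets the job start could look like the sketch below. This is only an illustration, not your actual script; I am assuming bash, and the log path is made up:

    #!/bin/bash
    # slurmctld runs this script on the controller node as SlurmUser and
    # exports job information such as SLURM_JOB_ID and SLURM_JOB_PARTITION.
    echo "$(date) job=${SLURM_JOB_ID} partition=${SLURM_JOB_PARTITION}" >> /tmp/prologslurmctld.log
    # Any nonzero exit status makes slurmctld treat the prolog as failed and
    # requeue the job (the "prolog exit status 1:0" errors in your log), so
    # exit 0 explicitly on success.
    exit 0

Also check that the script is executable and readable by the SlurmUser on the controller node.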