Hola Luis,

Have you tried running your PrologSlurmctld script by hand as the SlurmUser on the controller node? Does it work?
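For example, something like this (only a sketch: I am assuming your SlurmUser is "slurm", and the path is the one from your slurm.conf; slurmctld normally exports variables such as SLURM_JOB_ID and SLURM_JOB_PARTITION for this script, so fake them for the test):

    sudo -u slurm env SLURM_JOB_ID=999 SLURM_JOB_PARTITION=sec4000 \
        /usr/local/etc/bin/prologoslurmctld
    echo "exit status: $?"

If it prints anything other than 0, that matches the "prolog exit status 1:0" errors in your slurmctld.log (see also the minimal example at the end of this mail).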
Carles Fenoy

On Wed, Jan 25, 2012 at 2:10 PM, luis <luis.r...@uam.es> wrote:
> Dear All,
>
> I am trying to use PrologSlurmctld in Slurm. To do this I have set the
> following option in the configuration file:
>
> PrologSlurmctld=/usr/local/etc/bin/prologoslurmctld
>
> The problem is that when I submit a job, it stays pending and is never
> launched:
>
> -bash-3.2$ /usr/local/slurm-2.3.2/bin/sbatch -p sec4000 --qos=sec4000 lanza09-2-b agua
> Submitted batch job 364
>
> -bash-3.2$ /usr/local/slurm-2.3.2/bin/squeue
>  JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
>    364   sec4000 lanza09-  lfelipe PD  0:00     1 (None)
>
> -bash-3.2$ /usr/local/slurm-2.3.2/bin/scontrol show job 364
> JobId=364 Name=lanza09-2-b
> UserId=lfelipe(907) GroupId=root(0)
> Priority=10016 Account=cccuam QOS=sec4000
> JobState=PENDING Reason=None Dependency=(null)
> Requeue=1 Restarts=1 BatchFlag=1 ExitCode=0:0
> RunTime=00:00:00 TimeLimit=5-00:00:00 TimeMin=N/A
> SubmitTime=2012-01-25T10:17:26 EligibleTime=2012-01-25T10:17:36
> StartTime=Unknown EndTime=Unknown
> PreemptTime=None SuspendTime=None SecsPreSuspend=0
> Partition=sec4000 AllocNode:Sid=terpsichore:2435
> ReqNodeList=(null) ExcNodeList=(null)
> NodeList=(null)
> BatchHost=asterix2
> NumNodes=1 NumCPUs=1-1 CPUs/Task=1 ReqS:C:T=*:*:*
> MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
> Features=(null) Gres=(null) Reservation=(null)
> Shared=OK Contiguous=0 Licenses=(null) Network=(null)
> Command=/home/lfelipe/pruebas/slurm/gaussian/lanza09-2-b agua
> WorkDir=/home/lfelipe/pruebas/slurm/gaussian
>
> In slurmctld.log you can see:
>
> [...]
> [2012-01-25T09:09:52] debug2: node_did_resp calc1
> [2012-01-25T09:09:52] debug2: node_did_resp calc2
> [2012-01-25T09:09:52] debug2: node_did_resp calc3
> [2012-01-25T09:09:52] debug2: node_did_resp calc4
> [2012-01-25T09:09:52] debug2: node_did_resp calc5
> [2012-01-25T09:09:52] debug2: node_did_resp calc6
> [2012-01-25T09:09:52] debug2: node_did_resp calc7
> [2012-01-25T09:09:52] debug2: node_did_resp calc8
> [2012-01-25T09:09:52] debug2: node_did_resp calc9
> [2012-01-25T09:09:52] debug2: node_did_resp calc10
> [2012-01-25T09:09:52] debug2: node_did_resp calc11
> [2012-01-25T09:09:52] debug2: node_did_resp calc15
> [2012-01-25T09:09:52] debug2: node_did_resp calc16
> [2012-01-25T09:09:52] debug2: node_did_resp calc17
> [2012-01-25T09:09:52] debug2: node_did_resp calc18
> [2012-01-25T09:09:52] debug2: node_did_resp calc19
> [2012-01-25T09:09:52] debug2: node_did_resp calc20
> [2012-01-25T09:09:52] debug2: node_did_resp calc21
> [2012-01-25T09:09:52] debug2: node_did_resp calc22
> [2012-01-25T09:09:52] debug2: node_did_resp calc23
> [2012-01-25T09:09:52] debug2: node_did_resp calc24
> [2012-01-25T09:09:54] debug2: Testing job time limits and checkpoints
> [2012-01-25T09:10:24] debug2: Testing job time limits and checkpoints
> [2012-01-25T09:10:24] debug2: Performing purge of old job records
> [2012-01-25T09:10:24] debug2: purge_old_job: purged 1 old job records
> [2012-01-25T09:10:24] debug: sched: Running job scheduler
> [2012-01-25T09:10:49] debug: backfill: no jobs to backfill
> [2012-01-25T09:10:54] debug2: Testing job time limits and checkpoints
> [2012-01-25T09:11:05] debug2: Processing RPC: REQUEST_SUBMIT_BATCH_JOB from uid=907
> [2012-01-25T09:11:05] debug2: initial priority for job 362 is 10016
> [2012-01-25T09:11:05] debug2: found 2 usable nodes from config containing asterix[2-3]
> [2012-01-25T09:11:05] debug2: sched: JobId=362 allocated resources: NodeList=(null)
> [2012-01-25T09:11:05] _slurm_rpc_submit_batch_job JobId=362 usec=538
> [2012-01-25T09:11:05] debug: sched: Running job scheduler
> [2012-01-25T09:11:05] debug2: found 2 usable nodes from config containing asterix[2-3]
> [2012-01-25T09:11:05] sched: Allocate JobId=362 NodeList=asterix2 #CPUs=1
> [2012-01-25T09:11:05] error: prolog_slurmctld job 362 prolog exit status 1:0
> [2012-01-25T09:11:05] debug2: Spawning RPC agent for msg_type 6011
> [2012-01-25T09:11:05] error: slurm_jobcomp plugin context not initialized
> [2012-01-25T09:11:05] debug2: got 1 threads to send out
> [2012-01-25T09:11:05] debug2: Tree head got back 0 looking for 1
> [2012-01-25T09:11:05] debug2: Tree head got back 1
> [2012-01-25T09:11:05] debug2: Tree head got them all
> [2012-01-25T09:11:05] requeue batch job 362
> [2012-01-25T09:11:05] debug2: node_did_resp asterix2
> [2012-01-25T09:11:05] debug: sched: Running job scheduler
> [2012-01-25T09:11:19] debug: backfill: no jobs to backfill
> [2012-01-25T09:11:24] debug2: Testing job time limits and checkpoints
> [2012-01-25T09:11:24] debug2: Performing purge of old job records
> [2012-01-25T09:11:24] debug: sched: Running job scheduler
> [2012-01-25T09:11:24] debug2: found 2 usable nodes from config containing asterix[2-3]
> [2012-01-25T09:11:24] sched: Allocate JobId=362 NodeList=asterix2 #CPUs=1
> [2012-01-25T09:11:24] debug2: Performing full system state save
> [2012-01-25T09:11:24] error: prolog_slurmctld job 362 prolog exit status 1:0
> [2012-01-25T09:11:24] prolog_slurmctld failed again for job 362
> [2012-01-25T09:11:24] debug2: Spawning RPC agent for msg_type 4022
> [2012-01-25T09:11:24] debug2: got 1 threads to send out
> [2012-01-25T09:11:24] debug2: Spawning RPC agent for msg_type 6011
> [2012-01-25T09:11:24] error: slurm_jobcomp plugin context not initialized
> [2012-01-25T09:11:24] job_signal 9 of running job 362 successful
> [2012-01-25T09:11:24] debug2: got 1 threads to send out
> [2012-01-25T09:11:24] debug2: Tree head got back 0 looking for 1
> [2012-01-25T09:11:24] debug2: Tree head got back 1
> [2012-01-25T09:11:24] debug2: Tree head got them all
> [2012-01-25T09:11:24] debug2: node_did_resp asterix2
> [2012-01-25T09:11:24] debug: sched: Running job scheduler
> [2012-01-25T09:11:24] debug2: node_did_resp asterix2
> [2012-01-25T09:11:49] debug: backfill: no jobs to backfill
> [2012-01-25T09:11:54] debug2: Testing job time limits and checkpoints
>
> As you can see:
>
> [2012-01-25T09:11:24] error: prolog_slurmctld job 362 prolog exit status 1:0
> [2012-01-25T09:11:24] prolog_slurmctld failed again for job 362
>
> On the compute node, in slurmd.log:
>
> [...]
> [2012-01-25T09:09:48] got reconfigure request
> [2012-01-25T09:11:05] debug: _rpc_terminate_job, uid = 106
> [2012-01-25T09:11:05] debug: task_slurmd_release_resources: 362
> [2012-01-25T09:11:05] debug: credential for job 362 revoked
> [2012-01-25T09:11:24] debug: _rpc_job_notify, uid = 106, jobid = 362
> [2012-01-25T09:11:24] debug: _rpc_terminate_job, uid = 106
> [2012-01-25T09:11:24] debug: task_slurmd_release_resources: 362
> [2012-01-25T09:11:24] debug: job 362 requeued, but started no tasks
> [2012-01-25T09:11:24] debug: credential for job 362 revoked
>
> Note: the script I am using prints the SLURM_JOB_PARTITION and
> SLURM_JOB_ID variables.
> I have also set the "Prolog" option, and that one works fine.
>
> Sincerely,
>
> Luis Felipe Ruiz Nieto

--
Carles Fenoy
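PS: for reference, a minimal PrologSlurmctld script that prints those variables and still lets the job start could look like the sketch below. This is only an illustration, not your actual script; I am assuming bash, and the log path is made up:

    #!/bin/bash
    # slurmctld runs this script on the controller node as SlurmUser and
    # exports job information such as SLURM_JOB_ID and SLURM_JOB_PARTITION.
    echo "$(date) job=${SLURM_JOB_ID} partition=${SLURM_JOB_PARTITION}" >> /tmp/prologslurmctld.log
    # Any nonzero exit status makes slurmctld treat the prolog as failed and
    # requeue the job (the "prolog exit status 1:0" errors in your log), so
    # exit 0 explicitly on success.
    exit 0

Also check that the script is executable and readable by the SlurmUser on the controller node.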