Also make sure the script exits with status zero so the job can continue. Any
non-zero exit code is treated as an error condition.
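To illustrate the exit-code rule, here is a minimal sketch of a PrologSlurmctld script (the path, log location, and contents are assumptions for illustration, not the poster's actual script); it records the job variables somewhere the SlurmUser can write and then returns success:

```shell
# Write a hypothetical prolog script to a temp path and mark it executable.
cat > /tmp/prologslurmctld <<'EOF'
#!/bin/sh
# slurmctld exports SLURM_JOB_ID, SLURM_JOB_PARTITION, etc. for the job.
# Log them to a file the SlurmUser can write (assumed location).
echo "job ${SLURM_JOB_ID:-unset} partition ${SLURM_JOB_PARTITION:-unset}" \
    >> /tmp/prologslurmctld.log
# Exit 0 so the job is allowed to start; any non-zero status makes
# slurmctld log "prolog exit status N:0" and requeue the job.
exit 0
EOF
chmod +x /tmp/prologslurmctld

# Simulate the controller invoking it for a job:
SLURM_JOB_ID=362 SLURM_JOB_PARTITION=sec4000 /tmp/prologslurmctld
echo "prolog exit status: $?"
# → prolog exit status: 0
```

Running the script by hand like this (as the SlurmUser, on the controller node) is a quick way to see whether it is the source of the non-zero status in the log.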
Quoting Carles Fenoy <mini...@gmail.com>:
Hola Luis,
Have you tried executing the PrologSlurmctld command as the SlurmUser
on the controller node? Does it work?
Carles Fenoy
On Wed, Jan 25, 2012 at 2:10 PM, luis <luis.r...@uam.es> wrote:
Dear All:
I am trying to use PrologSlurmctld in Slurm. To do this, I have set the
following option in the configuration file:
PrologSlurmctld=/usr/local/etc/bin/prologoslurmctld
The problem is that when I submit a job to the queue system, the job
remains pending and never launches.
-bash-3.2$ /usr/local/slurm-2.3.2/bin/sbatch -p sec4000 --qos=sec4000
lanza09-2-b agua
Submitted batch job 364
-bash-3.2$ /usr/local/slurm-2.3.2/bin/squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
364 sec4000 lanza09- lfelipe PD 0:00 1 (None)
-bash-3.2$ /usr/local/slurm-2.3.2/bin/scontrol show job 364
JobId=364 Name=lanza09-2-b
UserId=lfelipe(907) GroupId=root(0)
Priority=10016 Account=cccuam QOS=sec4000
JobState=PENDING Reason=None Dependency=(null)
Requeue=1 Restarts=1 BatchFlag=1 ExitCode=0:0
RunTime=00:00:00 TimeLimit=5-00:00:00 TimeMin=N/A
SubmitTime=2012-01-25T10:17:26 EligibleTime=2012-01-25T10:17:36
StartTime=Unknown EndTime=Unknown
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=sec4000 AllocNode:Sid=terpsichore:2435
ReqNodeList=(null) ExcNodeList=(null)
NodeList=(null)
BatchHost=asterix2
NumNodes=1 NumCPUs=1-1 CPUs/Task=1 ReqS:C:T=*:*:*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) Gres=(null) Reservation=(null)
Shared=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/home/lfelipe/pruebas/slurm/gaussian/lanza09-2-b agua
WorkDir=/home/lfelipe/pruebas/slurm/gaussian
As you can see in slurmctld.log on the controller node:
.
.
.
[2012-01-25T09:09:52] debug2: node_did_resp calc1
[2012-01-25T09:09:52] debug2: node_did_resp calc2
[2012-01-25T09:09:52] debug2: node_did_resp calc3
[2012-01-25T09:09:52] debug2: node_did_resp calc4
[2012-01-25T09:09:52] debug2: node_did_resp calc5
[2012-01-25T09:09:52] debug2: node_did_resp calc6
[2012-01-25T09:09:52] debug2: node_did_resp calc7
[2012-01-25T09:09:52] debug2: node_did_resp calc8
[2012-01-25T09:09:52] debug2: node_did_resp calc9
[2012-01-25T09:09:52] debug2: node_did_resp calc10
[2012-01-25T09:09:52] debug2: node_did_resp calc11
[2012-01-25T09:09:52] debug2: node_did_resp calc15
[2012-01-25T09:09:52] debug2: node_did_resp calc16
[2012-01-25T09:09:52] debug2: node_did_resp calc17
[2012-01-25T09:09:52] debug2: node_did_resp calc18
[2012-01-25T09:09:52] debug2: node_did_resp calc19
[2012-01-25T09:09:52] debug2: node_did_resp calc20
[2012-01-25T09:09:52] debug2: node_did_resp calc21
[2012-01-25T09:09:52] debug2: node_did_resp calc22
[2012-01-25T09:09:52] debug2: node_did_resp calc23
[2012-01-25T09:09:52] debug2: node_did_resp calc24
[2012-01-25T09:09:54] debug2: Testing job time limits and checkpoints
[2012-01-25T09:10:24] debug2: Testing job time limits and checkpoints
[2012-01-25T09:10:24] debug2: Performing purge of old job records
[2012-01-25T09:10:24] debug2: purge_old_job: purged 1 old job records
[2012-01-25T09:10:24] debug: sched: Running job scheduler
[2012-01-25T09:10:49] debug: backfill: no jobs to backfill
[2012-01-25T09:10:54] debug2: Testing job time limits and checkpoints
[2012-01-25T09:11:05] debug2: Processing RPC: REQUEST_SUBMIT_BATCH_JOB from
uid=907
[2012-01-25T09:11:05] debug2: initial priority for job 362 is 10016
[2012-01-25T09:11:05] debug2: found 2 usable nodes from config containing
asterix[2-3]
[2012-01-25T09:11:05] debug2: sched: JobId=362 allocated resources:
NodeList=(null)
[2012-01-25T09:11:05] _slurm_rpc_submit_batch_job JobId=362 usec=538
[2012-01-25T09:11:05] debug: sched: Running job scheduler
[2012-01-25T09:11:05] debug2: found 2 usable nodes from config containing
asterix[2-3]
[2012-01-25T09:11:05] sched: Allocate JobId=362 NodeList=asterix2 #CPUs=1
[2012-01-25T09:11:05] error: prolog_slurmctld job 362 prolog exit status 1:0
[2012-01-25T09:11:05] debug2: Spawning RPC agent for msg_type 6011
[2012-01-25T09:11:05] error: slurm_jobcomp plugin context not initialized
[2012-01-25T09:11:05] debug2: got 1 threads to send out
[2012-01-25T09:11:05] debug2: Tree head got back 0 looking for 1
[2012-01-25T09:11:05] debug2: Tree head got back 1
[2012-01-25T09:11:05] debug2: Tree head got them all
[2012-01-25T09:11:05] requeue batch job 362
[2012-01-25T09:11:05] debug2: node_did_resp asterix2
[2012-01-25T09:11:05] debug: sched: Running job scheduler
[2012-01-25T09:11:19] debug: backfill: no jobs to backfill
[2012-01-25T09:11:24] debug2: Testing job time limits and checkpoints
[2012-01-25T09:11:24] debug2: Performing purge of old job records
[2012-01-25T09:11:24] debug: sched: Running job scheduler
[2012-01-25T09:11:24] debug2: found 2 usable nodes from config containing
asterix[2-3]
[2012-01-25T09:11:24] sched: Allocate JobId=362 NodeList=asterix2 #CPUs=1
[2012-01-25T09:11:24] debug2: Performing full system state save
[2012-01-25T09:11:24] error: prolog_slurmctld job 362 prolog exit status 1:0
[2012-01-25T09:11:24] prolog_slurmctld failed again for job 362
[2012-01-25T09:11:24] debug2: Spawning RPC agent for msg_type 4022
[2012-01-25T09:11:24] debug2: got 1 threads to send out
[2012-01-25T09:11:24] debug2: Spawning RPC agent for msg_type 6011
[2012-01-25T09:11:24] error: slurm_jobcomp plugin context not initialized
[2012-01-25T09:11:24] job_signal 9 of running job 362 successful
[2012-01-25T09:11:24] debug2: got 1 threads to send out
[2012-01-25T09:11:24] debug2: Tree head got back 0 looking for 1
[2012-01-25T09:11:24] debug2: Tree head got back 1
[2012-01-25T09:11:24] debug2: Tree head got them all
[2012-01-25T09:11:24] debug2: node_did_resp asterix2
[2012-01-25T09:11:24] debug: sched: Running job scheduler
[2012-01-25T09:11:24] debug2: node_did_resp asterix2
[2012-01-25T09:11:49] debug: backfill: no jobs to backfill
[2012-01-25T09:11:54] debug2: Testing job time limits and checkpoints
As you can see, the relevant errors are:
[2012-01-25T09:11:24] error: prolog_slurmctld job 362 prolog exit status 1:0
[2012-01-25T09:11:24] prolog_slurmctld failed again for job 362
On the client node, slurmd.log shows:
.
.
.
[2012-01-25T09:09:48] got reconfigure request
[2012-01-25T09:11:05] debug: _rpc_terminate_job, uid = 106
[2012-01-25T09:11:05] debug: task_slurmd_release_resources: 362
[2012-01-25T09:11:05] debug: credential for job 362 revoked
[2012-01-25T09:11:24] debug: _rpc_job_notify, uid = 106, jobid = 362
[2012-01-25T09:11:24] debug: _rpc_terminate_job, uid = 106
[2012-01-25T09:11:24] debug: task_slurmd_release_resources: 362
[2012-01-25T09:11:24] debug: job 362 requeued, but started no tasks
[2012-01-25T09:11:24] debug: credential for job 362 revoked
Note: the script I am using prints the SLURM_JOB_PARTITION and
SLURM_JOB_ID variables. I have also set the "Prolog" option, and that
one works fine.
Sincerely,
Luis Felipe Ruiz Nieto
--
Carles Fenoy