Most of the RPCs are logged using debug2() messages, so raising the
SlurmctldDebug value by one level should make this much clearer. You
should then see something like this:

Job allocation:
slurmctld: debug2: sched: Processing RPC: REQUEST_RESOURCE_ALLOCATION from uid=1001
slurmctld: debug2: found 4 usable nodes from config containing smd[1-4]
slurmctld: debug2: sched: JobId=1267485 allocated resources: NodeList=smd1
slurmctld: sched: _slurm_rpc_allocate_resources JobId=1267485 NodeList=smd1 usec=198
slurmctld: debug2: _slurm_rpc_job_ready(1267485)=3 usec=3

Step allocation:
slurmctld: debug2: Processing RPC: REQUEST_JOB_STEP_CREATE from uid=1001
slurmctld: debug:  Configuration for job 1267485 complete

Task layout (done as part of step creation):
slurmctld: debug:  laying out the 1 tasks on 1 hosts smd1 dist 1
slurmctld: sched: _slurm_rpc_job_step_create: StepId=1267485.0 smd1 usec=224

Step termination:
slurmctld: debug:  Processing RPC: REQUEST_STEP_COMPLETE for 1267485.0 nodes 0-0 rc=0 uid=0
slurmctld: sched: _slurm_rpc_step_complete StepId=1267485.0 usec=56

Job termination:
slurmctld: debug2: Processing RPC: REQUEST_COMPLETE_JOB_ALLOCATION from uid=1001, JobId=1267485 rc=0
slurmctld: completing job 1267485
slurmctld: debug2: Spawning RPC agent for msg_type 7004
slurmctld: debug2: Spawning RPC agent for msg_type 6011
slurmctld: sched: job_complete for JobId=1267485 successful
slurmctld: debug2: _slurm_rpc_complete_job_allocation JobId=1267485 usec=270
slurmctld: debug2: got 1 threads to send out
slurmctld: debug2: got 1 threads to send out
slurmctld: debug2: Tree head got back 0 looking for 1
slurmctld: debug2: Tree head got back 1
slurmctld: debug2: Tree head got them all
slurmctld: debug2: node_did_resp smd1
slurmctld: debug2: Processing RPC: MESSAGE_EPILOG_COMPLETE uid=0
slurmctld: debug2: _slurm_rpc_epilog_complete JobId=1267485 Node=smd1 usec=45
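
If you want to sort your own log lines into these phases, a rough
sketch in Python could look like the one below. The patterns are copied
from the messages shown above. The names PHASES and classify are my own
and not part of Slurm, and the message text may differ between Slurm
versions, so treat this as a starting point only.

```python
import re

# Patterns are taken from the log excerpts in this message. They are
# illustrative only: slurmctld message text can change between versions.
PHASES = [
    ("Job allocation", re.compile(
        r"REQUEST_RESOURCE_ALLOCATION|_slurm_rpc_allocate_resources"
        r"|allocated resources|usable nodes|_slurm_rpc_job_ready")),
    ("Step allocation", re.compile(
        r"REQUEST_JOB_STEP_CREATE|Configuration for job")),
    ("Task layout", re.compile(
        r"laying out the|_slurm_rpc_job_step_create")),
    ("Step termination", re.compile(
        r"REQUEST_STEP_COMPLETE|_slurm_rpc_step_complete")),
    ("Job termination", re.compile(
        r"REQUEST_COMPLETE_JOB_ALLOCATION|completing job|job_complete for"
        r"|_slurm_rpc_complete_job_allocation"
        r"|MESSAGE_EPILOG_COMPLETE|_slurm_rpc_epilog_complete")),
]


def classify(line):
    """Return the launch phase a log line belongs to, or None if unknown."""
    for phase, pattern in PHASES:
        if pattern.search(line):
            return phase
    return None


if __name__ == "__main__":
    sample = [
        "sched: _slurm_rpc_allocate_resources JobId=238 NodeList=matraca1 usec=244",
        "debug:  Configuration for job 238 complete",
        "debug:  laying out the 1 tasks on 1 hosts matraca1 dist 1",
        "sched: _slurm_rpc_step_complete StepId=238.0 usec=48",
        "completing job 238",
    ]
    for line in sample:
        print(classify(line), "<-", line)
```

Lines that match none of the patterns (for example the "Spawning RPC
agent" or "Tree head" messages) come back as None, so you can see which
messages still need to be placed by hand.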


Quoting Sergio Iserte Agut <[email protected]>:

> Hello,
>
> I'm trying to understand what happens when a job is submitted.
>
> On the one hand, I've read the *Job Launch Design Guide* (
> https://computing.llnl.gov/linux/slurm/job_launch.html) and I've found that
> there are five points:
>
>    - Job allocation
>    - Step allocation
>    - Task allocation
>    - Job Step termination
>    - Job termination
>
>
> On the other hand, when we submit a job, for instance:
>
>> srun --gres=gpu:1  sleep 5
>
> we are able to see the following output in our log file:
>
>> sched: _slurm_rpc_allocate_resources JobId=238 NodeList=matraca1 usec=244
>> debug:  Configuration for job 238 complete
>> debug:  laying out the 1 tasks on 1 hosts matraca1 dist 1
>> sched: _slurm_rpc_job_step_create: StepId=238.0 matraca1 usec=456
>> debug:  Processing RPC: REQUEST_STEP_COMPLETE for 238.0 nodes 0-0 rc=0 uid=0
>> sched: _slurm_rpc_step_complete StepId=238.0 usec=48
>> completing job 238
>> sched: job_complete for JobId=238 successful
>
>
> I believe the two are related, but can anybody say which lines from
> the debug file correspond to each point of the job launch process?
>
> Moreover, I'd like to know what role the job steps play in an
> execution (or how they work).
>
> Thank you,
>     Regards!
> --
> *Sergio Iserte Agut, assistant researcher,*
> *High Performance Computing & Architecture, University Jaume I (Spain)*
>
