Most of the RPCs are logged using debug2() messages, so raising the SlurmctldDebug value by one level should make this much clearer. You should then see something like the log excerpts below, grouped by launch phase:
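For reference, a minimal sketch of raising the level, assuming the controller currently logs at debug (the quoted log already shows "debug:" lines, so one step up would be debug2). Where slurm.conf lives and how the change is picked up depend on the site and are not shown here:

```conf
# slurm.conf fragment (assumption: the current level is "debug")
SlurmctldDebug=debug2

# Alternative that avoids editing the file: change the level of the
# running controller with
#   scontrol setdebug debug2
```

The debug2 output is verbose, so it is probably worth dropping back to the previous level once the trace has been captured. With the level raised, a single srun produces controller output like the following.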

Job allocation:
slurmctld: debug2: sched: Processing RPC: REQUEST_RESOURCE_ALLOCATION from uid=1001
slurmctld: debug2: found 4 usable nodes from config containing smd[1-4]
slurmctld: debug2: sched: JobId=1267485 allocated resources: NodeList=smd1
slurmctld: sched: _slurm_rpc_allocate_resources JobId=1267485 NodeList=smd1 usec=198
slurmctld: debug2: _slurm_rpc_job_ready(1267485)=3 usec=3

Step allocation:
slurmctld: debug2: Processing RPC: REQUEST_JOB_STEP_CREATE from uid=1001
slurmctld: debug: Configuration for job 1267485 complete

Task layout (done as part of step creation):
slurmctld: debug: laying out the 1 tasks on 1 hosts smd1 dist 1
slurmctld: sched: _slurm_rpc_job_step_create: StepId=1267485.0 smd1 usec=224

Step termination:
slurmctld: debug: Processing RPC: REQUEST_STEP_COMPLETE for 1267485.0 nodes 0-0 rc=0 uid=0
slurmctld: sched: _slurm_rpc_step_complete StepId=1267485.0 usec=56

Job termination:
slurmctld: debug2: Processing RPC: REQUEST_COMPLETE_JOB_ALLOCATION from uid=1001, JobId=1267485 rc=0
slurmctld: completing job 1267485
slurmctld: debug2: Spawning RPC agent for msg_type 7004
slurmctld: debug2: Spawning RPC agent for msg_type 6011
slurmctld: sched: job_complete for JobId=1267485 successful
slurmctld: debug2: _slurm_rpc_complete_job_allocation JobId=1267485 usec=270
slurmctld: debug2: got 1 threads to send out
slurmctld: debug2: got 1 threads to send out
slurmctld: debug2: Tree head got back 0 looking for 1
slurmctld: debug2: Tree head got back 1
slurmctld: debug2: Tree head got them all
slurmctld: debug2: node_did_resp smd1
slurmctld: debug2: Processing RPC: MESSAGE_EPILOG_COMPLETE uid=0
slurmctld: debug2: _slurm_rpc_epilog_complete JobId=1267485 Node=smd1 usec=45

Quoting Sergio Iserte Agut <[email protected]>:
> Hello,
>
> I'm trying to understand what happen when a job is submitted.
>
> On the one hand, I've read the *Job Launch Design Guide*
> (https://computing.llnl.gov/linux/slurm/job_launch.html) and I've found that
> there are five points:
>
>    - Job allocation
>    - Step allocation
>    - Task allocation
>    - Job Step termination
>    - Job termination
>
> On the other hand, when we submit a job, for instance:
>
>> srun --gres=gpu:1 sleep 5
>
> we are able to see the next output in our log file:
>
>> sched: _slurm_rpc_allocate_resources JobId=238 NodeList=matraca1 usec=244
>> debug: Configuration for job 238 complete
>> debug: laying out the 1 tasks on 1 hosts matraca1 dist 1
>> sched: _slurm_rpc_job_step_create: StepId=238.0 matraca1 usec=456
>> debug: Processing RPC: REQUEST_STEP_COMPLETE for 238.0 nodes 0-0 rc=0 uid=0
>> sched: _slurm_rpc_step_complete StepId=238.0 usec=48
>> completing job 238
>> sched: job_complete for JobId=238 successful
>
> I believe they should have a relation but, can anybody say which lines from
> the debug file are related to each point of the job launch process?
>
> Moreover, I'd like to know which are the role of the job steps in an
> execution (or how do they work?).
>
> Thank you,
> Regards!
> --
> *Sergio Iserte Agut, assistant researcher,*
> *High Performance Computing & Architecture, University Jaume I (Spain)*
