Great! You've done a good job! Thank you!

2012/6/6 Moe Jette <[email protected]>
>
> Most of the RPCs are logged using debug2() messages, so increasing the
> SlurmctldDebug value by one should make this much clearer, and you
> should see something like this:
>
> Job allocation:
> slurmctld: debug2: sched: Processing RPC: REQUEST_RESOURCE_ALLOCATION from uid=1001
> slurmctld: debug2: found 4 usable nodes from config containing smd[1-4]
> slurmctld: debug2: sched: JobId=1267485 allocated resources: NodeList=smd1
> slurmctld: sched: _slurm_rpc_allocate_resources JobId=1267485 NodeList=smd1 usec=198
> slurmctld: debug2: _slurm_rpc_job_ready(1267485)=3 usec=3
>
> Step allocation:
> slurmctld: debug2: Processing RPC: REQUEST_JOB_STEP_CREATE from uid=1001
> slurmctld: debug: Configuration for job 1267485 complete
>
> Task layout (done as part of step creation):
> slurmctld: debug: laying out the 1 tasks on 1 hosts smd1 dist 1
> slurmctld: sched: _slurm_rpc_job_step_create: StepId=1267485.0 smd1 usec=224
>
> Step termination:
> slurmctld: debug: Processing RPC: REQUEST_STEP_COMPLETE for 1267485.0 nodes 0-0 rc=0 uid=0
> slurmctld: sched: _slurm_rpc_step_complete StepId=1267485.0 usec=56
>
> Job termination:
> slurmctld: debug2: Processing RPC: REQUEST_COMPLETE_JOB_ALLOCATION from uid=1001, JobId=1267485 rc=0
> slurmctld: completing job 1267485
> slurmctld: debug2: Spawning RPC agent for msg_type 7004
> slurmctld: debug2: Spawning RPC agent for msg_type 6011
> slurmctld: sched: job_complete for JobId=1267485 successful
> slurmctld: debug2: _slurm_rpc_complete_job_allocation JobId=1267485 usec=270
> slurmctld: debug2: got 1 threads to send out
> slurmctld: debug2: got 1 threads to send out
> slurmctld: debug2: Tree head got back 0 looking for 1
> slurmctld: debug2: Tree head got back 1
> slurmctld: debug2: Tree head got them all
> slurmctld: debug2: node_did_resp smd1
> slurmctld: debug2: Processing RPC: MESSAGE_EPILOG_COMPLETE uid=0
> slurmctld: debug2: _slurm_rpc_epilog_complete JobId=1267485 Node=smd1 usec=45
>
>
> Quoting Sergio Iserte Agut <[email protected]>:
>
> > Hello,
> >
> > I'm trying to understand what happens when a job is submitted.
> >
> > On the one hand, I've read the *Job Launch Design Guide*
> > (https://computing.llnl.gov/linux/slurm/job_launch.html) and I've found
> > that there are five points:
> >
> >    - Job allocation
> >    - Step allocation
> >    - Task allocation
> >    - Job Step termination
> >    - Job termination
> >
> > On the other hand, when we submit a job, for instance:
> >
> >> srun --gres=gpu:1 sleep 5
> >
> > we see the following output in our log file:
> >
> >> sched: _slurm_rpc_allocate_resources JobId=238 NodeList=matraca1 usec=244
> >> debug: Configuration for job 238 complete
> >> debug: laying out the 1 tasks on 1 hosts matraca1 dist 1
> >> sched: _slurm_rpc_job_step_create: StepId=238.0 matraca1 usec=456
> >> debug: Processing RPC: REQUEST_STEP_COMPLETE for 238.0 nodes 0-0 rc=0 uid=0
> >> sched: _slurm_rpc_step_complete StepId=238.0 usec=48
> >> completing job 238
> >> sched: job_complete for JobId=238 successful
> >
> > I believe they are related, but can anybody say which lines from the
> > debug file correspond to each point of the job launch process?
> >
> > Moreover, I'd like to know what the role of the job steps is in an
> > execution (or how they work).
> >
> > Thank you,
> > Regards!
> >
> > --
> > Sergio Iserte Agut, assistant researcher,
> > High Performance Computing & Architecture, University Jaume I (Spain)
>

--
Sergio Iserte Agut, assistant researcher,
High Performance Computing & Architecture, University Jaume I (Spain)
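
For anyone finding this thread in the archive, the grouping above can be sketched as a small script that tags slurmctld log lines with a launch phase. This is only an illustration: the marker strings are copied from the log excerpts quoted in this thread, while the names PHASE_MARKERS, classify and group_by_phase are made up for the example and are not part of SLURM. Log wording may differ between SLURM versions, so treat the table as an assumption to adjust.

```python
# Rough sketch: tag slurmctld log lines with the job launch phase they
# belong to. The substrings come from the log lines quoted in this thread;
# the names used here are hypothetical and not part of SLURM itself.

# (marker substring, phase). The first matching marker wins.
PHASE_MARKERS = [
    ("REQUEST_RESOURCE_ALLOCATION", "Job allocation"),
    ("_slurm_rpc_allocate_resources", "Job allocation"),
    ("_slurm_rpc_job_ready", "Job allocation"),
    ("REQUEST_JOB_STEP_CREATE", "Step allocation"),
    ("Configuration for job", "Step allocation"),
    ("laying out the", "Task layout"),
    ("_slurm_rpc_job_step_create", "Task layout"),
    ("REQUEST_STEP_COMPLETE", "Step termination"),
    ("_slurm_rpc_step_complete", "Step termination"),
    ("REQUEST_COMPLETE_JOB_ALLOCATION", "Job termination"),
    ("_slurm_rpc_complete_job_allocation", "Job termination"),
    ("completing job", "Job termination"),
    ("job_complete for", "Job termination"),
    ("MESSAGE_EPILOG_COMPLETE", "Job termination"),
    ("_slurm_rpc_epilog_complete", "Job termination"),
]


def classify(line):
    """Return the phase name for one log line, or None if no marker matches."""
    for marker, phase in PHASE_MARKERS:
        if marker in line:
            return phase
    return None


def group_by_phase(lines):
    """Group log lines by phase, keeping the order in which phases first appear."""
    grouped = {}
    for line in lines:
        phase = classify(line)
        if phase is not None:
            grouped.setdefault(phase, []).append(line.strip())
    return grouped


# The log lines from the original question (srun --gres=gpu:1 sleep 5).
SAMPLE = [
    "sched: _slurm_rpc_allocate_resources JobId=238 NodeList=matraca1 usec=244",
    "debug: Configuration for job 238 complete",
    "debug: laying out the 1 tasks on 1 hosts matraca1 dist 1",
    "sched: _slurm_rpc_job_step_create: StepId=238.0 matraca1 usec=456",
    "debug: Processing RPC: REQUEST_STEP_COMPLETE for 238.0 nodes 0-0 rc=0 uid=0",
    "sched: _slurm_rpc_step_complete StepId=238.0 usec=48",
    "completing job 238",
    "sched: job_complete for JobId=238 successful",
]

if __name__ == "__main__":
    for phase, entries in group_by_phase(SAMPLE).items():
        print(phase + ":")
        for entry in entries:
            print("  " + entry)
```

Lines that match no marker (for example the "Tree head" agent messages) are simply skipped, so the table can be extended as needed.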
