SLURM Dev,

I'm hoping you can help me with something. I recently needed to figure out what's going on inside a running batch job. I suspected that a job that should use, say, 48 cores spread over a few nodes had somehow packed them all onto a single node because of a mistake in my mpirun command.

So, I have a job-id and I know I can do, say:

  srun --jobid=<JOBID> ps -ef

and I'll get ps output from every node in that allocation. But if that allocation has, say, 14 nodes, I get 14 nodes' worth of output that is hard to pick apart, since ps doesn't print the hostname[1].

I thought there might be a way to run the srun command on just one of the nodes in the allocation, so I tried:

  srun --jobid=<JOBID> --nodelist=node1 ps -ef

where node1 is one of the nodes in the allocation. But no, that doesn't seem to do what I'd hoped: ps still runs on every node.

Now, I'm sure I could whip up a bash script which tests for the hostname and runs a command only if that matches the one I want, but I was hoping for a nice simple way with srun itself to do this.
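
For reference, the kind of wrapper I have in mind is roughly the following. This is only a sketch: the script name only_on.sh and the function run_on_host are names I made up for illustration, and it assumes the script sits on a filesystem that every node in the allocation can see.

```shell
#!/bin/bash
# only_on.sh (hypothetical name): run a command only on the node whose
# short hostname matches the first argument; do nothing everywhere else.

run_on_host() {
    local target="$1"
    shift
    # Compare short names so "node1" also matches "node1.example.com".
    local me
    me="$(uname -n)"
    me="${me%%.*}"
    if [ "$me" = "${target%%.*}" ]; then
        "$@"
    fi
}

# When called as a script: only_on.sh <host> <command> [args...]
if [ "$#" -ge 2 ]; then
    run_on_host "$@"
fi
```

I would then expect to call it like this, with every node except node1 exiting quietly:

  srun --jobid=<JOBID> ./only_on.sh node1 ps -ef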

Matt

[1] That I know of. I didn't see "hostname" in the ps manpage.
--
Matt Thompson          SSAI, Sr Software Test Engr
NASA GSFC, Global Modeling and Assimilation Office
Code 610.1, 8800 Greenbelt Rd, Greenbelt, MD 20771
Phone: 301-614-6712              Fax: 301-614-6246
