SLURM Dev,
I'm hoping you can help me with something. I recently needed to figure
out what's going on inside a running batch job. I had a suspicion that
a job that should use, say, 48 cores spread over a few nodes had somehow
packed them all onto a single node due to my idiocy with an mpirun
command.
So, I have a job-id and I know I can do, say:
srun --jobid=<JOBID> ps -ef
and I'll get a ps from every node in that allocation. But, if that
allocation has, say, 14 nodes, I get 14 nodes' worth of output that
is hard to pick apart since ps doesn't print or prepend the hostname[1].
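For what it's worth, one workaround for the missing hostname would be to
tag every line on the node itself before it is sent back. A minimal
sketch, assuming bash is available on the compute nodes; tag_with_host
is a made-up helper name, not something Slurm or ps provides:

```shell
#!/usr/bin/env bash
# Hypothetical helper: prefix every line read from stdin with this
# node's name, so merged output from many nodes can be grepped per node.
tag_with_host() {
    local host line
    # uname -n reports the same node name that hostname would
    host="$(uname -n)"
    while IFS= read -r line; do
        printf '%s: %s\n' "$host" "$line"
    done
}
```

With the function saved in a file on a filesystem the nodes share (the
path here is hypothetical), something like
srun --jobid=<JOBID> bash -c 'source /shared/hosttag.sh; ps -ef | tag_with_host'
should label each line, but that is still a workaround rather than an
srun feature.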
I thought maybe there was a way to run the srun command on just one of
the nodes in the allocation, so I tried:
srun --jobid=<JOBID> --nodelist=node1 ps -ef
where node1 is one of the nodes in the allocation. But, no, that doesn't
seem to do what I'd hoped, as I still get every node running ps.
Now, I'm sure I could whip up a bash script that tests the hostname
and runs a command only if it matches the one I want, but I was hoping
for a nice simple way to do this with srun itself.
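The sort of script I have in mind would look roughly like this. It is
only a sketch, assuming bash on the compute nodes; run_on_host is a
made-up name, and the comparison ignores any domain part so that
"node1" matches "node1.some.domain":

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command only on the node whose short
# hostname matches the first argument; do nothing on every other node.
run_on_host() {
    local target="$1"
    shift
    local me
    # uname -n reports the same node name that hostname would
    me="$(uname -n)"
    # Compare short names only (strip everything after the first dot)
    if [ "${me%%.*}" = "${target%%.*}" ]; then
        "$@"
    fi
}
```

With that saved somewhere all the nodes can read (the path is
hypothetical), the call would be something like
srun --jobid=<JOBID> bash -c 'source /shared/run_on_host.sh; run_on_host node1 ps -ef'
which still launches a task on every node but only prints from node1.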
Matt
[1] That I know of. I didn't see "hostname" in the ps manpage.
--
Matt Thompson SSAI, Sr Software Test Engr
NASA GSFC, Global Modeling and Assimilation Office
Code 610.1, 8800 Greenbelt Rd, Greenbelt, MD 20771
Phone: 301-614-6712 Fax: 301-614-6246