Thanks, I've tried padb first to get stack traces. This is from
IMB-MPI1 hanging after one hour; the last output was:

# Benchmarking Alltoall
# #processes = 1024
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.04         0.09         0.05
            1         1000       253.40       335.35       293.06
            2         1000       266.93       346.65       306.23
            4         1000       303.52       382.41       342.21
            8         1000       383.89       493.56       439.34
           16         1000       501.27       627.84       569.80
           32         1000      1039.65      1259.70      1163.12
           64         1000      1710.12      2071.47      1910.62
          128         1000      3051.68      3653.44      3398.65
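For reference, an invocation along these lines should collect the grouped
traces shown further below for a running Slurm job; the exact flags are from
memory of the padb documentation, so treat them as an assumption and check
padb --help first:

  # sketch, not verified here:
  #   -Ormgr=slurm  select the Slurm resource-manager interface
  #   -a            operate on all of my jobs
  #   -x            gather stack traces
  #   -t            merge/group them into a tree
  padb -Ormgr=slurm -axt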
On Fri, Dec 1, 2017 at 4:23 PM, Gilles Gouaillardet
<gilles.gouaillar...@gmail.com> wrote:
> FWIW,
>
> pstack <pid>
> Is a gdb wrapper that displays the stack trace.
>
> PADB http://padb.pittman.org.uk is a great OSS that automatically collect
> the stack traces of all the MPI tasks (and can do some grouping similar to
> dshbak)
>
> Cheers,
>
> Gilles
>
>
> Noam Bernstein <noam.bernst...@nrl.navy.mil> wrote:
>
> On Dec 1, 2017, at 8:10 AM, Götz Waschk <goetz.was...@gmail.com> wrote:
>
> On Fri, Dec 1, 2017 at 10:13 AM, Götz Waschk <goetz.was...@gmail.com> wrote:
>
> I have attached my slurm job script, it will simply do an mpirun
> IMB-MPI1 with 1024 processes. I haven't set any mca parameters, so for
> instance, vader is enabled.
>
> I have tested again, with
> mpirun --mca btl "^vader" IMB-MPI1
> it made no difference.
>
>
> I’ve lost track of the earlier parts of this thread, but has anyone
> suggested logging into the nodes it’s running on, doing “gdb -p PID” for
> each of the mpi processes, and doing “where” to see where it’s hanging?
>
> I use this script (trace_all), which depends on a variable process that is a
> grep regexp that matches the mpi executable:
>
> echo "where" > /tmp/gf
>
> pids=`ps aux | grep $process | grep -v grep | grep -v trace_all | awk '{print \$2}'`
> for pid in $pids; do
>    echo $pid
>    prog=`ps auxw | grep " $pid " | grep -v grep | awk '{print $11}'`
>    gdb -x /tmp/gf -batch $prog $pid
>    echo ""
> done

--
AL I:40: Do what thou wilt shall be the whole of the Law.
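As a follow-up to Noam's trace_all script: to run it on every node of the job
in one go, something like this should work from within the allocation (a
sketch; it assumes passwordless ssh to the compute nodes and that the binary
is called IMB-MPI1):

  # expand the Slurm hostlist and feed the gdb batch script to each node,
  # setting the process regexp that trace_all expects in $process
  for node in $(scontrol show hostnames "$SLURM_JOB_NODELIST"); do
      echo "=== $node ==="
      ssh "$node" "process=IMB-MPI1 bash -s" < trace_all
  done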
Stack trace(s) for thread: 1
-----------------
[0-1023] (1024 processes)
-----------------
main() at ?:?
  IMB_init_buffers_iter() at ?:?
    IMB_alltoall() at ?:?
      -----------------
      [0-31,35,42,118,163,235] (37 processes)
      -----------------
      PMPI_Barrier() at ?:?
        ompi_coll_base_barrier_intra_recursivedoubling() at ?:?
          ompi_request_default_wait() at ?:?
            opal_progress() at ?:?
      -----------------
      [32-34,36-41,43-117,119-162,164-234,236-1023] (987 processes)
      -----------------
      PMPI_Alltoall() at ?:?
        ompi_coll_base_alltoall_intra_basic_linear() at ?:?
          ompi_request_default_wait_all() at ?:?
            -----------------
            [32-34,36-41,43-117,119-162,164-234,236-413,415-532,534-651,653-744,746-894,896-1023] (982 processes)
            -----------------
            opal_progress() at ?:?
            -----------------
            [533] (1 processes)
            -----------------
            opal_progress@plt() at ?:?
Stack trace(s) for thread: 2
-----------------
[0-1023] (1024 processes)
-----------------
start_thread() at ?:?
  progress_engine() at ?:?
    opal_libevent2022_event_base_loop() at event.c:1630
      epoll_dispatch() at epoll.c:407
        epoll_wait() at ?:?
Stack trace(s) for thread: 3
-----------------
[0-1023] (1024 processes)
-----------------
start_thread() at ?:?
  progress_engine() at ?:?
    opal_libevent2022_event_base_loop() at event.c:1630
      poll_dispatch() at poll.c:165
        poll() at ?:?
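Rank 533 looks like the odd one out (sitting in opal_progress@plt while the
other Alltoall ranks are in opal_progress), so the next step might be to
attach gdb to just that process on its node and dump all threads in full,
roughly like this (standard gdb batch mode; the pgrep pattern assumes the
IMB-MPI1 binary name and that the wanted rank is the first match on the node):

  pid=$(pgrep -f IMB-MPI1 | head -n 1)   # pick the right PID if several ranks run on the node
  gdb -batch -ex 'thread apply all bt full' -p "$pid"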