Re: [OMPI users] Binding blocks of processes in round-robin fashion

2021-01-29 Thread Ralph Castain via users
Okay, I can't promise when I'll get to it, but I'll try to have it in time for OMPI v5. On Jan 29, 2021, at 1:30 AM, Luis Cebamanos via users <users@lists.open-mpi.org> wrote: Hi Ralph, It would be great to have it for load balancing issues. Ideally one could do something like

[OMPI users] High errorcode message

2021-01-29 Thread Arturo Fernandez via users
Hello, My system is running CentOS8 & OpenMPI v4.1.0. Most stuff is working fine but one app is aborting with: MPI_ABORT was invoked on rank 7 in communicator MPI_COMM_WORLD with errorcode 1734831948. The other 23 MPI ranks also abort. I'm a bit confused by the high error code. Does it mean
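
A quick sanity check on an errorcode this large (just a sketch, not something from the thread; assumes a shell with xxd available) is to look at the value in hex and as raw bytes:

    $ printf '%x\n' 1734831948
    67676f4c
    $ printf '%x' 1734831948 | xxd -r -p; echo
    ggoL

All four bytes are printable ASCII (in little-endian memory order they read "Logg"), which suggests the value handed to MPI_ABORT came from character data or an unintended variable rather than a deliberate small exit status.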

Re: [OMPI users] High errorcode message

2021-01-29 Thread Jeff Squyres (jsquyres) via users
It's somewhat hard to say without more information. What is your app doing when it calls abort? On Jan 29, 2021, at 8:49 PM, Arturo Fernandez via users <users@lists.open-mpi.org> wrote: Hello, My system is running CentOS8 & OpenMPI v4.1.0. Most stuff is working fine but one app is

Re: [OMPI users] Binding blocks of processes in round-robin fashion

2021-01-29 Thread Luis Cebamanos via users
Hi Ralph, It would be great to have it for load balancing issues. Ideally one could do something like --bind-to:N where N is the block size, 4 in this case: mpirun -np 40 --map-by ppr:40:node --bind-to core:4. I think it would be interesting to have it. Of course, I can always use srun but
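
For reference while the requested block syntax is not available yet: --report-bindings (present in the 4.x series) shows what layout a given --map-by/--bind-to combination actually produces. A minimal sketch, with ./app standing in for the real binary:

    $ mpirun -np 8 --map-by core --bind-to core --report-bindings ./app

mpirun then reports on stderr which cores each rank was bound to, which makes it easy to verify whether an option combination gives the intended blocks.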

Re: [OMPI users] Debugging a crash

2021-01-29 Thread Gilles Gouaillardet via users
Diego, the mpirun command line starts 2 MPI tasks, but the error log mentions rank 56, so unless there is a copy/paste error, this is highly suspicious. I invite you to check the filesystem usage on this node, and make sure there is a similar amount of available space in /tmp and /dev/shm (or
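
A minimal form of the check Gilles suggests, run on the suspect node (the paths are the ones mentioned above):

    $ df -h /tmp /dev/shm

Comparing the output against a node where the job runs fine quickly shows whether either filesystem is short on space.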

[OMPI users] Debugging a crash

2021-01-29 Thread Diego Zuccato via users
Hello all. I'm having a problem with a job: if it gets scheduled on a specific node of our cluster, it fails with: -8<-- Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction,