By default, OMPI will bind your procs to a single core. You probably want to at least bind to socket (for NUMA reasons), or not bind at all if you want to use all the cores on the node. So either add "--bind-to socket" or "--bind-to none" to your cmd line, along the lines of the sketch below.
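For example, something like this (untested sketch -- the hostnames and OMP_NUM_THREADS=4 are taken from your commands below, so adjust to your own nodes and core count):

mpirun --host node1,node2 --map-by ppr:1:node --bind-to none -x OMP_NUM_THREADS=4 xhpl

or, if you want each rank and its OpenMP threads confined to one socket:

mpirun --host node1,node2 --map-by ppr:1:node --bind-to socket -x OMP_NUM_THREADS=4 xhpl

With one rank per node and the binding relaxed, the OpenMP threads spawned by OpenBLAS are free to use all the cores of that node rather than being pinned to the single core the rank was bound to.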
On Aug 3, 2020, at 1:33 AM, John Duffy via users <users@lists.open-mpi.org> wrote:

Hi

I’m experimenting with hybrid OpenMPI/OpenMP Linpack benchmarks on my small cluster, and I’m a bit confused as to how to invoke mpirun.

I have compiled/linked HPL-2.3 with OpenMPI and libopenblas-openmp using the GCC -fopenmp option on Ubuntu 20.04 64-bit.

With P=1 and Q=1 in HPL.dat, if I use…

mpirun -x OMP_NUM_THREADS=4 xhpl

top reports...

top - 08:03:59 up 1 day, 0 min,  1 user,  load average: 2.25, 1.23, 0.88
Tasks: 138 total,   2 running, 136 sleeping,   0 stopped,   0 zombie
%Cpu(s): 77.1 us, 22.2 sy,  0.0 ni,  0.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :   3793.3 total,    434.0 free,   2814.1 used,    545.2 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.    919.9 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 5787 john      20   0 2959408   2.6g   8128 R 354.0  69.1   2:10.43 xhpl
 5789 john      20   0  263352   9960   7440 S  14.2   0.3   0:07.42 xhpl
 5788 john      20   0  263352   9844   7320 S  13.9   0.3   0:07.19 xhpl
 5790 john      20   0  263356   9896   7376 S  13.6   0.3   0:07.17 xhpl

… which seems reasonable, but I don’t understand why there are 4 xhpl processes.

In anticipation of adding more nodes, if I use…

mpirun --host node1 --map-by ppr:1:node -x OMP_NUM_THREADS=4 xhpl

top reports...

top - 07:56:27 up 23:52,  1 user,  load average: 1.00, 0.98, 0.68
Tasks: 133 total,   2 running, 131 sleeping,   0 stopped,   0 zombie
%Cpu(s): 25.1 us,  0.0 sy,  0.0 ni, 74.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :   3793.3 total,    454.2 free,   2794.5 used,    544.7 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.    939.9 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 5770 john      20   0 2868700   2.5g   7668 R  99.7  68.7   5:20.37 xhpl

… a single xhpl process (as expected), but with only 25% CPU utilisation and no other processes running on the other 3 cores. It would appear OpenBLAS is not utilising the 4 cores as expected.

If I then scale it to 2 nodes, with P=1 and Q=2 in HPL.dat...

mpirun --host node1,node2 --map-by ppr:1:node -x OMP_NUM_THREADS=4 xhpl

… similarly, I get a single process on each node, with only 25% CPU utilisation.

Any advice/suggestions on how to invoke mpirun in a hybrid OpenMPI/OpenMP setup would be appreciated.

Kind regards
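For reference, the two-node P=1, Q=2 case above corresponds to process-grid lines in HPL.dat that look roughly like this (layout as in the stock HPL-2.3 template; the values are the ones quoted in the message):

1            # of process grids (P x Q)
1            Ps
2            Qs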