I suspect it is okay. Keep in mind that OMPI itself starts multiple progress 
threads, so that is likely what you are seeing. The binding pattern in the 
mpirun output looks correct, as the default would be to map-by socket and 
you asked that we bind-to core.
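
If you want to double-check what the kernel actually thinks each rank (and 
each of its threads) is bound to, something along these lines should work; 
the -np count, the executable name, and <PID> below are just placeholders:

# have mpirun print the bindings it applies (map-by socket, bind-to core)
mpirun --map-by socket --bind-to core --report-bindings -np 16 ./my_app

# process-level affinity mask as the kernel sees it
# (<PID> would be one of the vasp.para.intel PIDs from your ps output)
taskset -cp <PID>

# per-thread affinity: every TID under /proc/<PID>/task has its own mask
grep Cpus_allowed_list /proc/<PID>/task/*/status

Note that the PSR column from ps only shows the logical processor a task is 
currently running on, not its binding mask, so a rank bound to both hwthreads 
of a core can legitimately show either sibling; Cpus_allowed_list (or taskset) 
is the better thing to compare against the mpirun binding report.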


> On Jun 22, 2018, at 9:33 AM, Noam Bernstein <noam.bernst...@nrl.navy.mil> 
> wrote:
> 
> Hi - for the last couple of weeks, more or less since we did some kernel 
> updates, certain compute-intensive MPI jobs have been behaving oddly with 
> respect to speed - parts that should be quite fast sometimes (but not 
> consistently) take a long time, and re-running sometimes fixes the issue, 
> sometimes not.  I'm starting to suspect core binding problems, which I worry 
> will be difficult to debug, so I hoped to get some feedback on whether my 
> observations do indeed suggest that there's something wrong with the core 
> binding.
> 
> I'm running with the latest CentOS 6 kernel (2.6.32-696.30.1.el6.x86_64), 
> OpenMPI 3.1.0, on a dual-CPU, 8-core + HT Intel Xeon node.  The code is 
> compiled with ifort using "-mkl=sequential", and just to be certain 
> OMP_NUM_THREADS=1, so there should be no OpenMP parallelism.
> 
> The main question is: if I'm running 16 MPI tasks per node and look at the 
> PSR field from ps, should I see some simple sequence of numbers?
> 
> Here's the beginning of the per-core binding report I requested from mpirun 
> (--bind-to core):
> [compute-7-2:31036] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: 
> [BB/../../../../../../..][../../../../../../../..]
> [compute-7-2:31036] MCW rank 1 bound to socket 1[core 8[hwt 0-1]]: 
> [../../../../../../../..][BB/../../../../../../..]
> [compute-7-2:31036] MCW rank 2 bound to socket 0[core 1[hwt 0-1]]: 
> [../BB/../../../../../..][../../../../../../../..]
> [compute-7-2:31036] MCW rank 3 bound to socket 1[core 9[hwt 0-1]]: 
> [../../../../../../../..][../BB/../../../../../..]
> [compute-7-2:31036] MCW rank 4 bound to socket 0[core 2[hwt 0-1]]: 
> [../../BB/../../../../..][../../../../../../../..]
> [compute-7-2:31036] MCW rank 5 bound to socket 1[core 10[hwt 0-1]]: 
> [../../../../../../../..][../../BB/../../../../..]
> [compute-7-2:31036] MCW rank 6 bound to socket 0[core 3[hwt 0-1]]: 
> [../../../BB/../../../..][../../../../../../../..]
> 
> This is the PSR info from ps
>   PID PSR TTY          TIME CMD
> 31043   1 ?        00:00:34 vasp.para.intel
> 31045   2 ?        00:00:34 vasp.para.intel
> 31047   3 ?        00:00:34 vasp.para.intel
> 31049   4 ?        00:00:34 vasp.para.intel
> 31051   5 ?        00:00:34 vasp.para.intel
> 31055   7 ?        00:00:34 vasp.para.intel
> 31042   8 ?        00:00:34 vasp.para.intel
> 31046  10 ?        00:00:34 vasp.para.intel
> 31048  11 ?        00:00:34 vasp.para.intel
> 31052  13 ?        00:00:34 vasp.para.intel
> 31054  14 ?        00:00:34 vasp.para.intel
> 31053  22 ?        00:00:34 vasp.para.intel
> 31044  25 ?        00:00:34 vasp.para.intel
> 31050  28 ?        00:00:34 vasp.para.intel
> 31056  31 ?        00:00:34 vasp.para.intel
> 
> Does this output look reasonable? For any sensible way I can think of to 
> enumerate the 32 virtual cores, those numbers don't seem to correspond to one 
> MPI task per core. If this output isn't supposed to be meaningful given how 
> Open MPI does its binding, is there another tool that can tell me what cores 
> a running job is actually running on/bound to?
> 
> An additional bit of confusion is that "ps -mo pid,tid,fname,user,psr -p PID" 
> on one of those processes (which is supposed to be running without threaded 
> parallelism) reports 3 separate TIDs (which I think correspond to threads), 
> with 3 different PSR values that seem stable during the run but don't have 
> any obvious relation to one another (not P and P+1, or P and P+8, or P and 
> P+16).
> 
> 
>                 thanks,
>                 Noam
> _______________________________________________
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
