Re: [OMPI users] Conflicts between jobs running on the same node

2014-04-17 Thread Ralph Castain
Unfortunately, each invocation of mpirun has no knowledge of where the procs
have been placed and bound by any other invocation of mpirun. So the procs of
the two jobs are being bound to the same cores, causing contention.
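
You can confirm this by adding --report-bindings to each mpirun invocation;
both jobs should report binding their ranks to the same cores. Something
like the following (the executable name is just a placeholder):

    mpirun --report-bindings -n 4 ./your_app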

If you truly want to run two jobs at the same time on the same nodes, then
you should add "--bind-to none" to the mpirun command line. Each job will see
some performance impact relative to running bound on its own, but both jobs
will run far better than they do now when sharing nodes.
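
For example, in each of your Torque job scripts (again, substitute your own
executable):

    mpirun --bind-to none -n 4 ./your_app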

Ralph



On Thu, Apr 17, 2014 at 10:37 AM, Alfonso Sanchez <alfonso.sanc...@tyndall.ie>
wrote:

> Hi all,
>
> I've compiled OMPI 1.8 on an x86-64 Linux cluster using the PGI compilers
> v14.1 (I've also tried PGI v11.10 with the same result). I'm able to
> compile with the resulting mpicc/mpifort/etc. When running the codes,
> everything seems to work fine when there's only one job running on a given
> compute node. However, whenever a second job gets assigned to the same
> compute node, the CPU load of every process drops to half. I'm using PBS
> Torque. As an example:
>
> - Submit jobA to node1 via Torque, launched with mpirun -n 4.
>
> - All 4 processes of jobA show 100% CPU load.
>
> - Submit jobB to node1 via Torque, launched with mpirun -n 4.
>
> - All 8 processes (4 from jobA & 4 from jobB) show 50% CPU load.
>
> Moreover, whilst jobA or jobB alone would finish in about 30 minutes, when
> both jobs share the same node they have gone 14 hours without completing.
>
> I'm attaching config.log & the output of ompi_info --all (bzipped).
>
> Some more info:
>
> $> ompi_info | grep tm
>
> MCA ess: tm (MCA v2.0, API v3.0, Component v1.8)
> MCA plm: tm (MCA v2.0, API v2.0, Component v1.8)
> MCA ras: tm (MCA v2.0, API v2.0, Component v1.8)
>
> Sorry if this is a common problem; I've searched for posts discussing
> similar issues but haven't been able to find any.
>
> Thanks for your help,
> Alfonso


[OMPI users] Conflicts between jobs running on the same node

2014-04-17 Thread Alfonso Sanchez
Hi all,

I've compiled OMPI 1.8 on an x86-64 Linux cluster using the PGI compilers
v14.1 (I've also tried PGI v11.10 with the same result). I'm able to compile
with the resulting mpicc/mpifort/etc. When running the codes, everything seems
to work fine when there's only one job running on a given compute node.
However, whenever a second job gets assigned to the same compute node, the CPU
load of every process drops to half. I'm using PBS Torque. As an example:

- Submit jobA to node1 via Torque, launched with mpirun -n 4.

- All 4 processes of jobA show 100% CPU load.

- Submit jobB to node1 via Torque, launched with mpirun -n 4.

- All 8 processes (4 from jobA & 4 from jobB) show 50% CPU load.

Moreover, whilst jobA or jobB alone would finish in about 30 minutes, when
both jobs share the same node they have gone 14 hours without completing.
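
For reference, the job scripts are roughly of this form (the resource request
and executable name are simplified placeholders):

    #PBS -l nodes=1:ppn=4
    cd $PBS_O_WORKDIR
    mpirun -n 4 ./my_app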

I'm attaching config.log & the output of ompi_info --all (bzipped).

Some more info:

$> ompi_info | grep tm

MCA ess: tm (MCA v2.0, API v3.0, Component v1.8)
MCA plm: tm (MCA v2.0, API v2.0, Component v1.8)
MCA ras: tm (MCA v2.0, API v2.0, Component v1.8)

Sorry if this is a common problem; I've searched for posts discussing similar
issues but haven't been able to find any.

Thanks for your help,
Alfonso

Attachments:
  config.log.bz2
  ompi_output.log.bz2