Thank you, Ralph, for the advice. I will try 1.8.4 as soon as I can.
The first Torque job asks for nodes=1:ppn=16:whatever
The second Torque job asks for nodes=1:ppn=16:whatever
Both jobs happen to land on the same 64-core node, each running on its own set of 16 cores: 0-15 and 16-31 respectively.
As soon as the second job starts, the core utilisation reported by top drops from 100% to 50% for both jobs. If I qdel the second job, the first one immediately recovers to 100%.
The behaviour reported by top is an accurate reflection of the progress of the 
calculations.
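
A quick way to double-check what the two sets of processes are actually bound to, since overlapping bindings would produce exactly this 50% symptom (the binary name montecarlo is a placeholder):

    # list the cores each rank process is allowed to run on
    for pid in $(pgrep -u $USER montecarlo); do
        taskset -cp $pid
    done

If the reported core lists of the two jobs overlap, the ranks are time-sharing the same cores rather than spreading across all 32.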
Greg
-------------------------------------------------------------------------------------------------------

Message: 1
Date: Wed, 28 Jan 2015 05:39:49 +0000
From: "DOHERTY, Greg" <g...@ansto.gov.au>
To: "us...@open-mpi.org" <us...@open-mpi.org>
Subject: [OMPI users] 1.8.1 [SEC=UNCLASSIFIED]

This might or might not be related to Open MPI 1.8.1; I have not seen the problem with the same program under previous versions of Open MPI.

We have 64-core AMD nodes. I have recently recompiled a large Monte Carlo program with version 1.8.1 of Open MPI. Users start this program through Maui/Torque, asking for a number of cores, usually on only one node. A single run of the program asking for any number of cores up to 64 runs with full CPU utilisation on each core. A user might start a run asking for 16 cores - fine. Then he starts a second run on the same node, asking for another 16 cores. Immediately the CPU utilisation on all cores of the first job drops to 50%, as it does for the newly started job. If a different program is using the remaining 32 cores of the same node at the same time, the CPU utilisation of its cores is unaffected. If we qdel the second 16-core job, the CPU utilisation of each core of the first job immediately climbs back to 100%. Any suggestions, please, on where I might start looking for the solution to this problem?
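
For reference, a minimal Torque submission script of the kind described here might look like this (program and input file names are placeholders):

    #!/bin/bash
    #PBS -l nodes=1:ppn=16
    cd $PBS_O_WORKDIR
    # mpirun takes its slot count from the Torque allocation
    mpirun ./montecarlo input.dat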
Greg Doherty
ANSTO

------------------------------

Message: 2
Date: Wed, 28 Jan 2015 06:16:33 -0600
From: Ralph Castain <r...@open-mpi.org>
To: Open MPI Users <us...@open-mpi.org>
Subject: Re: [OMPI users] 1.8.1 [SEC=UNCLASSIFIED]

I'm not entirely clear on the sequence of commands here. Is the user requesting a new allocation from Maui/Torque for each run? If so, it's possible we aren't correctly picking up the external binding from Torque. That would likely be a bug we would need to fix.
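
A quick way to check this is mpirun's --report-bindings option, which prints one line per rank showing the cores it was bound to (program name is a placeholder):

    # run each job with binding reports enabled
    mpirun --report-bindings ./montecarlo input.dat

If both jobs report bindings on the same cores, the external binding from Torque is not being honoured.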

Or is the user obtaining a single allocation of the entire node, and then using mpirun to start multiple jobs in parallel? In that case, the issue is that the user needs to tell mpirun which cpus to confine itself to; otherwise it will always assume that all cpus belong to it, which leads to overloading the lower core numbers. The problem here can be resolved by adding --cpu-set 0,1,2 (or whatever pattern you like) to each command line, as sketched below.
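
For the 16+16 case described, something along these lines (assuming --cpu-set accepts ranges; otherwise list the cores individually):

    # first run, confined to cores 0-15
    mpirun --cpu-set 0-15 -np 16 ./montecarlo run1.dat &
    # second run, confined to cores 16-31
    mpirun --cpu-set 16-31 -np 16 ./montecarlo run2.dat &
    wait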

You might also consider updating to 1.8.4, as we did fix some integration bugs. I don't recall anything specific to this question, but my memory could be at fault.

Ralph


On Tue, Jan 27, 2015 at 11:39 PM, DOHERTY, Greg <g...@ansto.gov.au> wrote:

> This might or might not be related to openmpi 1.8.1. [...]
>
> Greg Doherty
> ANSTO

------------------------------


End of users Digest, Vol 3106, Issue 1
**************************************
