Re: [OMPI users] Clarification about OpenMPI, slurm and PMI interface

2014-08-21 Thread Filippo Spiga
Dear Mike, it sounds good... the description fits my purposes... I really missed 
this when I was reading the srun man page! I will give it a try.

Thanks to everybody for the help and support!

F

> On Aug 21, 2014, at 7:58 PM, Mike Dubman  wrote:
> 
> Hi Filippo,
> 
> I think you can use SLURM_LOCALID var (at least with slurm v14.03.4-2)
> 
> $srun -N2 --ntasks-per-node 3  env |grep SLURM_LOCALID
> SLURM_LOCALID=1
> SLURM_LOCALID=2
> SLURM_LOCALID=0
> SLURM_LOCALID=0
> SLURM_LOCALID=1
> SLURM_LOCALID=2
> $
> 
> Kind Regards,
> M
> 
> 
> On Thu, Aug 21, 2014 at 9:27 PM, Ralph Castain  wrote:
> 
> On Aug 21, 2014, at 10:58 AM, Filippo Spiga  wrote:
> 
>> Dear Ralph
>> 
>> On Aug 21, 2014, at 2:30 PM, Ralph Castain  wrote:
>>> I'm afraid that none of the mapping or binding options would be available 
>>> under srun as those only work via mpirun. You can pass MCA params in the 
>>> environment of course, or in default MCA param files.
>> 
>> I understand. I hope to still be able to pass the LAMA MCA options as 
>> environment variables
> 
> I'm afraid not - LAMA doesn't exist in Slurm, only in mpirun itself
> 
>> I fear by default srun completely takes over the process binding.
>> 
>> 
>> I got another problem. On my cluster I have two GPUs and two Ivy Bridge 
>> processors. To maximize the PCIe bandwidth I want to allocate GPU 0 to 
>> socket 0 and GPU 1 to socket 1. I use a script like this
>> 
>> #!/bin/bash
>> lrank=$OMPI_COMM_WORLD_LOCAL_RANK
>> case ${lrank} in
>> 0)
>>  export CUDA_VISIBLE_DEVICES=0
>>  "$@"
>> ;;
>> 1)
>>  export CUDA_VISIBLE_DEVICES=1
>>  "$@"
>> ;;
>> esac
>> 
>> 
>> But OMPI_COMM_WORLD_LOCAL_RANK is not defined if I use srun with PMI2 as the 
>> launcher. Is there any equivalent option/environment variable that will help 
>> me achieve the same result?
> 
> I'm afraid not - that's something we added. I'm unaware of any similar envar 
> from Slurm, I'm afraid
> 
> 
>> 
>> Thanks in advance!
>> F
>> 
>> --
>> Mr. Filippo SPIGA, M.Sc.
>> http://filippospiga.info ~ skype: filippo.spiga
>> 
>> «Nobody will drive us out of Cantor's paradise.» ~ David Hilbert
>> 
>> *
>> Disclaimer: "Please note this message and any attachments are CONFIDENTIAL 
>> and may be privileged or otherwise protected from disclosure. The contents 
>> are not to be disclosed to anyone other than the addressee. Unauthorized 
>> recipients are requested to preserve this confidentiality and to advise the 
>> sender immediately of any error in transmission."
>> 
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/users/2014/08/25119.php
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2014/08/25120.php
> 
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2014/08/25121.php

--
Mr. Filippo SPIGA, M.Sc.
http://filippospiga.info ~ skype: filippo.spiga

«Nobody will drive us out of Cantor's paradise.» ~ David Hilbert

*
Disclaimer: "Please note this message and any attachments are CONFIDENTIAL and 
may be privileged or otherwise protected from disclosure. The contents are not 
to be disclosed to anyone other than the addressee. Unauthorized recipients are 
requested to preserve this confidentiality and to advise the sender immediately 
of any error in transmission."




Re: [OMPI users] [EXTERNAL] Re: building openmpi 1.8.1 with intel 14.0.1

2014-08-21 Thread Bosler, Peter Andrew
Update: I got both OpenMPI 1.8.1 and 1.8.2rc4 to configure and build on my
Mac laptop running OS X 10.9.4.

Neither works on the 2-day old Mac Pro, but in investigating this I found
other problems not related to OpenMPI - probably hardware or OS related.
Time to exercise the warranty.

@Ralph : Thanks for the suggestion
@Gus : You are correct, Xcode continues to be a prerequisite for any
development on Mac.

Pete


On 8/21/14, 2:04 PM, "Ralph Castain"  wrote:

>FWIW: I just tried on my Mac with the Intel 14.0 compilers, and it
>configured and built just fine. However, that was with the current state
>of the 1.8 branch (the upcoming 1.8.2 release), so you might want to try
>that in case there is a difference.
>
>
>
>On Aug 21, 2014, at 12:59 PM, Gus Correa  wrote:
>
>> Hi Peter
>> 
>> If I remember right from my compilation of OMPI on a Mac
>> years ago, you need to have X-Code installed, in case you don't.
>> 
>> If vampir-trace is the only problem,
>> you can disable it when you configure OMPI (--disable-vt).
>> 
>> My two cents,
>> Gus Correa
>> 
>> 
>> On 08/21/2014 03:35 PM, Bosler, Peter Andrew wrote:
>>> Good afternoon,
>>> 
>>> I'm having trouble configuring OpenMPI for use with the Intel
>>> compilers. I run the command "./configure --prefix=/opt/openmpi/intel CC=icc
>>> CXX=icpc FC=ifort 2>&1 | tee ~/openmpi-config.out" and I notice three
>>> problems:
>>> 
>>> 1. I get two instances of "Report this to
>>>    http://www.open-mpi.org/community/help" with regard to netinet/in.h
>>>    and netinet/tcp.h in the output (attached)
>>> 2. I receive a note about Vampire Trace being broken and finally a
>>>    failed configure warning
>>> 3. Configure ultimately fails because it failed to build GNU libltdl.
>>> 
>>> I'm running Mac OS X 10.9.4 on a 3.5 GHz 6-core Intel Xeon E5 with Intel
>>> compilers version 14.0.1. The OpenMPI version I'm trying to build is
>>> 1.8.1.
>>> 
>>> My environment is set with LD_LIBRARY_PATH=/opt/intel/lib/intel64
>>> 
>>> As an aside, if there are any configuration options for OpenMPI that
>>> will take special advantage of the Xeon processor, I would love to know
>>> more about them.
>>> 
>>> Thank you very much for your time.
>>> 
>>> Pete Bosler
>>> 
>>> 
>>> 
>>> 
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> Link to this post:
>>>http://www.open-mpi.org/community/lists/users/2014/08/25122.php
>>> 
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post:
>>http://www.open-mpi.org/community/lists/users/2014/08/25123.php
>
>___
>users mailing list
>us...@open-mpi.org
>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>Link to this post:
>http://www.open-mpi.org/community/lists/users/2014/08/25124.php



Re: [OMPI users] building openmpi 1.8.1 with intel 14.0.1

2014-08-21 Thread Ralph Castain
FWIW: I just tried on my Mac with the Intel 14.0 compilers, and it configured 
and built just fine. However, that was with the current state of the 1.8 branch 
(the upcoming 1.8.2 release), so you might want to try that in case there is a 
difference.



On Aug 21, 2014, at 12:59 PM, Gus Correa  wrote:

> Hi Peter
> 
> If I remember right from my compilation of OMPI on a Mac
> years ago, you need to have X-Code installed, in case you don't.
> 
> If vampir-trace is the only problem,
> you can disable it when you configure OMPI (--disable-vt).
> 
> My two cents,
> Gus Correa
> 
> 
> On 08/21/2014 03:35 PM, Bosler, Peter Andrew wrote:
>> Good afternoon,
>> 
>> I’m having trouble configuring OpenMPI for use with the Intel compilers.
>>  I run the command “./configure --prefix=/opt/openmpi/intel CC=icc
>> CXX=icpc FC=ifort 2>&1 | tee ~/openmpi-config.out” and I notice three
>> problems:
>> 
>> 1. I get two instances of “Report this to
>>http://www.open-mpi.org/community/help” with regard to netinet/in.h
>>>and netinet/tcp.h in the output (attached)
>> 2. I receive a note about Vampire Trace being broken and finally a
>>failed configure warning
>> 3. Configure ultimately fails because it failed to build GNU libltdl.
>> 
>> I’m running Mac OS X 10.9.4 on a 3.5 Ghz 6-core Intel Xeon E5 with Intel
>> compilers version 14.0.1.  The OpenMPI version I’m trying to build is
>> 1.8.1.
>> 
>> My environment is set with LD_LIBRARY_PATH=/opt/intel/lib/intel64
>> 
>> As an aside, if there are any configuration options for OpenMPI that
>> will take special advantage of the Xeon processor, I would love to know
>> more about them.
>> 
>> Thank you very much for your time.
>> 
>> Pete Bosler
>> 
>> 
>> 
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/users/2014/08/25122.php
>> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2014/08/25123.php



Re: [OMPI users] building openmpi 1.8.1 with intel 14.0.1

2014-08-21 Thread Gus Correa

Hi Peter

If I remember right from my compilation of OMPI on a Mac
years ago, you need to have X-Code installed, in case you don't.

If vampir-trace is the only problem,
you can disable it when you configure OMPI (--disable-vt).
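
As an illustration only (untested on your box), that would just be your same
configure line with VampirTrace switched off, e.g.

  ./configure --prefix=/opt/openmpi/intel CC=icc CXX=icpc FC=ifort \
    --disable-vt 2>&1 | tee ~/openmpi-config.out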

My two cents,
Gus Correa


On 08/21/2014 03:35 PM, Bosler, Peter Andrew wrote:

Good afternoon,

I’m having trouble configuring OpenMPI for use with the Intel compilers.
  I run the command “./configure --prefix=/opt/openmpi/intel CC=icc
CXX=icpc FC=ifort 2>&1 | tee ~/openmpi-config.out” and I notice three
problems:

 1. I get two instances of “Report this to
http://www.open-mpi.org/community/help” with regard to netinet/in.h
and netinet/tcp.h in the output (attached)
 2. I receive a note about Vampire Trace being broken and finally a
failed configure warning
 3. Configure ultimately fails because it failed to build GNU libltdl.

I’m running Mac OS X 10.9.4 on a 3.5 Ghz 6-core Intel Xeon E5 with Intel
compilers version 14.0.1.  The OpenMPI version I’m trying to build is
1.8.1.

My environment is set with LD_LIBRARY_PATH=/opt/intel/lib/intel64

As an aside, if there are any configuration options for OpenMPI that
will take special advantage of the Xeon processor, I would love to know
more about them.

Thank you very much for your time.

Pete Bosler




___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2014/08/25122.php





[OMPI users] building openmpi 1.8.1 with intel 14.0.1

2014-08-21 Thread Bosler, Peter Andrew
Good afternoon,

I'm having trouble configuring OpenMPI for use with the Intel compilers.  I run 
the command "./configure -prefix=/opt/openmpi/intel CC=icc CXX=icpc FC=ifort 
2>&1 | tee ~/openmpi-config.out" and I notice three problems:

  1.  I get two instances of "Report this to 
http://www.open-mpi.org/community/help" with regard to netinet/in.h and 
netinet/tcp.h in the output (attached)
  2.  I receive a note about Vampire Trace being broken and finally a failed 
configure warning
  3.  Configure ultimately fails because it failed to build GNU libltdl.

I'm running Mac OS X 10.9.4 on a 3.5 Ghz 6-core Intel Xeon E5 with Intel 
compilers version 14.0.1.  The OpenMPI version I'm trying to build is 1.8.1.

My environment is set with LD_LIBRARY_PATH=/opt/intel/lib/intel64

As an aside, if there are any configuration options for OpenMPI that will take 
special advantage of the Xeon processor, I would love to know more about them.

Thank you very much for your time.

Pete Bosler




ompi-config-output.tar.bz2
Description: ompi-config-output.tar.bz2


Re: [OMPI users] Clarification about OpenMPI, slurm and PMI interface

2014-08-21 Thread Mike Dubman
Hi Filippo,

I think you can use SLURM_LOCALID var (at least with slurm v14.03.4-2)

$srun -N2 --ntasks-per-node 3  env |grep SLURM_LOCALID
SLURM_LOCALID=1
SLURM_LOCALID=2
SLURM_LOCALID=0
SLURM_LOCALID=0
SLURM_LOCALID=1
SLURM_LOCALID=2
$
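
As an illustration, the GPU wrapper script from your earlier mail could key
off SLURM_LOCALID when launching with srun instead of mpirun - a rough sketch
only, assuming the same one-GPU-per-local-rank layout (the wrapper and
application names below are placeholders):

#!/bin/bash
# select one GPU per local rank using Slurm's per-node rank id
lrank=$SLURM_LOCALID
case ${lrank} in
0)
 export CUDA_VISIBLE_DEVICES=0
 "$@"
;;
1)
 export CUDA_VISIBLE_DEVICES=1
 "$@"
;;
esac

# launched e.g. as: srun --mpi=pmi2 --ntasks-per-node 2 ./gpu_wrap.sh ./my_app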

Kind Regards,
M


On Thu, Aug 21, 2014 at 9:27 PM, Ralph Castain  wrote:

>
> On Aug 21, 2014, at 10:58 AM, Filippo Spiga 
> wrote:
>
> Dear Ralph
>
> On Aug 21, 2014, at 2:30 PM, Ralph Castain  wrote:
>
> I'm afraid that none of the mapping or binding options would be available
> under srun as those only work via mpirun. You can pass MCA params in the
> environment of course, or in default MCA param files.
>
>
> I understand. I hope to still be able to pass the LAMA MCA options as
> environment variables
>
>
> I'm afraid not - LAMA doesn't exist in Slurm, only in mpirun itself
>
> I fear by default srun completely takes over the process binding.
>
>
> I got another problem. On my cluster I have two GPUs and two Ivy Bridge
> processors. To maximize the PCIe bandwidth I want to allocate GPU 0 to
> socket 0 and GPU 1 to socket 1. I use a script like this
>
> #!/bin/bash
> lrank=$OMPI_COMM_WORLD_LOCAL_RANK
> case ${lrank} in
> 0)
>  export CUDA_VISIBLE_DEVICES=0
>  "$@"
> ;;
> 1)
>  export CUDA_VISIBLE_DEVICES=1
>  "$@"
> ;;
> esac
>
>
> But OMPI_COMM_WORLD_LOCAL_RANK is not defined if I use srun with PMI2 as the
> launcher. Is there any equivalent option/environment variable that will help
> me achieve the same result?
>
>
> I'm afraid not - that's something we added. I'm unaware of any similar
> envar from Slurm, I'm afraid
>
>
>
> Thanks in advance!
> F
>
> --
> Mr. Filippo SPIGA, M.Sc.
> http://filippospiga.info ~ skype: filippo.spiga
>
> «Nobody will drive us out of Cantor's paradise.» ~ David Hilbert
>
> *
> Disclaimer: "Please note this message and any attachments are CONFIDENTIAL
> and may be privileged or otherwise protected from disclosure. The contents
> are not to be disclosed to anyone other than the addressee. Unauthorized
> recipients are requested to preserve this confidentiality and to advise the
> sender immediately of any error in transmission."
>
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2014/08/25119.php
>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2014/08/25120.php
>


Re: [OMPI users] Clarification about OpenMPI, slurm and PMI interface

2014-08-21 Thread Ralph Castain

On Aug 21, 2014, at 10:58 AM, Filippo Spiga  wrote:

> Dear Ralph
> 
> On Aug 21, 2014, at 2:30 PM, Ralph Castain  wrote:
>> I'm afraid that none of the mapping or binding options would be available 
>> under srun as those only work via mpirun. You can pass MCA params in the 
>> environment of course, or in default MCA param files.
> 
> I understand. I hope to still be able to pass the LAMA MCA options as 
> environment variables

I'm afraid not - LAMA doesn't exist in Slurm, only in mpirun itself

> I fear by default srun completely takes over the process binding.
> 
> 
> I got another problem. On my cluster I have two GPUs and two Ivy Bridge 
> processors. To maximize the PCIe bandwidth I want to allocate GPU 0 to socket 
> 0 and GPU 1 to socket 1. I use a script like this
> 
> #!/bin/bash
> lrank=$OMPI_COMM_WORLD_LOCAL_RANK
> case ${lrank} in
> 0)
>  export CUDA_VISIBLE_DEVICES=0
>  "$@"
> ;;
> 1)
>  export CUDA_VISIBLE_DEVICES=1
>  "$@"
> ;;
> esac
> 
> 
> But OMPI_COMM_WORLD_LOCAL_RANK is not defined if I use srun with PMI2 as the 
> launcher. Is there any equivalent option/environment variable that will help 
> me achieve the same result?

I'm afraid not - that's something we added. I'm unaware of any similar envar 
from Slurm, I'm afraid


> 
> Thanks in advance!
> F
> 
> --
> Mr. Filippo SPIGA, M.Sc.
> http://filippospiga.info ~ skype: filippo.spiga
> 
> «Nobody will drive us out of Cantor's paradise.» ~ David Hilbert
> 
> *
> Disclaimer: "Please note this message and any attachments are CONFIDENTIAL 
> and may be privileged or otherwise protected from disclosure. The contents 
> are not to be disclosed to anyone other than the addressee. Unauthorized 
> recipients are requested to preserve this confidentiality and to advise the 
> sender immediately of any error in transmission."
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2014/08/25119.php



Re: [OMPI users] Clarification about OpenMPI, slurm and PMI interface

2014-08-21 Thread Filippo Spiga
Dear Ralph

On Aug 21, 2014, at 2:30 PM, Ralph Castain  wrote:
> I'm afraid that none of the mapping or binding options would be available 
> under srun as those only work via mpirun. You can pass MCA params in the 
> environment of course, or in default MCA param files.

I understand. I hope to still be able to pass the LAMA MCA options as 
environment variables... I fear by default srun completely takes over the 
process binding.


I got another problem. On my cluster I have two GPUs and two Ivy Bridge 
processors. To maximize the PCIe bandwidth I want to allocate GPU 0 to socket 0 
and GPU 1 to socket 1. I use a script like this

#!/bin/bash
lrank=$OMPI_COMM_WORLD_LOCAL_RANK
case ${lrank} in
0)
 export CUDA_VISIBLE_DEVICES=0
 "$@"
;;
1)
 export CUDA_VISIBLE_DEVICES=1
 "$@"
;;
esac
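
(For reference, a wrapper like this would typically be launched as something
along the lines of "mpirun -np 2 ./gpu_wrap.sh ./my_app", where the wrapper
and application names are only placeholders.)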


But OMPI_COMM_WORLD_LOCAL_RANK is not defined if I use srun with PMI2 as the 
launcher. Is there any equivalent option/environment variable that will help me 
achieve the same result?

Thanks in advance!
F

--
Mr. Filippo SPIGA, M.Sc.
http://filippospiga.info ~ skype: filippo.spiga

«Nobody will drive us out of Cantor's paradise.» ~ David Hilbert

*
Disclaimer: "Please note this message and any attachments are CONFIDENTIAL and 
may be privileged or otherwise protected from disclosure. The contents are not 
to be disclosed to anyone other than the addressee. Unauthorized recipients are 
requested to preserve this confidentiality and to advise the sender immediately 
of any error in transmission."




Re: [OMPI users] OpenMPI 1.8.1 to 1.8.2rc4

2014-08-21 Thread Ralph Castain
Should not be required (unless they are statically built) as we do strive to 
maintain ABI within a series

On Aug 21, 2014, at 9:39 AM, Maxime Boissonneault 
 wrote:

> Hi,
> Would you say that software compiled using OpenMPI 1.8.1 needs to be 
> recompiled using OpenMPI 1.8.2rc4 to work properly?
> 
> Maxime
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2014/08/25117.php



[OMPI users] OpenMPI 1.8.1 to 1.8.2rc4

2014-08-21 Thread Maxime Boissonneault

Hi,
Would you say that software compiled using OpenMPI 1.8.1 needs to be 
recompiled using OpenMPI 1.8.2rc4 to work properly?


Maxime


Re: [OMPI users] Running a hybrid MPI+openMP program

2014-08-21 Thread Reuti
Am 21.08.2014 um 16:50 schrieb Reuti:

> Am 21.08.2014 um 16:00 schrieb Ralph Castain:
> 
>> 
>> On Aug 21, 2014, at 6:54 AM, Reuti  wrote:
>> 
>>> Am 21.08.2014 um 15:45 schrieb Ralph Castain:
>>> 
 On Aug 21, 2014, at 2:51 AM, Reuti  wrote:
 
> Am 20.08.2014 um 23:16 schrieb Ralph Castain:
> 
>> 
>> On Aug 20, 2014, at 11:16 AM, Reuti  wrote:
>> 
>>> Am 20.08.2014 um 19:05 schrieb Ralph Castain:
>>> 
> 
> Aha, this is quite interesting - how do you do this: scanning the 
> /proc//status or alike? What happens if you don't find enough 
> free cores as they are used up by other applications already?
> 
 
 Remember, when you use mpirun to launch, we launch our own daemons 
 using the native launcher (e.g., qsub). So the external RM will bind 
 our daemons to the specified cores on each node. We use hwloc to 
 determine what cores our daemons are bound to, and then bind our own 
 child processes to cores within that range.
>>> 
>>> Thx for reminding me of this. Indeed, I mixed up two different aspects 
>>> in this discussion.
>>> 
>>> a) What will happen in case no binding was done by the RM (hence Open 
>>> MPI could use all cores) and two Open MPI jobs (or something completely 
>>> different besides one Open MPI job) are running on the same node (due 
>>> to the Tight Integration with two different Open MPI directories in 
>>> /tmp and two `orted`, unique for each job)? Will the second Open MPI 
>>> job know what the first Open MPI job used up already? Or will both use 
>>> the same set of cores as "-bind-to none" can't be set in the given 
>>> `mpiexec` command because of "-map-by slot:pe=$OMP_NUM_THREADS" was 
>>> used - which triggers "-bind-to core" indispensable and can't be 
>>> switched off? I see the same cores being used for both jobs.
>> 
>> Yeah, each mpirun executes completely independently of the other, so 
>> they have no idea what the other is doing. So the cores will be 
>> overloaded. Multi-pe's requires bind-to-core otherwise there is no way 
>> to implement the request
> 
> Yep, and so it's no option in a mixed cluster. Why would it hurt to allow 
> "-bind-to none" here?
 
 Guess I'm confused here - what does pe=N mean if we bind-to none?? If you 
 are running on a mixed cluster and don't want binding, then just say 
 bind-to none and leave the pe argument out entirely as it wouldn't mean 
 anything unless you are bound
>>> 
>>> It would mean: divide the overall number of slots/cores in the machinefile 
>>> by N (i.e. $OMP_NUM_THREADS).
>>> 
>>> - Request made to the queuing system: I need 80 cores in total.
>>> - The machinefile will contain 80 cores
>>> - Open MPI will divide it by N, i.e. 8 here
>>> - Open MPI will start only 10 processes, one on each node
>>> - The application will use 8 threads per started MPI process
>> 
>> I see - so you were talking about the case where the user doesn't provide 
>> the -np N option
> 
> Yes. Even if -np is specified: AFAICS Open MPI fills up the given slots in 
> the machinefile from the beginning (first nodes get all the processes, 
> remaining nodes are free). Making it in a round-robin way would work better 
> for this case.
> 
> 
>> and we need to compute the number of procs to start. Okay, the change you 
>> requested below will fix that one too. I can make that easily enough.
> 
> Therefore I wanted to start a discussion about it (at that time I wasn't 
> aware of the "-map-by slot:pe=N" option), as I have no final syntax which 
> would cover all cases. Someone may want the binding by the "-map-by 
> slot:pe=N". How can this be specified, while keeping an easy 
> tight-integration for users who don't want any binding at all.
> 
> The boundary conditions are:
> 
> - the job is running inside a queuingsystem
> - the user requests the overall amount of slots to the queuingsystem
> - hence the machinefile has entries for all slots

Ok, less typos:

BTW: The fact that the queuing system is set up in such a way that the 
machinefile contains a multiple of $OMP_NUM_THREADS per node is a premise and 
can be seen as given here - otherwise generate an error. It's up to the admin of 
the queuing system to configure it in such a way.
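
A simple guard in the job script could enforce that premise, roughly (a sketch
only, using SGE's $NSLOTS):

if [ $(( NSLOTS % OMP_NUM_THREADS )) -ne 0 ]; then
  echo "NSLOTS ($NSLOTS) is not a multiple of OMP_NUM_THREADS ($OMP_NUM_THREADS)" >&2
  exit 1
fi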

-- Reuti


> - the user sets OMP_NUM_THREADS
> 
> case 1) no interest in any binding, other jobs may exist on the nodes
> 
> case 2) user wants binding: i.e. $OMP_NUM_THREADS cores assigned to each MPI 
> process, maybe with "-map-by slot:pe=N"
> 
> In both cases only (overall amount of slots) / ($OMP_NUM_THREADS) MPI 
> processes should be started, not (overall amount of slots) processes AFAICS.
> 
> -- Reuti
> 
> 
>>> -- Reuti
>>> 
>>> 
> 
> 
>>> Altering the machinefile instead: the processes are not 

Re: [OMPI users] Running a hybrid MPI+openMP program

2014-08-21 Thread Reuti
Am 21.08.2014 um 16:50 schrieb Reuti:

> Am 21.08.2014 um 16:00 schrieb Ralph Castain:
> 
>> 
>> On Aug 21, 2014, at 6:54 AM, Reuti  wrote:
>> 
>>> Am 21.08.2014 um 15:45 schrieb Ralph Castain:
>>> 
 On Aug 21, 2014, at 2:51 AM, Reuti  wrote:
 
> Am 20.08.2014 um 23:16 schrieb Ralph Castain:
> 
>> 
>> On Aug 20, 2014, at 11:16 AM, Reuti  wrote:
>> 
>>> Am 20.08.2014 um 19:05 schrieb Ralph Castain:
>>> 
> 
> Aha, this is quite interesting - how do you do this: scanning the 
> /proc//status or alike? What happens if you don't find enough 
> free cores as they are used up by other applications already?
> 
 
 Remember, when you use mpirun to launch, we launch our own daemons 
 using the native launcher (e.g., qsub). So the external RM will bind 
 our daemons to the specified cores on each node. We use hwloc to 
 determine what cores our daemons are bound to, and then bind our own 
 child processes to cores within that range.
>>> 
>>> Thx for reminding me of this. Indeed, I mixed up two different aspects 
>>> in this discussion.
>>> 
>>> a) What will happen in case no binding was done by the RM (hence Open 
>>> MPI could use all cores) and two Open MPI jobs (or something completely 
>>> different besides one Open MPI job) are running on the same node (due 
>>> to the Tight Integration with two different Open MPI directories in 
>>> /tmp and two `orted`, unique for each job)? Will the second Open MPI 
>>> job know what the first Open MPI job used up already? Or will both use 
>>> the same set of cores as "-bind-to none" can't be set in the given 
>>> `mpiexec` command because of "-map-by slot:pe=$OMP_NUM_THREADS" was 
>>> used - which triggers "-bind-to core" indispensable and can't be 
>>> switched off? I see the same cores being used for both jobs.
>> 
>> Yeah, each mpirun executes completely independently of the other, so 
>> they have no idea what the other is doing. So the cores will be 
>> overloaded. Multi-pe's requires bind-to-core otherwise there is no way 
>> to implement the request
> 
> Yep, and so it's no option in a mixed cluster. Why would it hurt to allow 
> "-bind-to none" here?
 
 Guess I'm confused here - what does pe=N mean if we bind-to none?? If you 
 are running on a mixed cluster and don't want binding, then just say 
 bind-to none and leave the pe argument out entirely as it wouldn't mean 
 anything unless you are bound
>>> 
>>> It would mean: divide the overall number of slots/cores in the machinefile 
>>> by N (i.e. $OMP_NUM_THREADS).
>>> 
>>> - Request made to the queuing system: I need 80 cores in total.
>>> - The machinefile will contain 80 cores
>>> - Open MPI will divide it by N, i.e. 8 here
>>> - Open MPI will start only 10 processes, one on each node
>>> - The application will use 8 threads per started MPI process
>> 
>> I see - so you were talking about the case where the user doesn't provide 
>> the -np N option
> 
> Yes. Even if -np is specified: AFAICS Open MPI fills up the given slots in 
> the machinefile from the beginning (first nodes get all the processes, 
> remaining nodes are free). Making it in a round-robin way would work better 
> for this case.
> 
> 
>> and we need to compute the number of procs to start. Okay, the change you 
>> requested below will fix that one too. I can make that easily enough.
> 
> Therefore I wanted to start a discussion about it (at that time I wasn't 
> aware of the "-map-by slot:pe=N" option), as I have no final syntax which 
> would cover all cases. Someone may want the binding by the "-map-by 
> slot:pe=N". How can this be specified, while keeping an easy 
> tight-integration for users who don't want any binding at all.
> 
> The boundary conditions are:
> 
> - the job is running inside a queuingsystem
> - the user requests the overall amount of slots to the queuingsystem
> - hence the machinefile has entries for all slots

BTW: The fact that the queuing system is set up in such a way that the machinefile 
contains a multiple of $OMP_NUM_THREADS per node is a premise and can be seen 
as given here - otherwise generate an error. It's up to the admin of the 
queuing system to configure it in such a way.

-- Reuti


> - the user sets OMP_NUM_THREADS
> 
> case 1) no interest in any binding, other jobs may exist on the nodes
> 
> case 2) user wants binding: i.e. $OMP_NUM_THREADS cores assigned to each MPI 
> process, maybe with "-map-by slot:pe=N"
> 
> In both cases only (overall amount of slots) / ($OMP_NUM_THREADS) MPI 
> processes should be started, not (overall amount of slots) processes AFAICS.
> 
> -- Reuti
> 
> 
>>> -- Reuti
>>> 
>>> 
> 
> 
>>> Altering the machinefile instead: the processes are not bound to any 
>>> 

Re: [OMPI users] Running a hybrid MPI+openMP program

2014-08-21 Thread Reuti
Am 21.08.2014 um 16:00 schrieb Ralph Castain:

> 
> On Aug 21, 2014, at 6:54 AM, Reuti  wrote:
> 
>> Am 21.08.2014 um 15:45 schrieb Ralph Castain:
>> 
>>> On Aug 21, 2014, at 2:51 AM, Reuti  wrote:
>>> 
 Am 20.08.2014 um 23:16 schrieb Ralph Castain:
 
> 
> On Aug 20, 2014, at 11:16 AM, Reuti  wrote:
> 
>> Am 20.08.2014 um 19:05 schrieb Ralph Castain:
>> 
 
 Aha, this is quite interesting - how do you do this: scanning the 
 /proc//status or alike? What happens if you don't find enough 
 free cores as they are used up by other applications already?
 
>>> 
>>> Remember, when you use mpirun to launch, we launch our own daemons 
>>> using the native launcher (e.g., qsub). So the external RM will bind 
>>> our daemons to the specified cores on each node. We use hwloc to 
>>> determine what cores our daemons are bound to, and then bind our own 
>>> child processes to cores within that range.
>> 
>> Thx for reminding me of this. Indeed, I mixed up two different aspects 
>> in this discussion.
>> 
>> a) What will happen in case no binding was done by the RM (hence Open 
>> MPI could use all cores) and two Open MPI jobs (or something completely 
>> different besides one Open MPI job) are running on the same node (due to 
>> the Tight Integration with two different Open MPI directories in /tmp 
>> and two `orted`, unique for each job)? Will the second Open MPI job know 
>> what the first Open MPI job used up already? Or will both use the same 
>> set of cores as "-bind-to none" can't be set in the given `mpiexec` 
>> command because of "-map-by slot:pe=$OMP_NUM_THREADS" was used - which 
>> triggers "-bind-to core" indispensable and can't be switched off? I see 
>> the same cores being used for both jobs.
> 
> Yeah, each mpirun executes completely independently of the other, so they 
> have no idea what the other is doing. So the cores will be overloaded. 
> Multi-pe's requires bind-to-core otherwise there is no way to implement 
> the request
 
 Yep, and so it's no option in a mixed cluster. Why would it hurt to allow 
 "-bind-to none" here?
>>> 
>>> Guess I'm confused here - what does pe=N mean if we bind-to none?? If you 
>>> are running on a mixed cluster and don't want binding, then just say 
>>> bind-to none and leave the pe argument out entirely as it wouldn't mean 
>>> anything unless you are bound
>> 
>> It would mean: divide the overall number of slots/cores in the machinefile by 
>> N (i.e. $OMP_NUM_THREADS).
>> 
>> - Request made to the queuing system: I need 80 cores in total.
>> - The machinefile will contain 80 cores
>> - Open MPI will divide it by N, i.e. 8 here
>> - Open MPI will start only 10 processes, one on each node
>> - The application will use 8 threads per started MPI process
> 
> I see - so you were talking about the case where the user doesn't provide the 
> -np N option

Yes. Even if -np is specified: AFAICS Open MPI fills up the given slots in the 
machinefile from the beginning (first nodes get all the processes, remaining 
nodes are free). Making it in a round-robin way would work better for this case.


> and we need to compute the number of procs to start. Okay, the change you 
> requested below will fix that one too. I can make that easily enough.

Therefore I wanted to start a discussion about it (at that time I wasn't aware 
of the "-map-by slot:pe=N" option), as I have no final syntax which would cover 
all cases. Someone may want the binding by the "-map-by slot:pe=N". How can 
this be specified, while keeping an easy tight-integration for users who don't 
want any binding at all.

The boundary conditions are:

- the job is running inside a queuingsystem
- the user requests the overall amount of slots to the queuingsystem
- hence the machinefile has entries for all slots
- the user sets OMP_NUM_THREADS

case 1) no interest in any binding, other jobs may exist on the nodes

case 2) user wants binding: i.e. $OMP_NUM_THREADS cores assigned to each MPI 
process, maybe with "-map-by slot:pe=N"

In both cases only (overall amount of slots) / ($OMP_NUM_THREADS) MPI processes 
should be started, not (overall amount of slots) processes AFAICS.

-- Reuti


>> -- Reuti
>> 
>> 
 
 
>> Altering the machinefile instead: the processes are not bound to any 
>> core, and the OS takes care of a proper assignment.
 
 Here the ordinary user has to mangle the hostfile, this is not good (but 
 allows several jobs per node as the OS shift the processes around). 
 Could/should it be put into the "gridengine" module in OpenMPI, to divide 
 the slot count per node automatically when $OMP_NUM_THREADS is found, or 
 generate an error if it's not divisible?
>>> 
>>> Sure, that could be done - but it will only 

Re: [OMPI users] Running a hybrid MPI+openMP program

2014-08-21 Thread Ralph Castain

On Aug 21, 2014, at 6:54 AM, Reuti  wrote:

> Am 21.08.2014 um 15:45 schrieb Ralph Castain:
> 
>> On Aug 21, 2014, at 2:51 AM, Reuti  wrote:
>> 
>>> Am 20.08.2014 um 23:16 schrieb Ralph Castain:
>>> 
 
 On Aug 20, 2014, at 11:16 AM, Reuti  wrote:
 
> Am 20.08.2014 um 19:05 schrieb Ralph Castain:
> 
>>> 
>>> Aha, this is quite interesting - how do you do this: scanning the 
>>> /proc//status or alike? What happens if you don't find enough free 
>>> cores as they are used up by other applications already?
>>> 
>> 
>> Remember, when you use mpirun to launch, we launch our own daemons using 
>> the native launcher (e.g., qsub). So the external RM will bind our 
>> daemons to the specified cores on each node. We use hwloc to determine 
>> what cores our daemons are bound to, and then bind our own child 
>> processes to cores within that range.
> 
> Thx for reminding me of this. Indeed, I mixed up two different aspects in 
> this discussion.
> 
> a) What will happen in case no binding was done by the RM (hence Open MPI 
> could use all cores) and two Open MPI jobs (or something completely 
> different besides one Open MPI job) are running on the same node (due to 
> the Tight Integration with two different Open MPI directories in /tmp and 
> two `orted`, unique for each job)? Will the second Open MPI job know what 
> the first Open MPI job used up already? Or will both use the same set of 
> cores as "-bind-to none" can't be set in the given `mpiexec` command 
> because of "-map-by slot:pe=$OMP_NUM_THREADS" was used - which triggers 
> "-bind-to core" indispensable and can't be switched off? I see the same 
> cores being used for both jobs.
 
 Yeah, each mpirun executes completely independently of the other, so they 
 have no idea what the other is doing. So the cores will be overloaded. 
 Multi-pe's requires bind-to-core otherwise there is no way to implement 
 the request
>>> 
>>> Yep, and so it's no option in a mixed cluster. Why would it hurt to allow 
>>> "-bind-to none" here?
>> 
>> Guess I'm confused here - what does pe=N mean if we bind-to none?? If you 
>> are running on a mixed cluster and don't want binding, then just say bind-to 
>> none and leave the pe argument out entirely as it wouldn't mean anything 
>> unless you are bound
> 
> It would mean: divide the overall number of slots/cores in the machinefile by 
> N (i.e. $OMP_NUM_THREADS).
> 
> - Request made to the queuing system: I need 80 cores in total.
> - The machinefile will contain 80 cores
> - Open MPI will divide it by N, i.e. 8 here
> - Open MPI will start only 10 processes, one on each node
> - The application will use 8 threads per started MPI process

I see - so you were talking about the case where the user doesn't provide the 
-np N option and we need to compute the number of procs to start. Okay, the 
change you requested below will fix that one too. I can make that easily enough.

> 
> -- Reuti
> 
> 
>>> 
>>> 
> Altering the machinefile instead: the processes are not bound to any 
> core, and the OS takes care of a proper assignment.
>>> 
>>> Here the ordinary user has to mangle the hostfile, this is not good (but 
>>> allows several jobs per node as the OS shift the processes around). 
>>> Could/should it be put into the "gridengine" module in OpenMPI, to divide 
>>> the slot count per node automatically when $OMP_NUM_THREADS is found, or 
>>> generate an error if it's not divisible?
>> 
>> Sure, that could be done - but it will only help if OMP_NUM_THREADS is set 
>> when someone spins off threads. So far as I know, that's only used for 
>> OpenMP - so we'd get a little help, but it wouldn't be full coverage.
>> 
>> 
>>> 
>>> ===
>>> 
>> If the cores we are bound to are the same on each node, then we will do 
>> this with no further instruction. However, if the cores are different on 
>> the individual nodes, then you need to add --hetero-nodes to your 
>> command line (as the nodes appear to be heterogeneous to us).
> 
> b) Aha, it's not only about different CPU types, but also the same CPU type 
> with different allocations between the nodes? It's not in the `mpiexec` 
> man-page of 1.8.1 though. I'll have a look at it.
>>> 
>>> I tried:
>>> 
>>> $ qsub -binding linear:2:0 -pe smp2 8 -masterq parallel@node01 -q 
>>> parallel@node0[1-4] test_openmpi.sh 
>>> Your job 247109 ("test_openmpi.sh") has been submitted
>>> $ qsub -binding linear:2:1 -pe smp2 8 -masterq parallel@node01 -q 
>>> parallel@node0[1-4] test_openmpi.sh 
>>> Your job 247110 ("test_openmpi.sh") has been submitted
>>> 
>>> 
>>> Getting on node03:
>>> 
>>> 
>>> 6733 ?Sl 0:00  \_ sge_shepherd-247109 -bg
>>> 6734 ?SNs0:00  |   \_ /usr/sge/utilbin/lx24-amd64/qrsh_starter 
>>> 

Re: [OMPI users] Running a hybrid MPI+openMP program

2014-08-21 Thread Reuti
Am 21.08.2014 um 15:45 schrieb Ralph Castain:

> On Aug 21, 2014, at 2:51 AM, Reuti  wrote:
> 
>> Am 20.08.2014 um 23:16 schrieb Ralph Castain:
>> 
>>> 
>>> On Aug 20, 2014, at 11:16 AM, Reuti  wrote:
>>> 
 Am 20.08.2014 um 19:05 schrieb Ralph Castain:
 
>> 
>> Aha, this is quite interesting - how do you do this: scanning the 
>> /proc//status or alike? What happens if you don't find enough free 
>> cores as they are used up by other applications already?
>> 
> 
> Remember, when you use mpirun to launch, we launch our own daemons using 
> the native launcher (e.g., qsub). So the external RM will bind our 
> daemons to the specified cores on each node. We use hwloc to determine 
> what cores our daemons are bound to, and then bind our own child 
> processes to cores within that range.
 
 Thx for reminding me of this. Indeed, I mixed up two different aspects in 
 this discussion.
 
 a) What will happen in case no binding was done by the RM (hence Open MPI 
 could use all cores) and two Open MPI jobs (or something completely 
 different besides one Open MPI job) are running on the same node (due to 
 the Tight Integration with two different Open MPI directories in /tmp and 
 two `orted`, unique for each job)? Will the second Open MPI job know what 
 the first Open MPI job used up already? Or will both use the same set of 
 cores as "-bind-to none" can't be set in the given `mpiexec` command 
 because of "-map-by slot:pe=$OMP_NUM_THREADS" was used - which triggers 
 "-bind-to core" indispensable and can't be switched off? I see the same 
 cores being used for both jobs.
>>> 
>>> Yeah, each mpirun executes completely independently of the other, so they 
>>> have no idea what the other is doing. So the cores will be overloaded. 
>>> Multi-pe's requires bind-to-core otherwise there is no way to implement the 
>>> request
>> 
>> Yep, and so it's no option in a mixed cluster. Why would it hurt to allow 
>> "-bind-to none" here?
> 
> Guess I'm confused here - what does pe=N mean if we bind-to none?? If you are 
> running on a mixed cluster and don't want binding, then just say bind-to none 
> and leave the pe argument out entirely as it wouldn't mean anything unless 
> you are bound

It would mean: divide the overall number of slots/cores in the machinefile by N 
(i.e. $OMP_NUM_THREADS).

- Request made to the queuing system: I need 80 cores in total.
- The machinefile will contain 80 cores
- Open MPI will divide it by N, i.e. 8 here
- Open MPI will start only 10 processes, one on each node
- The application will use 8 threads per started MPI process
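
Done by hand in the job script today, that division would look roughly like
this (a sketch only; SGE provides $NSLOTS):

export OMP_NUM_THREADS=8
NPROCS=$(( NSLOTS / OMP_NUM_THREADS ))   # e.g. 80 slots / 8 threads = 10 MPI processes
mpirun -np $NPROCS ./yourapp.exe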

-- Reuti


>> 
>> 
 Altering the machinefile instead: the processes are not bound to any core, 
 and the OS takes care of a proper assignment.
>> 
>> Here the ordinary user has to mangle the hostfile, this is not good (but 
>> allows several jobs per node as the OS shift the processes around). 
>> Could/should it be put into the "gridengine" module in OpenMPI, to divide 
>> the slot count per node automatically when $OMP_NUM_THREADS is found, or 
>> generate an error if it's not divisible?
> 
> Sure, that could be done - but it will only help if OMP_NUM_THREADS is set 
> when someone spins off threads. So far as I know, that's only used for OpenMP 
> - so we'd get a little help, but it wouldn't be full coverage.
> 
> 
>> 
>> ===
>> 
> If the cores we are bound to are the same on each node, then we will do 
> this with no further instruction. However, if the cores are different on 
> the individual nodes, then you need to add --hetero-nodes to your command 
> line (as the nodes appear to be heterogeneous to us).
 
 b) Aha, it's not only about different CPU types, but also the same CPU type 
 with different allocations between the nodes? It's not in the `mpiexec` 
 man-page of 1.8.1 though. I'll have a look at it.
>> 
>> I tried:
>> 
>> $ qsub -binding linear:2:0 -pe smp2 8 -masterq parallel@node01 -q 
>> parallel@node0[1-4] test_openmpi.sh 
>> Your job 247109 ("test_openmpi.sh") has been submitted
>> $ qsub -binding linear:2:1 -pe smp2 8 -masterq parallel@node01 -q 
>> parallel@node0[1-4] test_openmpi.sh 
>> Your job 247110 ("test_openmpi.sh") has been submitted
>> 
>> 
>> Getting on node03:
>> 
>> 
>> 6733 ?Sl 0:00  \_ sge_shepherd-247109 -bg
>> 6734 ?SNs0:00  |   \_ /usr/sge/utilbin/lx24-amd64/qrsh_starter 
>> /var/spool/sge/node03/active_jobs/247109.1/1.node03
>> 6741 ?SN 0:00  |   \_ orted -mca orte_hetero_nodes 1 -mca 
>> ess env -mca orte_ess_jobid 1493303296 -mca orte_ess_vpid
>> 6742 ?RNl0:31  |   \_ ./mpihello
>> 6745 ?Sl 0:00  \_ sge_shepherd-247110 -bg
>> 6746 ?SNs0:00  \_ /usr/sge/utilbin/lx24-amd64/qrsh_starter 
>> /var/spool/sge/node03/active_jobs/247110.1/1.node03
>> 6753 ? 

Re: [OMPI users] Running a hybrid MPI+openMP program

2014-08-21 Thread Ralph Castain

On Aug 21, 2014, at 2:51 AM, Reuti  wrote:

> Am 20.08.2014 um 23:16 schrieb Ralph Castain:
> 
>> 
>> On Aug 20, 2014, at 11:16 AM, Reuti  wrote:
>> 
>>> Am 20.08.2014 um 19:05 schrieb Ralph Castain:
>>> 
> 
> Aha, this is quite interesting - how do you do this: scanning the 
> /proc//status or alike? What happens if you don't find enough free 
> cores as they are used up by other applications already?
> 
 
 Remember, when you use mpirun to launch, we launch our own daemons using 
 the native launcher (e.g., qsub). So the external RM will bind our daemons 
 to the specified cores on each node. We use hwloc to determine what cores 
 our daemons are bound to, and then bind our own child processes to cores 
 within that range.
>>> 
>>> Thx for reminding me of this. Indeed, I mixed up two different aspects in 
>>> this discussion.
>>> 
>>> a) What will happen in case no binding was done by the RM (hence Open MPI 
>>> could use all cores) and two Open MPI jobs (or something completely 
>>> different besides one Open MPI job) are running on the same node (due to 
>>> the Tight Integration with two different Open MPI directories in /tmp and 
>>> two `orted`, unique for each job)? Will the second Open MPI job know what 
>>> the first Open MPI job used up already? Or will both use the same set of 
>>> cores as "-bind-to none" can't be set in the given `mpiexec` command 
>>> because of "-map-by slot:pe=$OMP_NUM_THREADS" was used - which triggers 
>>> "-bind-to core" indispensable and can't be switched off? I see the same 
>>> cores being used for both jobs.
>> 
>> Yeah, each mpirun executes completely independently of the other, so they 
>> have no idea what the other is doing. So the cores will be overloaded. 
>> Multi-pe's requires bind-to-core otherwise there is no way to implement the 
>> request
> 
> Yep, and so it's no option in a mixed cluster. Why would it hurt to allow 
> "-bind-to none" here?

Guess I'm confused here - what does pe=N mean if we bind-to none?? If you are 
running on a mixed cluster and don't want binding, then just say bind-to none 
and leave the pe argument out entirely as it wouldn't mean anything unless you 
are bound

> 
> 
>>> Altering the machinefile instead: the processes are not bound to any core, 
>>> and the OS takes care of a proper assignment.
> 
> Here the ordinary user has to mangle the hostfile, this is not good (but 
> allows several jobs per node as the OS shift the processes around). 
> Could/should it be put into the "gridengine" module in OpenMPI, to divide the 
> slot count per node automatically when $OMP_NUM_THREADS is found, or generate 
> an error if it's not divisible?

Sure, that could be done - but it will only help if OMP_NUM_THREADS is set when 
someone spins off threads. So far as I know, that's only used for OpenMP - so 
we'd get a little help, but it wouldn't be full coverage.


> 
> ===
> 
 If the cores we are bound to are the same on each node, then we will do 
 this with no further instruction. However, if the cores are different on 
 the individual nodes, then you need to add --hetero-nodes to your command 
 line (as the nodes appear to be heterogeneous to us).
>>> 
>>> b) Aha, it's not only about different CPU types, but also the same CPU type with 
>>> different allocations between the nodes? It's not in the `mpiexec` man-page 
>>> of 1.8.1 though. I'll have a look at it.
> 
> I tried:
> 
> $ qsub -binding linear:2:0 -pe smp2 8 -masterq parallel@node01 -q 
> parallel@node0[1-4] test_openmpi.sh 
> Your job 247109 ("test_openmpi.sh") has been submitted
> $ qsub -binding linear:2:1 -pe smp2 8 -masterq parallel@node01 -q 
> parallel@node0[1-4] test_openmpi.sh 
> Your job 247110 ("test_openmpi.sh") has been submitted
> 
> 
> Getting on node03:
> 
> 
> 6733 ?Sl 0:00  \_ sge_shepherd-247109 -bg
> 6734 ?SNs0:00  |   \_ /usr/sge/utilbin/lx24-amd64/qrsh_starter 
> /var/spool/sge/node03/active_jobs/247109.1/1.node03
> 6741 ?SN 0:00  |   \_ orted -mca orte_hetero_nodes 1 -mca ess 
> env -mca orte_ess_jobid 1493303296 -mca orte_ess_vpid
> 6742 ?RNl0:31  |   \_ ./mpihello
> 6745 ?Sl 0:00  \_ sge_shepherd-247110 -bg
> 6746 ?SNs0:00  \_ /usr/sge/utilbin/lx24-amd64/qrsh_starter 
> /var/spool/sge/node03/active_jobs/247110.1/1.node03
> 6753 ?SN 0:00  \_ orted -mca orte_hetero_nodes 1 -mca ess 
> env -mca orte_ess_jobid 1506607104 -mca orte_ess_vpid
> 6754 ?RNl0:25  \_ ./mpihello
> 
> 
> reuti@node03:~> cat /proc/6741/status | grep Cpus_
> Cpus_allowed: 
> ,,,,,,,,,,,,,,,0003
> Cpus_allowed_list:0-1
> reuti@node03:~> cat /proc/6753/status | grep Cpus_
> Cpus_allowed: 
> 

Re: [OMPI users] Clarification about OpenMPI, slurm and PMI interface

2014-08-21 Thread Ralph Castain

On Aug 20, 2014, at 11:46 PM, Filippo Spiga  wrote:

> Hi Joshua,
> 
> On Aug 21, 2014, at 12:28 AM, Joshua Ladd  wrote:
>> When launching with mpirun in a SLURM environment, srun is only being used 
>> to launch the ORTE daemons (orteds.)  Since the daemon will already exist on 
>> the node from which you invoked mpirun, this node will not be included in 
>> the list of nodes. SLURM's PMI library is not involved (that functionality 
>> is only necessary if you directly launch your MPI application with srun, in 
>> which case it is used to exchanged wireup info amongst slurmds.) This is the 
>> expected behavior. 
>> 
>> ~/ompi-top-level/orte/mca/plm/plm_slurm_module.c +294
>> /* if the daemon already exists on this node, then
>> * don't include it
>> */
>>if (node->daemon_launched) {
>>continue;
>>}
>> 
>> Do you have a frontend node that you can launch from? What happens if you 
>> set "-np X" where X = 8*ppn. The alternative is to do a direct launch of the 
>> MPI application with srun.
> 
> I understand the logic and I understand that orted on the first node is not 
> needed. But since we use a batch system (SLURM) we do not want people to run 
> their mpirun commands directly on a front-end. Typical scenario: all compute 
> nodes are running fine but we reboot all the login nodes to upgrade the Linux 
> image because of a security update to the kernel. We can keep the login nodes 
> offline potentially for hours without stopping the system from working. 
> 
> From our perspective, a front-end node is an additional burden. Of course the 
> login node and front-end node can be two separate hosts, but I am looking for 
> a way to keep our setup as it is without introducing structural changes. 
> 
> 
> Hi Ralph,
> 
> On Aug 21, 2014, at 12:36 AM, Ralph Castain  wrote:
>> Or you can add 
>> 
>>   -nolocal|--nolocalDo not run any MPI applications on the local node
>> 
>> to your mpirun command line and we won't run any application procs on the 
>> node where mpirun is executing
> 
> I tried, but of course mpirun complains. If it cannot run locally (meaning 
> on the first node, tesla121) then only 7 nodes remain and I requested 8 in 
> total. So to use "--nolocal" I need to add another node. Since we allocate 
> nodes exclusively and for some users we charge real money for the usage... 
> this is not ideal, I am afraid.
> 
> 
> srun seems the only way to go. I need to understand how to pass most of 
> the --mca parameters to srun and to be sure I can control the rmaps_lama_* 
> options as flexibly as I do with normal mpirun. Then there are mxm, fca, 
> hcoll... I am not against srun in principle; my only stopping point is that 
> the syntax is different enough that we might receive a lot of (too many) 
> complaints from our users in adopting this new way to submit, because they 
> are used to classic mpirun inside an sbatch script. Most of them will 
> probably not switch to a different method! So our hope to "silently" profile 
> network, energy, and I/O with SLURM plugins while also using Open MPI is lost...

I'm afraid that none of the mapping or binding options would be available under 
srun as those only work via mpirun. You can pass MCA params in the environment 
of course, or in default MCA param files.
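
For example, MCA parameters can be exported with the OMPI_MCA_ prefix in the
sbatch script before invoking srun (a sketch only; the particular values are
placeholders, not a recommendation for this cluster):

export OMPI_MCA_oob_tcp_if_include=ib0   # same effect as --mca oob_tcp_if_include ib0
export OMPI_MCA_btl=self,sm,openib
srun --mpi=pmi2 ./your_app

The same settings can also go into the per-user default file
$HOME/.openmpi/mca-params.conf, one "name = value" pair per line.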

> 
> F
> 
> --
> Mr. Filippo SPIGA, M.Sc.
> http://filippospiga.info ~ skype: filippo.spiga
> 
> «Nobody will drive us out of Cantor's paradise.» ~ David Hilbert
> 
> *
> Disclaimer: "Please note this message and any attachments are CONFIDENTIAL 
> and may be privileged or otherwise protected from disclosure. The contents 
> are not to be disclosed to anyone other than the addressee. Unauthorized 
> recipients are requested to preserve this confidentiality and to advise the 
> sender immediately of any error in transmission."
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2014/08/25104.php



Re: [OMPI users] ORTE daemon has unexpectedly failed after launch

2014-08-21 Thread Ralph Castain
Not sure I understand. The problem has been fixed in both the trunk and the 1.8 
branch now, so you should be able to work with either of those nightly builds.

On Aug 21, 2014, at 12:02 AM, Timur Ismagilov  wrote:

> Have I any opportunity to run MPI jobs?
> 
> 
> Wed, 20 Aug 2014 10:48:38 -0700 от Ralph Castain :
> yes, i know - it is cmr'd
> 
> On Aug 20, 2014, at 10:26 AM, Mike Dubman  wrote:
> 
>> btw, we get same error in v1.8 branch as well.
>> 
>> 
>> On Wed, Aug 20, 2014 at 8:06 PM, Ralph Castain  wrote:
>> It was not yet fixed - but should be now.
>> 
>> On Aug 20, 2014, at 6:39 AM, Timur Ismagilov  wrote:
>> 
>>> Hello!
>>> 
>>> As I can see, the bug is fixed, but in Open MPI v1.9a1r32516 I still have 
>>> the problem
>>> 
>>> a)
>>> $ mpirun  -np 1 ./hello_c
>>> 
>>> --
>>> An ORTE daemon has unexpectedly failed after launch and before
>>> communicating back to mpirun. This could be caused by a number
>>> of factors, including an inability to create a connection back
>>> to mpirun due to a lack of common network interfaces and/or no
>>> route found between them. Please check network connectivity
>>> (including firewalls and network routing requirements).
>>> --
>>> 
>>> b)
>>> $ mpirun --mca oob_tcp_if_include ib0 -np 1 ./hello_c
>>> --
>>> An ORTE daemon has unexpectedly failed after launch and before
>>> communicating back to mpirun. This could be caused by a number
>>> of factors, including an inability to create a connection back
>>> to mpirun due to a lack of common network interfaces and/or no
>>> route found between them. Please check network connectivity
>>> (including firewalls and network routing requirements).
>>> --
>>> 
>>> c)
>>> 
>>> $ mpirun --mca oob_tcp_if_include ib0 -debug-daemons --mca plm_base_verbose 
>>> 5 -mca oob_base_verbose 10 -mca rml_base_verbose 10 -np 1 ./hello_c
>>> 
>>> [compiler-2:14673] mca:base:select:( plm) Querying component [isolated]
>>> [compiler-2:14673] mca:base:select:( plm) Query of component [isolated] set 
>>> priority to 0
>>> [compiler-2:14673] mca:base:select:( plm) Querying component [rsh]
>>> [compiler-2:14673] mca:base:select:( plm) Query of component [rsh] set 
>>> priority to 10
>>> [compiler-2:14673] mca:base:select:( plm) Querying component [slurm]
>>> [compiler-2:14673] mca:base:select:( plm) Query of component [slurm] set 
>>> priority to 75
>>> [compiler-2:14673] mca:base:select:( plm) Selected component [slurm]
>>> [compiler-2:14673] mca: base: components_register: registering oob 
>>> components
>>> [compiler-2:14673] mca: base: components_register: found loaded component 
>>> tcp
>>> [compiler-2:14673] mca: base: components_register: component tcp register 
>>> function successful
>>> [compiler-2:14673] mca: base: components_open: opening oob components
>>> [compiler-2:14673] mca: base: components_open: found loaded component tcp
>>> [compiler-2:14673] mca: base: components_open: component tcp open function 
>>> successful
>>> [compiler-2:14673] mca:oob:select: checking available component tcp
>>> [compiler-2:14673] mca:oob:select: Querying component [tcp]
>>> [compiler-2:14673] oob:tcp: component_available called
>>> [compiler-2:14673] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
>>> [compiler-2:14673] WORKING INTERFACE 2 KERNEL INDEX 3 FAMILY: V4
>>> [compiler-2:14673] WORKING INTERFACE 3 KERNEL INDEX 4 FAMILY: V4
>>> [compiler-2:14673] WORKING INTERFACE 4 KERNEL INDEX 5 FAMILY: V4
>>> [compiler-2:14673] WORKING INTERFACE 5 KERNEL INDEX 6 FAMILY: V4
>>> [compiler-2:14673] [[49095,0],0] oob:tcp:init adding 10.128.0.4 to our list 
>>> of V4 connections
>>> [compiler-2:14673] WORKING INTERFACE 6 KERNEL INDEX 7 FAMILY: V4
>>> [compiler-2:14673] [[49095,0],0] TCP STARTUP
>>> [compiler-2:14673] [[49095,0],0] attempting to bind to IPv4 port 0
>>> [compiler-2:14673] [[49095,0],0] assigned IPv4 port 59460
>>> [compiler-2:14673] mca:oob:select: Adding component to end
>>> [compiler-2:14673] mca:oob:select: Found 1 active transports
>>> [compiler-2:14673] mca: base: components_register: registering rml 
>>> components
>>> [compiler-2:14673] mca: base: components_register: found loaded component 
>>> oob
>>> [compiler-2:14673] mca: base: components_register: component oob has no 
>>> register or open function
>>> [compiler-2:14673] mca: base: components_open: opening rml components
>>> [compiler-2:14673] mca: base: components_open: found loaded component oob
>>> [compiler-2:14673] mca: base: components_open: component oob open function 
>>> successful
>>> [compiler-2:14673] orte_rml_base_select: initializing rml component oob
>>> [compiler-2:14673] 

Re: [OMPI users] Running a hybrid MPI+openMP program

2014-08-21 Thread Reuti
Hi,

Am 20.08.2014 um 20:08 schrieb Oscar Mojica:

> Well, with qconf -sq one.q I got the following:
> 
> [oscar@aguia free-noise]$ qconf -sq one.q
> qname one.q
> hostlist compute-1-30.local compute-1-2.local 
> compute-1-3.local \
>   compute-1-4.local compute-1-5.local compute-1-6.local \
>   compute-1-7.local compute-1-8.local compute-1-9.local \
>   compute-1-10.local compute-1-11.local 
> compute-1-12.local \
>   compute-1-13.local compute-1-14.local compute-1-15.local
> seq_no0
> load_thresholds np_load_avg=1.75
> suspend_thresholds  NONE
> nsuspend  1
> suspend_interval00:05:00
> priority0
> min_cpu_interval00:05:00
> processors UNDEFINED
> qtype BATCH INTERACTIVE
> ckpt_list   NONE
> pe_list make mpich mpi orte
> rerun FALSE
> slots  1,[compute-1-30.local=1],[compute-1-2.local=1], \
>   [compute-1-3.local=1],[compute-1-5.local=1], \
>   [compute-1-8.local=1],[compute-1-6.local=1], \
>   [compute-1-4.local=1],[compute-1-9.local=1], \
>   [compute-1-11.local=1],[compute-1-7.local=1], \
>   [compute-1-13.local=1],[compute-1-10.local=1], \
>   [compute-1-15.local=1],[compute-1-12.local=1], \
>   [compute-1-14.local=1]
> 
> the admin was the one who created this queue, so I have to speak to him about changing 
> the number of slots to the number of threads that I wish to use. 

Yep. I think it was his intention to allow exclusive use of each node this way 
(this can also be done in SGE by other means). While one could do it, it 
doesn't tell SGE the proper number of cores the user wants to use (it's more 
like the number of machines), so any accounting won't work, nor will `qacct` 
report correctly what the job requested at the time it was submitted.


> Then I could make use of: 
> ===
> export OMP_NUM_THREADS=N 
> mpirun -map-by slot:pe=$OMP_NUM_THREADS -np $(bc <<<"$NSLOTS / 
> $OMP_NUM_THREADS") ./inverse.exe
> ==

As mentioned by tmishima, it's sufficient to use:

$ qsub -pe orte 80 ...

export OMP_NUM_THREADS=8
mpirun -map-by slot:pe=$OMP_NUM_THREADS ./yourapp.exe


=> you get a proper binding here, either if you are alone on each machine, or 
if all jobs get a proper binding and Open MPI stays inside it (not all versions 
of SGE support this though)
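
If in doubt whether the binding really happened, adding --report-bindings to 
the mpirun line will print the core mask each rank ended up with (just a quick 
check; the exact output format differs between Open MPI versions):

export OMP_NUM_THREADS=8
mpirun --report-bindings -map-by slot:pe=$OMP_NUM_THREADS ./yourapp.exe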


> For now, in my case, this command line would only work for 10 processes and the 
> work wouldn't be divided into threads, is that right?

It works for 10 machines which you get exclusively; hence oversubscribing the 
granted single slot on each machine with "-bind-to none", as Ralph mentioned 
in the beginning, is up to you (unless other users would get hurt because they 
are running their jobs there too).

$ qsub -pe orte 10 ...

export OMP_NUM_THREADS=8
mpirun -bind-to none ./yourapp.exe


=> The OS will shift the processes around, while SGE doesn't know anything 
about the final number of slots/cores you want to use on each machine (or to 
leave free for others).

===

Both ways above work right now, but IMO it's not optimal in a shared 
cluster for the SGE versions w/o hard-binding. In the second case Open MPI 
starts 1 process per node, as we need it. But if you were to request `qsub 
-pe orte 80 ...` here too, Open MPI would start 80 processes. To avoid this I 
came up with altering the machinefile to give Open MPI different information 
about the granted slots on each machine.

$ qsub -pe orte 80 ...

export OMP_NUM_THREADS=8
awk -v omp_num_threads=$OMP_NUM_THREADS '{ $2/=omp_num_threads; print }' \
  $PE_HOSTFILE > $TMPDIR/machines
export PE_HOSTFILE=$TMPDIR/machines
mpirun -bind-to none ./yourapp.exe
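
For illustration only - the hostnames, slot counts and trailing columns below 
are made up, the real file is whatever SGE wrote - a granted $PE_HOSTFILE like

node01 16 parallel@node01 UNDEFINED
node02 16 parallel@node02 UNDEFINED

would, with OMP_NUM_THREADS=8, be rewritten by the awk filter to

node01 2 parallel@node01 UNDEFINED
node02 2 parallel@node02 UNDEFINED

so Open MPI sees only 2 slots per node and starts 2 processes there.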

===

I hope having all three versions in one email sheds some light on it.

-- Reuti


> can I set a maximum number of threads in the queue one.q (e.g. 15) and 
> change the number in the 'export' as convenient?
> 
> I feel like a child hearing the adults speaking
> Thanks I'm learning a lot   
>   
> 
> Oscar Fabian Mojica Ladino
> Geologist M.S. in  Geophysics
> 
> 
> > From: re...@staff.uni-marburg.de
> > Date: Tue, 19 Aug 2014 19:51:46 +0200
> > To: us...@open-mpi.org
> > Subject: Re: [OMPI users] Running a hybrid MPI+openMP program
> > 
> > Hi,
> > 
> > On 19.08.2014 at 19:06, Oscar Mojica wrote:
> > 
> > > I discovered what the error was. I forgot to include '-fopenmp' when I 
> > > compiled the objects in the Makefile, so the program ran but it didn't 
> > > divide the job into threads. Now the program is working and I can use up to 
> > > 15 cores per machine in the queue one.q.
> > > 
> > > Anyway I would like to try to implement your advice. Well, I'm not alone in 
> > > the cluster so I must implement your second suggestion. The steps 

Re: [OMPI users] Running a hybrid MPI+openMP program

2014-08-21 Thread Reuti
Hi,

On 21.08.2014 at 01:56, tmish...@jcity.maeda.co.jp wrote:

> Reuti,
> 
> Sorry for confusing you. Under the managed condition, the -np option is
> actually not necessary. So this cmd line also works for me
> with Torque.
> 
> $ qsub -l nodes=10:ppn=N
> $ mpirun -map-by slot:pe=N ./inverse.exe

Aha, yes. Works in SGE too.

To make the notation of threads generic, what about an extension to use:

-map-by slot:pe=omp

where the literal "omp" would trigger the use of $OMP_NUM_THREADS instead?
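
(Today the same effect can of course be had by expanding the variable in the 
job script itself - a sketch, with 8 only as a fallback default:

export OMP_NUM_THREADS=${OMP_NUM_THREADS:-8}
mpirun -map-by slot:pe=$OMP_NUM_THREADS ./yourapp.exe

but a literal "omp" keyword would make the intent self-documenting.)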

-- Reuti


> At least, Ralph confirmed it worked with Slurm and I confirmed it
> with Torque as shown below:
> 
> [mishima@manage ~]$ qsub -I -l nodes=4:ppn=8
> qsub: waiting for job 8798.manage.cluster to start
> qsub: job 8798.manage.cluster ready
> 
> [mishima@node09 ~]$ cat $PBS_NODEFILE
> node09
> node09
> node09
> node09
> node09
> node09
> node09
> node09
> node10
> node10
> node10
> node10
> node10
> node10
> node10
> node10
> node11
> node11
> node11
> node11
> node11
> node11
> node11
> node11
> node12
> node12
> node12
> node12
> node12
> node12
> node12
> node12
> [mishima@node09 ~]$ mpirun -map-by slot:pe=8 -display-map
> ~/mis/openmpi/demos/myprog
> Data for JOB [8050,1] offset 0
> 
>    JOB MAP   
> 
> Data for node: node09  Num slots: 8    Max slots: 0    Num procs: 1
>    Process OMPI jobid: [8050,1] App: 0 Process rank: 0
> 
> Data for node: node10  Num slots: 8    Max slots: 0    Num procs: 1
>    Process OMPI jobid: [8050,1] App: 0 Process rank: 1
> 
> Data for node: node11  Num slots: 8    Max slots: 0    Num procs: 1
>    Process OMPI jobid: [8050,1] App: 0 Process rank: 2
> 
> Data for node: node12  Num slots: 8    Max slots: 0    Num procs: 1
>    Process OMPI jobid: [8050,1] App: 0 Process rank: 3
> 
> =
> Hello world from process 0 of 4
> Hello world from process 2 of 4
> Hello world from process 3 of 4
> Hello world from process 1 of 4
> [mishima@node09 ~]$ mpirun -map-by slot:pe=4 -display-map
> ~/mis/openmpi/demos/myprog
> Data for JOB [8056,1] offset 0
> 
>    JOB MAP   
> 
> Data for node: node09  Num slots: 8    Max slots: 0    Num procs: 2
>    Process OMPI jobid: [8056,1] App: 0 Process rank: 0
>    Process OMPI jobid: [8056,1] App: 0 Process rank: 1
> 
> Data for node: node10  Num slots: 8    Max slots: 0    Num procs: 2
>    Process OMPI jobid: [8056,1] App: 0 Process rank: 2
>    Process OMPI jobid: [8056,1] App: 0 Process rank: 3
> 
> Data for node: node11  Num slots: 8    Max slots: 0    Num procs: 2
>    Process OMPI jobid: [8056,1] App: 0 Process rank: 4
>    Process OMPI jobid: [8056,1] App: 0 Process rank: 5
> 
> Data for node: node12  Num slots: 8    Max slots: 0    Num procs: 2
>    Process OMPI jobid: [8056,1] App: 0 Process rank: 6
>    Process OMPI jobid: [8056,1] App: 0 Process rank: 7
> 
> =
> Hello world from process 1 of 8
> Hello world from process 0 of 8
> Hello world from process 2 of 8
> Hello world from process 3 of 8
> Hello world from process 4 of 8
> Hello world from process 5 of 8
> Hello world from process 6 of 8
> Hello world from process 7 of 8
> 
> I don't know why it doesn't work with SGE. Could you show me
> your output adding the -display-map and -mca rmaps_base_verbose 5 options?
> 
> By the way, the option -map-by ppr:N:node or ppr:N:socket might be
> useful for your purpose. The ppr mapping can reduce the slot counts given
> by the RM without binding and allocate N procs per specified resource.
> 
> [mishima@node09 ~]$ mpirun -map-by ppr:1:node -display-map
> ~/mis/openmpi/demos/myprog
> Data for JOB [7913,1] offset 0
> 
>    JOB MAP   
> 
> Data for node: node09  Num slots: 8    Max slots: 0    Num procs: 1
>    Process OMPI jobid: [7913,1] App: 0 Process rank: 0
> 
> Data for node: node10  Num slots: 8    Max slots: 0    Num procs: 1
>    Process OMPI jobid: [7913,1] App: 0 Process rank: 1
> 
> Data for node: node11  Num slots: 8    Max slots: 0    Num procs: 1
>    Process OMPI jobid: [7913,1] App: 0 Process rank: 2
> 
> Data for node: node12  Num slots: 8    Max slots: 0    Num procs: 1
>    Process OMPI jobid: [7913,1] App: 0 Process rank: 3
> 
> =
> Hello world from process 0 of 4
> Hello world from process 2 of 4
> Hello world from process 1 of 4
> Hello world from process 3 of 4
> 
> Tetsuya
> 
> 
>> Hi,
>> 
>> On 20.08.2014 at 13:26, tmish...@jcity.maeda.co.jp wrote:
>> 
>>> Reuti,
>>> 
>>> If you want to allocate 10 procs with N threads, the Torque
>>> script below should work for you:
>>> 
>>> qsub -l nodes=10:ppn=N
>>> mpirun -map-by slot:pe=N -np 10 -x OMP_NUM_THREADS=N ./inverse.exe
>> 
>> I played around with giving -np 10 in addition to a Tight Integration.
> The slot 

Re: [OMPI users] Running a hybrid MPI+openMP program

2014-08-21 Thread Reuti
On 20.08.2014 at 23:16, Ralph Castain wrote:

> 
> On Aug 20, 2014, at 11:16 AM, Reuti  wrote:
> 
>> On 20.08.2014 at 19:05, Ralph Castain wrote:
>> 
 
 Aha, this is quite interesting - how do you do this: scanning the 
 /proc//status or the like? What happens if you don't find enough free 
 cores because they are already used up by other applications?
 
>>> 
>>> Remember, when you use mpirun to launch, we launch our own daemons using 
>>> the native launcher (e.g., qsub). So the external RM will bind our daemons 
>>> to the specified cores on each node. We use hwloc to determine what cores 
>>> our daemons are bound to, and then bind our own child processes to cores 
>>> within that range.
>> 
>> Thx for reminding me of this. Indeed, I mixed up two different aspects in 
>> this discussion.
>> 
>> a) What will happen in case no binding was done by the RM (hence Open MPI 
>> could use all cores) and two Open MPI jobs (or something completely different 
>> besides one Open MPI job) are running on the same node (due to the Tight 
>> Integration, with two different Open MPI directories in /tmp and two `orted`, 
>> unique for each job)? Will the second Open MPI job know what the first Open 
>> MPI job has already used up? Or will both use the same set of cores, since 
>> "-bind-to none" can't be set in the given `mpiexec` command because "-map-by 
>> slot:pe=$OMP_NUM_THREADS" was used - which makes "-bind-to core" mandatory 
>> and can't be switched off? I see the same cores being used for both jobs.
> 
> Yeah, each mpirun executes completely independently of the other, so they 
> have no idea what the other is doing. So the cores will be overloaded. 
> Multi-pe's require bind-to-core, otherwise there is no way to implement the 
> request.

Yep, and so it's no option in a mixed cluster. Why would it hurt to allow 
"-bind-to none" here?


>> Altering the machinefile instead: the processes are not bound to any core, 
>> and the OS takes care of a proper assignment.

Here the ordinary user has to mangle the hostfile, which is not good (but it 
allows several jobs per node as the OS shifts the processes around). Could/should 
this be put into the "gridengine" module in Open MPI, to divide the slot count 
per node automatically when $OMP_NUM_THREADS is found, or to generate an error 
if it's not divisible?
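
Until something like that exists, a small guard in the job script itself could 
at least catch the non-divisible case (just a sketch, nothing Open MPI specific):

if [ $((NSLOTS % OMP_NUM_THREADS)) -ne 0 ]; then
  # SGE sets NSLOTS for the job; refuse to start if the split would be uneven
  echo "NSLOTS=$NSLOTS not divisible by OMP_NUM_THREADS=$OMP_NUM_THREADS" >&2
  exit 1
fi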

===

>>> If the cores we are bound to are the same on each node, then we will do 
>>> this with no further instruction. However, if the cores are different on 
>>> the individual nodes, then you need to add --hetero-nodes to your command 
>>> line (as the nodes appear to be heterogeneous to us).
>> 
>> b) Aha, so it's not only about different CPU types, but also the same CPU 
>> type with different allocations between the nodes? It's not in the `mpiexec` 
>> man page of 1.8.1 though. I'll have a look at it.

I tried:

$ qsub -binding linear:2:0 -pe smp2 8 -masterq parallel@node01 -q 
parallel@node0[1-4] test_openmpi.sh 
Your job 247109 ("test_openmpi.sh") has been submitted
$ qsub -binding linear:2:1 -pe smp2 8 -masterq parallel@node01 -q 
parallel@node0[1-4] test_openmpi.sh 
Your job 247110 ("test_openmpi.sh") has been submitted


Getting on node03:


 6733 ?   Sl   0:00  \_ sge_shepherd-247109 -bg
 6734 ?   SNs  0:00  |   \_ /usr/sge/utilbin/lx24-amd64/qrsh_starter 
/var/spool/sge/node03/active_jobs/247109.1/1.node03
 6741 ?   SN   0:00  |   \_ orted -mca orte_hetero_nodes 1 -mca ess 
env -mca orte_ess_jobid 1493303296 -mca orte_ess_vpid
 6742 ?   RNl  0:31  |   \_ ./mpihello
 6745 ?   Sl   0:00  \_ sge_shepherd-247110 -bg
 6746 ?   SNs  0:00      \_ /usr/sge/utilbin/lx24-amd64/qrsh_starter 
/var/spool/sge/node03/active_jobs/247110.1/1.node03
 6753 ?   SN   0:00      \_ orted -mca orte_hetero_nodes 1 -mca ess 
env -mca orte_ess_jobid 1506607104 -mca orte_ess_vpid
 6754 ?   RNl  0:25      \_ ./mpihello


reuti@node03:~> cat /proc/6741/status | grep Cpus_
Cpus_allowed:   
,,,,,,,,,,,,,,,0003
Cpus_allowed_list:  0-1
reuti@node03:~> cat /proc/6753/status | grep Cpus_
Cpus_allowed:   
,,,,,,,,,,,,,,,0030
Cpus_allowed_list:  4-5

Hence, "orted" got two cores assigned for each of them. But:


reuti@node03:~> cat /proc/6742/status | grep Cpus_
Cpus_allowed:   
,,,,,,,,,,,,,,,0003
Cpus_allowed_list:  0-1
reuti@node03:~> cat /proc/6754/status | grep Cpus_
Cpus_allowed:   
,,,,,,,,,,,,,,,0003
Cpus_allowed_list:  

Re: [OMPI users] ORTE daemon has unexpectedly failed after launch

2014-08-21 Thread Timur Ismagilov
 Do I have any way to run MPI jobs?


Wed, 20 Aug 2014 10:48:38 -0700 from Ralph Castain :
>yes, i know - it is cmr'd
>
>On Aug 20, 2014, at 10:26 AM, Mike Dubman < mi...@dev.mellanox.co.il > wrote:
>>btw, we get same error in v1.8 branch as well.
>>
>>
>>On Wed, Aug 20, 2014 at 8:06 PM, Ralph Castain  < r...@open-mpi.org > wrote:
>>>It was not yet fixed - but should be now.
>>>
>>>On Aug 20, 2014, at 6:39 AM, Timur Ismagilov < tismagi...@mail.ru > wrote:
Hello!

As I can see, the bug is fixed, but in Open MPI v1.9a1r32516 I still have 
the problem:

a)
$ mpirun  -np 1 ./hello_c
--
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--
b)
$ mpirun --mca oob_tcp_if_include ib0 -np 1 ./hello_c
--
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--

c)

$ mpirun --mca oob_tcp_if_include ib0 -debug-daemons --mca plm_base_verbose 
5 -mca oob_base_verbose 10 -mca rml_base_verbose 10 -np 1 ./hello_c
[compiler-2:14673] mca:base:select:( plm) Querying component [isolated]
[compiler-2:14673] mca:base:select:( plm) Query of component [isolated] set 
priority to 0
[compiler-2:14673] mca:base:select:( plm) Querying component [rsh]
[compiler-2:14673] mca:base:select:( plm) Query of component [rsh] set 
priority to 10
[compiler-2:14673] mca:base:select:( plm) Querying component [slurm]
[compiler-2:14673] mca:base:select:( plm) Query of component [slurm] set 
priority to 75
[compiler-2:14673] mca:base:select:( plm) Selected component [slurm]
[compiler-2:14673] mca: base: components_register: registering oob 
components
[compiler-2:14673] mca: base: components_register: found loaded component 
tcp
[compiler-2:14673] mca: base: components_register: component tcp register 
function successful
[compiler-2:14673] mca: base: components_open: opening oob components
[compiler-2:14673] mca: base: components_open: found loaded component tcp
[compiler-2:14673] mca: base: components_open: component tcp open function 
successful
[compiler-2:14673] mca:oob:select: checking available component tcp
[compiler-2:14673] mca:oob:select: Querying component [tcp]
[compiler-2:14673] oob:tcp: component_available called
[compiler-2:14673] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
[compiler-2:14673] WORKING INTERFACE 2 KERNEL INDEX 3 FAMILY: V4
[compiler-2:14673] WORKING INTERFACE 3 KERNEL INDEX 4 FAMILY: V4
[compiler-2:14673] WORKING INTERFACE 4 KERNEL INDEX 5 FAMILY: V4
[compiler-2:14673] WORKING INTERFACE 5 KERNEL INDEX 6 FAMILY: V4
[compiler-2:14673] [[49095,0],0] oob:tcp:init adding 10.128.0.4 to our list 
of V4 connections
[compiler-2:14673] WORKING INTERFACE 6 KERNEL INDEX 7 FAMILY: V4
[compiler-2:14673] [[49095,0],0] TCP STARTUP
[compiler-2:14673] [[49095,0],0] attempting to bind to IPv4 port 0
[compiler-2:14673] [[49095,0],0] assigned IPv4 port 59460
[compiler-2:14673] mca:oob:select: Adding component to end
[compiler-2:14673] mca:oob:select: Found 1 active transports
[compiler-2:14673] mca: base: components_register: registering rml 
components
[compiler-2:14673] mca: base: components_register: found loaded component 
oob
[compiler-2:14673] mca: base: components_register: component oob has no 
register or open function
[compiler-2:14673] mca: base: components_open: opening rml components
[compiler-2:14673] mca: base: components_open: found loaded component oob
[compiler-2:14673] mca: base: components_open: component oob open function 
successful
[compiler-2:14673] orte_rml_base_select: initializing rml component oob
[compiler-2:14673] [[49095,0],0] posting recv
[compiler-2:14673] [[49095,0],0] posting persistent recv on tag 30 for peer 
[[WILDCARD],WILDCARD]
[compiler-2:14673] [[49095,0],0] posting recv
[compiler-2:14673] [[49095,0],0] posting persistent recv on tag 15 for peer 

Re: [OMPI users] Clarification about OpenMPI, slurm and PMI interface

2014-08-21 Thread Filippo Spiga
Hi Joshua,

On Aug 21, 2014, at 12:28 AM, Joshua Ladd  wrote:
> When launching with mpirun in a SLURM environment, srun is only being used to 
> launch the ORTE daemons (orteds.)  Since the daemon will already exist on the 
> node from which you invoked mpirun, this node will not be included in the 
> list of nodes. SLURM's PMI library is not involved (that functionality is 
> only necessary if you directly launch your MPI application with srun, in 
> which case it is used to exchanged wireup info amongst slurmds.) This is the 
> expected behavior. 
> 
> ~/ompi-top-level/orte/mca/plm/plm_slurm_module.c +294
> /* if the daemon already exists on this node, then
>  * don't include it
>  */
> if (node->daemon_launched) {
> continue;
> }
> 
> Do you have a frontend node that you can launch from? What happens if you set 
> "-np X" where X = 8*ppn. The alternative is to do a direct launch of the MPI 
> application with srun.

I understand the logic and I understand why an orted on the first node is not 
needed. But since we use a batch system (SLURM) we do not want people to run 
their mpirun commands directly on a front-end. Typical scenario: all compute 
nodes are running fine but we reboot all the login nodes to upgrade the Linux 
image because of a security update to the kernel. We can keep the login nodes 
offline potentially for hours without stopping the system from working. 

From our perspective, a front-end node is an additional burden. Of course the 
login node and the front-end node can be two separate hosts, but I am looking 
for a way to keep our setup as it is without introducing structural changes. 


Hi Ralph,

On Aug 21, 2014, at 12:36 AM, Ralph Castain  wrote:
> Or you can add 
> 
>-nolocal|--nolocalDo not run any MPI applications on the local node
> 
> to your mpirun command line and we won't run any application procs on the 
> node where mpirun is executing

I tried, but of course mpirun complains. If it cannot run locally (meaning on 
the first node, tesla121) then only 7 nodes remain and I requested 8 in total. 
So to use "--nolocal" I would need to add another node. Since we allocate nodes 
exclusively, and for some users we charge real money for the usage... this is 
not ideal, I am afraid.


srun seems the only way to go. I need to understand how to pass most of the 
--mca parameters to srun and to be sure I can drive the rmaps_lama_* options as 
flexibly as I do with a normal mpirun. Then there are mxm, fca, hcoll... I am 
not against srun in principle; my only sticking point is that the syntax is 
different enough that we might receive a lot of (too many) complaints from our 
users in adopting this new way to submit, because they are used to using the 
classic mpirun inside an sbatch script. Most of them will probably not switch 
to a different method! So our hope to "silently" profile network, energy and 
I/O using SLURM plugins while also using Open MPI is lost...
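
(For the record, what I have in mind is something like the sketch below inside 
the sbatch script - assuming SLURM's PMI2 plugin is available, and taking the 
btl value only as an example of the OMPI_MCA_* naming pattern, not as a 
recommendation:

export OMPI_MCA_btl=self,sm,openib   # any --mca key can be exported as OMPI_MCA_<key>
srun --mpi=pmi2 ./hello_c

whether the mapping/binding frameworks then still take effect under a direct 
srun launch is exactly what I need to verify.)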

F

--
Mr. Filippo SPIGA, M.Sc.
http://filippospiga.info ~ skype: filippo.spiga

«Nobody will drive us out of Cantor's paradise.» ~ David Hilbert

*
Disclaimer: "Please note this message and any attachments are CONFIDENTIAL and 
may be privileged or otherwise protected from disclosure. The contents are not 
to be disclosed to anyone other than the addressee. Unauthorized recipients are 
requested to preserve this confidentiality and to advise the sender immediately 
of any error in transmission."