Re: [OMPI users] Error using rankfile to bind multiple cores on the same node for threaded OpenMPI application

2022-02-03 Thread Ralph Castain via users
 mpirun -rf rankfile -report-bindings --mca rmaps_rank_file_physical 1 echo ""
--
 WARNING: Open MPI tried to bind a process but failed.  This is a
 warning only; your job will continue, though performance may
 be degraded.
 
   Local host:    4b9bc4c4f40b
   Application name:  /usr/bin/echo
   Error message: failed to bind memory
   Location:  rtc_hwloc.c:447
 
--
 [4b9bc4c4f40b:00041] MCW rank 0 bound to socket 0[core 1[hwt 0-1]], socket 
0[core 3[hwt 0-1]]: [../BB/../BB/../..]
 [4b9bc4c4f40b:00041] MCW rank 1 bound to socket 0[core 0[hwt 0-1]], socket 
0[core 3[hwt 0-1]]: [BB/../../BB/../..]
 
 
 [4b9bc4c4f40b:00041] 1 more process has sent help message 
help-orte-odls-default.txt / memory not bound
 [4b9bc4c4f40b:00041] Set MCA parameter "orte_base_help_aggregate" to 0 to see 
all help / error messages


 

(Just ignore the error messages as they seem to be caused by the fact that I'm 
inside a docker container, as explained here 
https://github.com/open-mpi/ompi/issues/7368)


 

If I omit the mca parameter for physical indexes, however, I get nothing in 
return:

$ mpirun -rf rankfile -report-bindings echo ""
 $

and if I create a rankfile with logical indexes that should be equivalent to 
the other one


 

$ cat rankfile_logical 
 rank 0=localhost slot=0:2,6
 rank 1=localhost slot=0:1,7
 

then I get the following error message:
 


 user@4b9bc4c4f40b:~$ mpirun -rf rankfile_logical -report-bindings echo ""
 [4b9bc4c4f40b:00058] [[15180,0],0] ORTE_ERROR_LOG: Not found in file 
rmaps_rank_file.c at line 333
 [4b9bc4c4f40b:00058] [[15180,0],0] ORTE_ERROR_LOG: Not found in file 
base/rmaps_base_map_job.c at line 402
 
 

If I remove the socket index:

$ cat rankfile_logical 
 rank 0=localhost slot=2,6
 rank 1=localhost slot=1,7
 
 
 $ mpirun -rf rankfile_logical -report-bindings echo ""

$


 

I again get nothing in return. So I'm starting to think that I may indeed be missing 
something. Do you see where, by chance?

Also, hwloc (and therefore lstopo) is not available on the cluster. numactl seems to 
show only physical locations. Do you know of another tool that could help in 
getting the logical IDs of the allocated cores?
 
You then write:
 
It also appears from your output that you are using hwthreads as cpus, so the 
slot descriptions are being applied to threads and not cores. At least, it 
appears that way to me - was that expected?

No, that was actually not expected. All the AMD-based nodes on the cluster 
actually have 1 thread/core, while the Intel-based nodes have hyperthreading 
activated. The example I sent before was on an AMD node. Maybe a stupid 
question, but how can I then make it treat cores as cpus? From the manpages I 
thought this was the default behaviour.

By the way, if I manage to understand everything correctly, I can also 
contribute fixes for these inconsistencies in the manpages. I'd be more than happy 
to help where I can.


On 03.02.22 09:50, Ralph Castain via users wrote:
 
 

Hmmm...okay, I found the code path that fails without an error - not one of the 
ones I was citing. Thanks for that detailed explanation of what you were doing! 
I'll add some code to the master branch to plug that hole along with the other 
I identified.

Just an FYI: we stopped supporting "physical" cpus a long time ago, so the 
"rmaps_rank_file_physical" MCA param is just being ignored (we don't have a way 
to detect that the param you cited doesn't exist). We only take the input as 
being "logical" cpu designations. You might check, but I suspect the two 
(logical and physical IDs) are the same here.

It also appears from your output that you are using hwthreads as cpus, so the 
slot descriptions are being applied to threads and not cores. At least, it 
appears that way to me - was that expected?

On Feb 3, 2022, at 12:27 AM, David Perozzi  wrote:

Thanks for looking into that and sorry if I only included the version in use in 
the pastebin. I'll ask the cluster support if they could install OMPI master.

I really am unfamiliar with openmpi's codebase, so I haven't looked into it and 
am very thankful that you could already identify possible places that I 
could've "visited". One thing that I can add, however, is that I tried both on 
the cluster (OMPI 4.0.2) and on my local machine (OMPI 4.0.3) to run a dummy 
test, which basically consists of launching the following:

$ mpirun -rf rankfile -report-bindings --mca rmaps_rank_file_physical 1 echo ""

I report here the results coming from the cluster, where I allocated 6 cores, 
all on the same node:

$ numactl --show
policy: default
preferred node: current
physcpubind: 3 11 12 13 21 29
cpubind: 0 1
nodebind: 0 1
membind: 0 1 2 3 4 5 6 7

$ hostname
eu-g1-018-1


$ mpirun -rf 

Re: [OMPI users] Error using rankfile to bind multiple cores on the same node for threaded OpenMPI application

2022-02-03 Thread Ralph Castain via users
Hmmm...okay, I found the code path that fails without an error - not one of the 
ones I was citing. Thanks for that detailed explanation of what you were doing! 
I'll add some code to the master branch to plug that hole along with the other 
I identified.

Just an FYI: we stopped supporting "physical" cpus a long time ago, so the 
"rmaps_rank_file_physical" MCA param is just being ignored (we don't have a way 
to detect that the param you cited doesn't exist). We only take the input as 
being "logical" cpu designations. You might check, but I suspect the two 
(logical and physical IDs) are the same here.
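
(One way to check whether a given MCA parameter is even recognized by a particular build is to grep ompi_info's full parameter list, e.g.:

$ ompi_info --all | grep rmaps_rank_file

If the parameter doesn't show up there, that build is silently ignoring it.)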

It also appears from your output that you are using hwthreads as cpus, so the 
slot descriptions are being applied to threads and not cores. At least, it 
appears that way to me - was that expected?



> On Feb 3, 2022, at 12:27 AM, David Perozzi  wrote:
> 
> Thanks for looking into that and sorry if I only included the version in use 
> in the pastebin. I'll ask the cluster support if they could install OMPI 
> master.
> 
> I really am unfamiliar with openmpi's codebase, so I haven't looked into it 
> and am very thankful that you could already identify possible places that I 
> could've "visited". One thing that I can add, however, is that I tried both 
> on the cluster (OMPI 4.0.2) and on my local machine (OMPI 4.0.3) to run a 
> dummy test, which basically consists of launching the following:
> 
> $ mpirun -rf rankfile -report-bindings --mca rmaps_rank_file_physical 1 echo 
> ""
> 
> I report here the results coming from the cluster, where I allocated 6 cores, 
> all on the same node:
> 
> $ numactl --show
> policy: default
> preferred node: current
> physcpubind: 3 11 12 13 21 29
> cpubind: 0 1
> nodebind: 0 1
> membind: 0 1 2 3 4 5 6 7
> 
> $ hostname
> eu-g1-018-1
> 
> 
> $ mpirun -rf rankfile -report-bindings --mca rmaps_rank_file_physical 1 echo 
> ""
> 
> [eu-g1-018-1:37621] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 
> 0[core 1[hwt 0]]: [B/B/./././.][]
> [eu-g1-018-1:37621] MCW rank 1 bound to socket 0[core 2[hwt 0]], socket 
> 0[core 4[hwt 0]]: [././B/./B/.][]
> 
> $ cat rankfile
> rank 0=eu-g1-018-1 slot=3,11
> rank 1=eu-g1-018-1 slot=12,21
> 
> However, if I change the rankfile to use an unavailable core location, e.g.
> 
> $ cat rankfile
> rank 0=eu-g1-018-1 slot=3,11
> rank 1=eu-g1-018-1 slot=12,28
> 
> I get no error message in return:
> 
> $ mpirun -rf rankfile -report-bindings --mca rmaps_rank_file_physical 1 echo 
> ""
> $
> 
> So, at least in this version, it is possible to get no error message in 
> return quite easily (but this is maybe one of the errors you said should never 
> happen).
> 
> I'll double (triple) check my python script that generates the rankfile 
> again, but as of now I'm pretty sure no nasty things should happen at that 
> level. Especially because in the case reported in my initial message one can 
> manually check that all locations are indeed allocated to my job (by 
> comparing the rankfile and the allocation.txt file).
> 
> I was wondering if somehow mpirun cannot find all the hosts sometimes (but 
> sometimes it can, so it's a mystery to me)?
> 
> Just wanted to point that out. Now I'll get in touch with the cluster support 
> to see if it's possible to test on master.
> 
> Cheers,
> David
> 
> On 03.02.22 01:59, Ralph Castain via users wrote:
>> Are you willing to try this with OMPI master? Asking because it would be 
>> hard to push changes all the way back to 4.0.x every time we want to see if 
>> we fixed something.
>> 
>> Also, few of us have any access to LSF, though I doubt that has much impact 
>> here as it sounds like the issue is in the rank_file mapper.
>> 
>> Glancing over the rank_file mapper in master branch, I only see a couple of 
>> places (both errors that should never happen) that wouldn't result in a 
>> gaudy "show help" message. It would be interesting to know if you are 
>> hitting those.
>> 
>> One way you could get more debug info is to ensure that OMPI is configured 
>> with --enable-debug and then add "--mca rmaps_base_verbose 5" to your cmd 
>> line.
>> 
>> 
>>> On Feb 2, 2022, at 3:46 PM, Christoph Niethammer  wrote:
>>> 
>>> The linked pastebin includes the following version information:
>>> 
>>> [1,0]:package:Open MPI spackapps@eu-c7-042-03 Distribution
>>> [1,0]:ompi:version:full:4.0.2
>>> [1,0]:ompi:version:repo:v4.0.2
>>> [1,0]:ompi:version:release_date:Oct 07, 2019
>>

Re: [OMPI users] Error using rankfile to bind multiple cores on the same node for threaded OpenMPI application

2022-02-02 Thread Ralph Castain via users
Are you willing to try this with OMPI master? Asking because it would be hard 
to push changes all the way back to 4.0.x every time we want to see if we fixed 
something.

Also, few of us have any access to LSF, though I doubt that has much impact 
here as it sounds like the issue is in the rank_file mapper.

Glancing over the rank_file mapper in master branch, I only see a couple of 
places (both errors that should never happen) that wouldn't result in a gaudy 
"show help" message. It would be interesting to know if you are hitting those.

One way you could get more debug info is to ensure that OMPI is configured with 
--enable-debug and then add "--mca rmaps_base_verbose 5" to your cmd line.
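
For example, combined with the rankfile invocation used elsewhere in this thread, that might look like:

$ mpirun --mca rmaps_base_verbose 5 -rf rankfile -report-bindings echo ""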


> On Feb 2, 2022, at 3:46 PM, Christoph Niethammer  wrote:
> 
> The linked pastebin includes the following version information:
> 
> [1,0]:package:Open MPI spackapps@eu-c7-042-03 Distribution
> [1,0]:ompi:version:full:4.0.2
> [1,0]:ompi:version:repo:v4.0.2
> [1,0]:ompi:version:release_date:Oct 07, 2019
> [1,0]:orte:version:full:4.0.2
> [1,0]:orte:version:repo:v4.0.2
> [1,0]:orte:version:release_date:Oct 07, 2019
> [1,0]:opal:version:full:4.0.2
> [1,0]:opal:version:repo:v4.0.2
> [1,0]:opal:version:release_date:Oct 07, 2019
> [1,0]:mpi-api:version:full:3.1.0
> [1,0]:ident:4.0.2
> 
> Best
> Christoph
> 
> - Original Message -
> From: "Open MPI Users" 
> To: "Open MPI Users" 
> Cc: "Ralph Castain" 
> Sent: Thursday, 3 February, 2022 00:22:30
> Subject: Re: [OMPI users] Error using rankfile to bind multiple cores on the 
> same node for threaded OpenMPI application
> 
> Errr...what version OMPI are you using?
> 
>> On Feb 2, 2022, at 3:03 PM, David Perozzi via users 
>>  wrote:
>> 
>> Hello,
>> 
>> I'm trying to run a code implemented with OpenMPI and OpenMP (for threading) 
>> on a large cluster that uses LSF for job scheduling and dispatch. The 
>> problem with LSF is that it is not very straightforward to allocate and bind 
>> the right number of threads to an MPI rank inside a single node. Therefore, 
>> I have to create a rankfile myself, as soon as the (a priori unknown) 
>> resources are allocated.
>> 
>> So, after my job gets dispatched, I run:
>> 
>> mpirun -n "$nslots" -display-allocation -nooversubscribe --map-by core:PE=1 
>> --bind-to core mpi_allocation/show_numactl.sh 
>> >mpi_allocation/allocation_files/allocation.txt
>> 
>> where show_numactl.sh consists of just one line:
>> 
>> { hostname; numactl --show; } | sed ':a;N;s/\n/ /;ba'
>> 
>> If I ask for 16 slots, in blocks of 4 (i.e., bsub -n 16 -R "span[block=4]"), 
>> I get something like:
>> 
>> ==   ALLOCATED NODES   ==
>>eu-g1-006-1: flags=0x11 slots=4 max_slots=0 slots_inuse=0 state=UP
>>eu-g1-009-2: flags=0x11 slots=4 max_slots=0 slots_inuse=0 state=UP
>>eu-g1-002-3: flags=0x11 slots=4 max_slots=0 slots_inuse=0 state=UP
>>eu-g1-005-1: flags=0x11 slots=4 max_slots=0 slots_inuse=0 state=UP
>> =
>> eu-g1-006-1 policy: default preferred node: current physcpubind: 16  
>> cpubind: 1  nodebind: 1  membind: 0 1 2 3 4 5 6 7
>> eu-g1-006-1 policy: default preferred node: current physcpubind: 24  
>> cpubind: 1  nodebind: 1  membind: 0 1 2 3 4 5 6 7
>> eu-g1-006-1 policy: default preferred node: current physcpubind: 32  
>> cpubind: 2  nodebind: 2  membind: 0 1 2 3 4 5 6 7
>> eu-g1-002-3 policy: default preferred node: current physcpubind: 21  
>> cpubind: 1  nodebind: 1  membind: 0 1 2 3 4 5 6 7
>> eu-g1-002-3 policy: default preferred node: current physcpubind: 22  
>> cpubind: 1  nodebind: 1  membind: 0 1 2 3 4 5 6 7
>> eu-g1-009-2 policy: default preferred node: current physcpubind: 0  cpubind: 
>> 0  nodebind: 0  membind: 0 1 2 3 4 5 6 7
>> eu-g1-009-2 policy: default preferred node: current physcpubind: 1  cpubind: 
>> 0  nodebind: 0  membind: 0 1 2 3 4 5 6 7
>> eu-g1-009-2 policy: default preferred node: current physcpubind: 2  cpubind: 
>> 0  nodebind: 0  membind: 0 1 2 3 4 5 6 7
>> eu-g1-002-3 policy: default preferred node: current physcpubind: 19  
>> cpubind: 1  nodebind: 1  membind: 0 1 2 3 4 5 6 7
>> eu-g1-002-3 policy: default preferred node: current physcpubind: 23  
>> cpubind: 1  nodebind: 1  membind: 0 1 2 3 4 5 6 7
>> eu-g1-006-1 policy: default preferred node: current physcpubind: 52  
>> cpubind: 3  nodebind: 3  membind: 0 1 2 3 4 5 6 7
>> eu-g1-009-2 policy: default preferred node: current physcpubind: 3  cpubind: 
>> 0  nodebind: 0  membind: 0 1 2 3 4 5 6 7
>> eu-g1-005-1 policy: default preferred node: current physcpubind: 90  
>> cpubind: 5  nodebind: 5  membind: 0 1 2 3 4 5 6 7
>> eu-g1-005-1 policy: default preferred node: current physcpubind: 91  
>> cpubind: 5  nodebind: 5  membind: 0 1 2 3 4 5 6 7
>> eu-g1-005-1 policy: default preferred node: current physcpubind: 94  
>> cpubind: 5  nodebind: 5  membind: 0 1 2 3 4 5 6 7
>> eu-g1-005-1 policy: default preferred node: current physcpubind: 95  
>> cpubi

Re: [OMPI users] Error using rankfile to bind multiple cores on the same node for threaded OpenMPI application

2022-02-02 Thread Ralph Castain via users
Errr...what version OMPI are you using?

> On Feb 2, 2022, at 3:03 PM, David Perozzi via users 
>  wrote:
> 
> Hello,
> 
> I'm trying to run a code implemented with OpenMPI and OpenMP (for threading) 
> on a large cluster that uses LSF for job scheduling and dispatch. The 
> problem with LSF is that it is not very straightforward to allocate and bind 
> the right number of threads to an MPI rank inside a single node. Therefore, I 
> have to create a rankfile myself, as soon as the (a priori unknown) 
> resources are allocated.
> 
> So, after my job gets dispatched, I run:
> 
> mpirun -n "$nslots" -display-allocation -nooversubscribe --map-by core:PE=1 
> --bind-to core mpi_allocation/show_numactl.sh 
> >mpi_allocation/allocation_files/allocation.txt
> 
> where show_numactl.sh consists of just one line:
> 
> { hostname; numactl --show; } | sed ':a;N;s/\n/ /;ba'
> 
> If I ask for 16 slots, in blocks of 4 (i.e., bsub -n 16 -R "span[block=4]"), 
> I get something like:
> 
> ==   ALLOCATED NODES   ==
> eu-g1-006-1: flags=0x11 slots=4 max_slots=0 slots_inuse=0 state=UP
> eu-g1-009-2: flags=0x11 slots=4 max_slots=0 slots_inuse=0 state=UP
> eu-g1-002-3: flags=0x11 slots=4 max_slots=0 slots_inuse=0 state=UP
> eu-g1-005-1: flags=0x11 slots=4 max_slots=0 slots_inuse=0 state=UP
> =
> eu-g1-006-1 policy: default preferred node: current physcpubind: 16  cpubind: 
> 1  nodebind: 1  membind: 0 1 2 3 4 5 6 7
> eu-g1-006-1 policy: default preferred node: current physcpubind: 24  cpubind: 
> 1  nodebind: 1  membind: 0 1 2 3 4 5 6 7
> eu-g1-006-1 policy: default preferred node: current physcpubind: 32  cpubind: 
> 2  nodebind: 2  membind: 0 1 2 3 4 5 6 7
> eu-g1-002-3 policy: default preferred node: current physcpubind: 21  cpubind: 
> 1  nodebind: 1  membind: 0 1 2 3 4 5 6 7
> eu-g1-002-3 policy: default preferred node: current physcpubind: 22  cpubind: 
> 1  nodebind: 1  membind: 0 1 2 3 4 5 6 7
> eu-g1-009-2 policy: default preferred node: current physcpubind: 0  cpubind: 
> 0  nodebind: 0  membind: 0 1 2 3 4 5 6 7
> eu-g1-009-2 policy: default preferred node: current physcpubind: 1  cpubind: 
> 0  nodebind: 0  membind: 0 1 2 3 4 5 6 7
> eu-g1-009-2 policy: default preferred node: current physcpubind: 2  cpubind: 
> 0  nodebind: 0  membind: 0 1 2 3 4 5 6 7
> eu-g1-002-3 policy: default preferred node: current physcpubind: 19  cpubind: 
> 1  nodebind: 1  membind: 0 1 2 3 4 5 6 7
> eu-g1-002-3 policy: default preferred node: current physcpubind: 23  cpubind: 
> 1  nodebind: 1  membind: 0 1 2 3 4 5 6 7
> eu-g1-006-1 policy: default preferred node: current physcpubind: 52  cpubind: 
> 3  nodebind: 3  membind: 0 1 2 3 4 5 6 7
> eu-g1-009-2 policy: default preferred node: current physcpubind: 3  cpubind: 
> 0  nodebind: 0  membind: 0 1 2 3 4 5 6 7
> eu-g1-005-1 policy: default preferred node: current physcpubind: 90  cpubind: 
> 5  nodebind: 5  membind: 0 1 2 3 4 5 6 7
> eu-g1-005-1 policy: default preferred node: current physcpubind: 91  cpubind: 
> 5  nodebind: 5  membind: 0 1 2 3 4 5 6 7
> eu-g1-005-1 policy: default preferred node: current physcpubind: 94  cpubind: 
> 5  nodebind: 5  membind: 0 1 2 3 4 5 6 7
> eu-g1-005-1 policy: default preferred node: current physcpubind: 95  cpubind: 
> 5  nodebind: 5  membind: 0 1 2 3 4 5 6 7
> 
> After that, I parse this allocation file in python and I create a hostfile 
> and a rankfile.
> 
> The hostfile reads:
> 
> eu-g1-006-1
> eu-g1-009-2
> eu-g1-002-3
> eu-g1-005-1
> 
> The rankfile:
> 
> rank 0=eu-g1-006-1 slot=16,24,32,52
> rank 1=eu-g1-009-2 slot=0,1,2,3
> rank 2=eu-g1-002-3 slot=21,22,19,23
> rank 3=eu-g1-005-1 slot=90,91,94,95
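
As an aside, a minimal sketch of what that parsing step could look like (illustrative only, not the exact script; it assumes one MPI rank per host and the "physcpubind: ... cpubind: ..." line format shown above):

#!/usr/bin/env python3
# Hypothetical sketch: build a hostfile and a rankfile from the allocation.txt
# produced by the show_numactl.sh run above (one rank per host; all of that
# host's physical core IDs become the rank's slot list).
import re
from collections import defaultdict

cores = defaultdict(list)   # host -> list of physical core ids
with open("mpi_allocation/allocation_files/allocation.txt") as f:
    for line in f:
        m = re.match(r"(\S+) .*physcpubind: ([\d ]+?)\s+cpubind:", line)
        if m:
            cores[m.group(1)].extend(m.group(2).split())

with open("hostfile", "w") as hf, open("rankfile", "w") as rf:
    for rank, (host, ids) in enumerate(cores.items()):
        hf.write(host + "\n")
        rf.write("rank %d=%s slot=%s\n" % (rank, host, ",".join(ids)))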
> 
> Following OpenMPI's manpages and FAQs, I then run my application using
> 
> mpirun -n "$nmpiproc" --rankfile mpi_allocation/hostfiles/rankfile --mca 
> rmaps_rank_file_physical 1 ./build/"$executable_name" true "$input_file"
> 
> where the bash variables are passed in directly in the bsub command (I 
> basically run bsub -n 16 -R "span[block=4]" "my_script.sh num_slots 
> num_thread_per_rank executable_name input_file").
> 
> 
> Now, this procedure sometimes works just fine, sometimes not. When it 
> doesn't, the problem is that I don't get any error message (I noticed that if 
> an error is made inside the rankfile, one does not get any error). Strangely, 
> it seems that for 16 slots and four threads (so 4 MPI ranks), it works better 
> if I have 8 slots allocated in two nodes than if I have 4 slots in 4 
> different nodes. My goal is to run the application with 256 slots and 32 
> threads per rank (the cluster has mainly AMD EPYC-based nodes).
> 
> The ompi information of the nodes running a failed job and the rankfile for 
> that failed job can be found at https://pastebin.com/40f6FigH and the 
> allocation file at https://pastebin.com/jeWnkU40
> 
> 
> Do you see any problem with my procedure? Why is it failing seemingly 
>

Re: [OMPI users] RES: OpenMPI - Intel MPI

2022-01-27 Thread Ralph Castain via users
Sure - but then we aren't talking about containers any more, just vendor vs 
OMPI. I'm not getting in the middle of that one!


On Jan 27, 2022, at 6:28 PM, Gilles Gouaillardet via users <users@lists.open-mpi.org> wrote:

Thanks Ralph,

Now I get what you had in mind.

Strictly speaking, you are making the assumption that Open MPI performance 
matches the system MPI performance.

This is generally true for common interconnects and/or those that feature 
providers for libfabric or UCX, but not so for "exotic" interconnects (that 
might not be supported natively by Open MPI or abstraction layers) and/or with 
an uncommon topology (for which collective communications are not fully 
optimized by Open MPI). In the latter case, using the system/vendor MPI is the 
best option performance-wise.

Cheers,

Gilles

On Fri, Jan 28, 2022 at 2:23 AM Ralph Castain via users <users@lists.open-mpi.org> wrote:
Just to complete this - there is always a lingering question regarding shared 
memory support. There are two ways to resolve that one:

* run one container per physical node, launching multiple procs in each 
container. The procs can then utilize shared memory _inside_ the container. 
This is the cleanest solution (i.e., minimizes container boundary violations), 
but some users need/want per-process isolation.

* run one container per MPI process, having each container then mount an 
_external_ common directory to an internal mount point. This allows each 
process to access the common shared memory location. As with the device 
drivers, you typically specify that external mount location when launching the 
container.

Using those combined methods, you can certainly have a "generic" container that 
suffers no performance impact from bare metal. The problem has been that it 
takes a certain degree of "container savvy" to set this up and make it work - 
which is beyond what most users really want to learn. I'm sure the container 
community is working on ways to reduce that burden (I'm not really plugged into 
those efforts, but others on this list might be).

Ralph


> On Jan 27, 2022, at 7:39 AM, Ralph H Castain <r...@open-mpi.org> wrote:
> 
>> Fair enough Ralph! I was implicitly assuming a "build once / run everywhere" 
>> use case, my bad for not making my assumption clear.
>> If the container is built to run on a specific host, there are indeed other 
>> options to achieve near-native performance.
>> 
> 
> Err...that isn't actually what I meant, nor what we did. You can, in fact, 
> build a container that can "run everywhere" while still employing high-speed 
> fabric support. What you do is:
> 
> * configure OMPI with all the fabrics enabled (or at least all the ones you 
> care about)
> 
> * don't include the fabric drivers in your container. These can/will vary 
> across deployments, especially those (like NVIDIA's) that involve kernel 
> modules
> 
> * setup your container to mount specified external device driver locations 
> onto the locations where you configured OMPI to find them. Sadly, this does 
> violate the container boundary - but nobody has come up with another 
> solution, and at least the violation is confined to just the device drivers. 
> Typically, you specify the external locations that are to be mounted using an 
> envar or some other mechanism appropriate to your container, and then include 
> the relevant information when launching the containers.

> 
> When OMPI initializes, it will do its normal procedure of attempting to load 
> each fabric's drivers, selecting the transports whose drivers it can load. 
> NOTE: beginning with OMPI v5, you'll need to explicitly tell OMPI to build 
> without statically linking in the fabric plugins or else this probably will 
> fail.
> 
> At least one vendor now distributes OMPI containers preconfigured with their 
> fabric support based on this method. So using a "generic" container doesn't 
> mean you lose performance - in fact, our tests showed zero impact on 
> performance using this method.
> 
> HTH
> Ralph
> 





Re: [OMPI users] RES: OpenMPI - Intel MPI

2022-01-27 Thread Ralph Castain via users
https://openpmix.github.io/uploads/2019/04/PMIxSUG2019.pdf
[PPX] https://www.slideshare.net/rcastain/pmix-bridging-the-container-boundary
[video] https://www.sylabs.io/2019/04/sug-talk-intels-ralph-castain-on-bridging-the-container-boundary-with-pmix/



  Thanks again,
  - Brian


On Thu, Jan 27, 2022 at 10:22 AM Ralph Castain via users <users@lists.open-mpi.org> wrote:
Just to complete this - there is always a lingering question regarding shared 
memory support. There are two ways to resolve that one:

* run one container per physical node, launching multiple procs in each 
container. The procs can then utilize shared memory _inside_ the container. 
This is the cleanest solution (i.e., minimizes container boundary violations), 
but some users need/want per-process isolation.

* run one container per MPI process, having each container then mount an 
_external_ common directory to an internal mount point. This allows each 
process to access the common shared memory location. As with the device 
drivers, you typically specify that external mount location when launching the 
container.

Using those combined methods, you can certainly have a "generic" container that 
suffers no performance impact from bare metal. The problem has been that it 
takes a certain degree of "container savvy" to set this up and make it work - 
which is beyond what most users really want to learn. I'm sure the container 
community is working on ways to reduce that burden (I'm not really plugged into 
those efforts, but others on this list might be).

Ralph


> On Jan 27, 2022, at 7:39 AM, Ralph H Castain <r...@open-mpi.org> wrote:
> 
>> Fair enough Ralph! I was implicitly assuming a "build once / run everywhere" 
>> use case, my bad for not making my assumption clear.
>> If the container is built to run on a specific host, there are indeed other 
>> options to achieve near-native performance.
>> 
> 
> Err...that isn't actually what I meant, nor what we did. You can, in fact, 
> build a container that can "run everywhere" while still employing high-speed 
> fabric support. What you do is:
> 
> * configure OMPI with all the fabrics enabled (or at least all the ones you 
> care about)
> 
> * don't include the fabric drivers in your container. These can/will vary 
> across deployments, especially those (like NVIDIA's) that involve kernel 
> modules
> 
> * setup your container to mount specified external device driver locations 
> onto the locations where you configured OMPI to find them. Sadly, this does 
> violate the container boundary - but nobody has come up with another 
> solution, and at least the violation is confined to just the device drivers. 
> Typically, you specify the external locations that are to be mounted using an 
> envar or some other mechanism appropriate to your container, and then include 
> the relevant information when launching the containers.

> 
> When OMPI initializes, it will do its normal procedure of attempting to load 
> each fabric's drivers, selecting the transports whose drivers it can load. 
> NOTE: beginning with OMPI v5, you'll need to explicitly tell OMPI to build 
> without statically linking in the fabric plugins or else this probably will 
> fail.
> 
> At least one vendor now distributes OMPI containers preconfigured with their 
> fabric support based on this method. So using a "generic" container doesn't 
> mean you lose performance - in fact, our tests showed zero impact on 
> performance using this method.
> 
> HTH
> Ralph
> 





Re: [OMPI users] RES: OpenMPI - Intel MPI

2022-01-27 Thread Ralph Castain via users
Just to complete this - there is always a lingering question regarding shared 
memory support. There are two ways to resolve that one:

* run one container per physical node, launching multiple procs in each 
container. The procs can then utilize shared memory _inside_ the container. 
This is the cleanest solution (i.e., minimizes container boundary violations), 
but some users need/want per-process isolation.

* run one container per MPI process, having each container then mount an 
_external_ common directory to an internal mount point. This allows each 
process to access the common shared memory location. As with the device 
drivers, you typically specify that external mount location when launching the 
container.
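
As a purely illustrative sketch of that second option (paths and image name are made up), a Docker launch might bind the host's /dev/shm into each per-process container, e.g.:

  docker run -v /dev/shm:/dev/shm my-ompi-image ./my_mpi_app

so that all the containers on a node see the same shared-memory location.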

Using those combined methods, you can certainly have a "generic" container that 
suffers no performance impact from bare metal. The problem has been that it 
takes a certain degree of "container savvy" to set this up and make it work - 
which is beyond what most users really want to learn. I'm sure the container 
community is working on ways to reduce that burden (I'm not really plugged into 
those efforts, but others on this list might be).

Ralph


> On Jan 27, 2022, at 7:39 AM, Ralph H Castain  wrote:
> 
>> Fair enough Ralph! I was implicitly assuming a "build once / run everywhere" 
>> use case, my bad for not making my assumption clear.
>> If the container is built to run on a specific host, there are indeed other 
>> options to achieve near-native performance.
>> 
> 
> Err...that isn't actually what I meant, nor what we did. You can, in fact, 
> build a container that can "run everywhere" while still employing high-speed 
> fabric support. What you do is:
> 
> * configure OMPI with all the fabrics enabled (or at least all the ones you 
> care about)
> 
> * don't include the fabric drivers in your container. These can/will vary 
> across deployments, especially those (like NVIDIA's) that involve kernel 
> modules
> 
> * setup your container to mount specified external device driver locations 
> onto the locations where you configured OMPI to find them. Sadly, this does 
> violate the container boundary - but nobody has come up with another 
> solution, and at least the violation is confined to just the device drivers. 
> Typically, you specify the external locations that are to be mounted using an 
> envar or some other mechanism appropriate to your container, and then include 
> the relevant information when launching the containers.
> 
> When OMPI initializes, it will do its normal procedure of attempting to load 
> each fabric's drivers, selecting the transports whose drivers it can load. 
> NOTE: beginning with OMPI v5, you'll need to explicitly tell OMPI to build 
> without statically linking in the fabric plugins or else this probably will 
> fail.
> 
> At least one vendor now distributes OMPI containers preconfigured with their 
> fabric support based on this method. So using a "generic" container doesn't 
> mean you lose performance - in fact, our tests showed zero impact on 
> performance using this method.
> 
> HTH
> Ralph
> 




Re: [OMPI users] RES: OpenMPI - Intel MPI

2022-01-27 Thread Ralph Castain via users
> Fair enough Ralph! I was implicitly assuming a "build once / run everywhere" 
> use case, my bad for not making my assumption clear.
> If the container is built to run on a specific host, there are indeed other 
> options to achieve near-native performance.
> 

Err...that isn't actually what I meant, nor what we did. You can, in fact, 
build a container that can "run everywhere" while still employing high-speed 
fabric support. What you do is:

* configure OMPI with all the fabrics enabled (or at least all the ones you 
care about)

* don't include the fabric drivers in your container. These can/will vary 
across deployments, especially those (like NVIDIA's) that involve kernel modules

* setup your container to mount specified external device driver locations onto 
the locations where you configured OMPI to find them. Sadly, this does violate 
the container boundary - but nobody has come up with another solution, and at 
least the violation is confined to just the device drivers. Typically, you 
specify the external locations that are to be mounted using an envar or some 
other mechanism appropriate to your container, and then include the relevant 
information when launching the containers.
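
Illustrative only (the bind paths and image name below are assumptions, not a recipe): with Singularity, that mount step might look something like

  singularity exec --bind /opt/mellanox:/opt/mellanox,/etc/libibverbs.d:/etc/libibverbs.d my_ompi.sif ./my_mpi_app

with the bind sources pointing at wherever the host actually keeps its fabric drivers.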

When OMPI initializes, it will do its normal procedure of attempting to load 
each fabric's drivers, selecting the transports whose drivers it can load. 
NOTE: beginning with OMPI v5, you'll need to explicitly tell OMPI to build 
without statically linking in the fabric plugins or else this probably will 
fail.

At least one vendor now distributes OMPI containers preconfigured with their 
fabric support based on this method. So using a "generic" container doesn't 
mean you lose performance - in fact, our tests showed zero impact on 
performance using this method.

HTH
Ralph




Re: [OMPI users] RES: OpenMPI - Intel MPI

2022-01-26 Thread Ralph Castain via users
I'll disagree a bit there. You do want to use an MPI library in your container 
that is configured to perform on the host cluster. However, that doesn't mean 
you are constrained as Gilles describes. It takes a little more setup 
knowledge, true, but there are lots of instructions and knowledgeable people 
out there to help. Experiments have shown that using non-system MPIs provides at 
least equivalent performance to the native MPIs when configured appropriately. 
Matching the internal/external MPI implementations may simplify the mechanics of 
setting it up, but it is definitely not required.


On Jan 26, 2022, at 8:55 PM, Gilles Gouaillardet via users <users@lists.open-mpi.org> wrote:

Brian,

FWIW

Keep in mind that when running a container on a supercomputer, it is generally 
recommended to use the supercomputer's MPI implementation (fine-tuned and with 
support for the high-speed interconnect) instead of the one in the container 
(generally a vanilla MPI with basic support for TCP and shared memory).
That scenario implies several additional constraints, and one of them is that the 
MPI libraries of the host and the container are (oversimplified) ABI compatible.

In your case, you would have to rebuild your container with MPICH (instead of 
Open MPI) so it can be "substituted" at run time with Intel MPI (MPICH based 
and ABI compatible).

Cheers,

Gilles

On Thu, Jan 27, 2022 at 1:07 PM Brian Dobbins via users <users@lists.open-mpi.org> wrote:

Hi Ralph,

  Thanks for the explanation - in hindsight, that makes perfect sense, since 
each process is operating inside the container and will of course load up 
identical libraries, so data types/sizes can't be inconsistent.  I don't know 
why I didn't realize that before.  I imagine the past issues I'd experienced 
were just due to the PMI differences in the different MPI implementations at 
the time.  I owe you a beer or something at the next in-person SC conference!

  Cheers,
  - Brian


On Wed, Jan 26, 2022 at 4:54 PM Ralph Castain via users <users@lists.open-mpi.org> wrote:
There is indeed an ABI difference. However, the _launcher_ doesn't have 
anything to do with the MPI library. All that is needed is a launcher that can 
provide the key exchange required to wireup the MPI processes. At this point, 
both MPICH and OMPI have PMIx support, so you can use the same launcher for 
both. IMPI does not, and so the IMPI launcher will only support PMI-1 or PMI-2 
(I forget which one).

You can, however, work around that problem. For example, if the host system is 
using Slurm, then you could "srun" the containers and let Slurm perform the 
wireup. Again, you'd have to ensure that OMPI was built to support whatever 
wireup protocol the Slurm installation supported (which might well be PMIx 
today). Also works on Cray/ALPS. Completely bypasses the IMPI issue.

Another option I've seen used is to have the host system start the containers 
(using ssh or whatever), providing the containers with access to a "hostfile" 
identifying the TCP address of each container. It is then easy for OMPI's 
mpirun to launch the job across the containers. I use this every day on my 
machine (using Docker Desktop with Docker containers, but the container tech is 
irrelevant here) to test OMPI. Pretty easy to set that up, and I should think 
the sys admins could do so for their users.

Finally, you could always install the PMIx Reference RTE (PRRTE) on the cluster 
as that executes at user level, and then use PRRTE to launch your OMPI 
containers. OMPI runs very well under PRRTE - in fact, PRRTE is the RTE 
embedded in OMPI starting with the v5.0 release.

Regardless of your choice of method, the presence of IMPI doesn't preclude 
using OMPI containers so long as the OMPI library is fully contained in that 
container. Choice of launch method just depends on how your system is setup.

Ralph


On Jan 26, 2022, at 3:17 PM, Brian Dobbins <bdobb...@gmail.com> wrote:


Hi Ralph,

Afraid I don't understand. If your image has the OMPI libraries installed in 
it, what difference does it make what is on your host? You'll never see the 
IMPI installation.

We have been supporting people running that way since Singularity was 
originally released, without any problems. The only time you can hit an issue 
is if you try to mount the MPI libraries from the host (i.e., violate the 
container boundary) - so don't do that and you should be fine.

  Can you clarify what you mean here?  I thought there was an ABI difference 
between the various MPICH-based MPIs and OpenMPI, meaning you can't use a 
host's Intel MPI to launch a container's OpenMPI-compiled program.  You can use 
the internal-to-the-container OpenMPI to launch everything, which is easy for 
single-node runs but more challenging for multi-node ones.  Maybe my 
understanding is wrong or out 

Re: [OMPI users] RES: OpenMPI - Intel MPI

2022-01-26 Thread Ralph Castain via users
There is indeed an ABI difference. However, the _launcher_ doesn't have 
anything to do with the MPI library. All that is needed is a launcher that can 
provide the key exchange required to wireup the MPI processes. At this point, 
both MPICH and OMPI have PMIx support, so you can use the same launcher for 
both. IMPI does not, and so the IMPI launcher will only support PMI-1 or PMI-2 
(I forget which one).

You can, however, work around that problem. For example, if the host system is 
using Slurm, then you could "srun" the containers and let Slurm perform the 
wireup. Again, you'd have to ensure that OMPI was built to support whatever 
wireup protocol the Slurm installation supported (which might well be PMIx 
today). Also works on Cray/ALPS. Completely bypasses the IMPI issue.
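
For a concrete flavor (image name and process counts are made up), that kind of launch could look like

  srun --mpi=pmix -N 4 --ntasks-per-node=16 singularity exec my_ompi.sif ./my_mpi_app

with the OMPI inside the image built with PMIx support so it can pick up Slurm's wireup.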

Another option I've seen used is to have the host system start the containers 
(using ssh or whatever), providing the containers with access to a "hostfile" 
identifying the TCP address of each container. It is then easy for OMPI's 
mpirun to launch the job across the containers. I use this every day on my 
machine (using Docker Desktop with Docker containers, but the container tech is 
irrelevant here) to test OMPI. Pretty easy to set that up, and I should think 
the sys admins could do so for their users.
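
Sketching that with made-up container addresses, the hostfile might read

  172.17.0.2 slots=1
  172.17.0.3 slots=1

and then, from inside one of the containers,

  mpirun --hostfile containers.txt -np 2 ./my_mpi_app

assuming ssh connectivity between the containers.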

Finally, you could always install the PMIx Reference RTE (PRRTE) on the cluster 
as that executes at user level, and then use PRRTE to launch your OMPI 
containers. OMPI runs very well under PRRTE - in fact, PRRTE is the RTE 
embedded in OMPI starting with the v5.0 release.

Regardless of your choice of method, the presence of IMPI doesn't preclude 
using OMPI containers so long as the OMPI library is fully contained in that 
container. Choice of launch method just depends on how your system is setup.

Ralph


On Jan 26, 2022, at 3:17 PM, Brian Dobbins <bdobb...@gmail.com> wrote:


Hi Ralph,

Afraid I don't understand. If your image has the OMPI libraries installed in 
it, what difference does it make what is on your host? You'll never see the 
IMPI installation.

We have been supporting people running that way since Singularity was 
originally released, without any problems. The only time you can hit an issue 
is if you try to mount the MPI libraries from the host (i.e., violate the 
container boundary) - so don't do that and you should be fine.

  Can you clarify what you mean here?  I thought there was an ABI difference 
between the various MPICH-based MPIs and OpenMPI, meaning you can't use a 
host's Intel MPI to launch a container's OpenMPI-compiled program.  You can use 
the internal-to-the-container OpenMPI to launch everything, which is easy for 
single-node runs but more challenging for multi-node ones.  Maybe my 
understanding is wrong or out of date though?

  Thanks,
  - Brian

 

On Jan 26, 2022, at 12:19 PM, Luis Alfredo Pires Barbosa <luis_pire...@hotmail.com> wrote:

Hi Ralph,

My singularity image has OpenMPI, but my host doesn't (it has Intel MPI). And I am not 
sure if the system would work with Intel + OpenMPI.

Luis

Sent from Mail for Windows

From: Ralph Castain via users <users@lists.open-mpi.org>
Sent: Wednesday, 26 January 2022 16:01
To: Open MPI Users <users@lists.open-mpi.org>
Cc: Ralph Castain <r...@open-mpi.org>
Subject: Re: [OMPI users] OpenMPI - Intel MPI
 Err...the whole point of a container is to put all the library dependencies 
_inside_ it. So why don't you just install OMPI in your singularity image?
 

On Jan 26, 2022, at 6:42 AM, Luis Alfredo Pires Barbosa via users <users@lists.open-mpi.org> wrote:
Hello all,

I have Intel MPI on my cluster, but I am running a singularity image of a 
piece of software which uses OpenMPI.

Since they may not be compatible, I don't think it is possible to get these 
two different MPIs running on the system.
I wonder if there is some workaround for this issue.

Any insight would be welcome.
Luis




Re: [OMPI users] RES: OpenMPI - Intel MPI

2022-01-26 Thread Ralph Castain via users
Afraid I don't understand. If your image has the OMPI libraries installed in 
it, what difference does it make what is on your host? You'll never see the 
IMPI installation.

We have been supporting people running that way since Singularity was 
originally released, without any problems. The only time you can hit an issue 
is if you try to mount the MPI libraries from the host (i.e., violate the 
container boundary) - so don't do that and you should be fine.


On Jan 26, 2022, at 12:19 PM, Luis Alfredo Pires Barbosa <luis_pire...@hotmail.com> wrote:

Hi Ralph,

My singularity image has OpenMPI, but my host doesn't (it has Intel MPI). And I am not 
sure if the system would work with Intel + OpenMPI.

Luis

Sent from Mail for Windows

From: Ralph Castain via users <users@lists.open-mpi.org>
Sent: Wednesday, 26 January 2022 16:01
To: Open MPI Users <users@lists.open-mpi.org>
Cc: Ralph Castain <r...@open-mpi.org>
Subject: Re: [OMPI users] OpenMPI - Intel MPI
 Err...the whole point of a container is to put all the library dependencies 
_inside_ it. So why don't you just install OMPI in your singularity image?
 

On Jan 26, 2022, at 6:42 AM, Luis Alfredo Pires Barbosa via users <users@lists.open-mpi.org> wrote:
Hello all,

I have Intel MPI on my cluster, but I am running a singularity image of a 
piece of software which uses OpenMPI.

Since they may not be compatible, I don't think it is possible to get these 
two different MPIs running on the system.
I wonder if there is some workaround for this issue.

Any insight would be welcome.
Luis



Re: [OMPI users] OpenMPI - Intel MPI

2022-01-26 Thread Ralph Castain via users
Err...the whole point of a container is to put all the library dependencies 
_inside_ it. So why don't you just install OMPI in your singularity image?
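
A minimal sketch of a Singularity definition file doing that (base image, OMPI version, and install prefix are arbitrary assumptions for the example):

  Bootstrap: docker
  From: ubuntu:20.04

  %post
      apt-get update && apt-get install -y build-essential gfortran wget
      wget https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.2.tar.gz
      tar xzf openmpi-4.1.2.tar.gz && cd openmpi-4.1.2
      ./configure --prefix=/opt/ompi && make -j"$(nproc)" install

  %environment
      export PATH=/opt/ompi/bin:$PATH
      export LD_LIBRARY_PATH=/opt/ompi/lib:$LD_LIBRARY_PATH

The application is then compiled against that in-image OMPI, and the host never needs a matching MPI.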


On Jan 26, 2022, at 6:42 AM, Luis Alfredo Pires Barbosa via users <users@lists.open-mpi.org> wrote:

Hello all,

I have Intel MPI on my cluster, but I am running a singularity image of a piece of 
software which uses OpenMPI.

Since they may not be compatible, I don't think it is possible to get these 
two different MPIs running on the system.
I wonder if there is some workaround for this issue.

Any insight would be welcome.
Luis



Re: [OMPI users] Creating An MPI Job from Procs Launched by a Different Launcher

2022-01-25 Thread Ralph Castain via users
Short answer is yes, but it is a bit complicated to do.

On Jan 25, 2022, at 12:28 PM, Saliya Ekanayake via users <users@lists.open-mpi.org> wrote:

Hi,

I am trying to run an MPI program on a platform that launches the processes 
using a custom launcher (not mpiexec). This will end up spawning N processes of 
the program, but I am not sure if MPI_Init() would work or not in this case?

Is it possible to have a group of processes launched by some other means to be 
tied into an MPI communicator?

Thank you,
Saliyaf





Re: [OMPI users] Open MPI + Slurm + lmod

2022-01-25 Thread Ralph Castain via users
Never seen anything like that before - am I reading those errors correctly that 
it cannot find the "write" function symbol in libc?? Frankly, if that's true 
then it sounds like something is borked in the system.


> On Jan 25, 2022, at 8:26 AM, Matthias Leopold via users 
>  wrote:
> 
> just in case anyone wants to do more debugging: I ran "srun --mpi=pmix" now 
> with "LD_DEBUG=all", the lines preceding the error are
> 
>   1263345:symbol=write;  lookup in 
> file=/lib/x86_64-linux-gnu/libpthread.so.0 [0]
> 
>   1263345:binding file 
> /msc/sw/hpc-sdk/Linux_x86_64/21.9/comm_libs/mpi/lib/libopen-pal.so.40 [0] to 
> /lib/x86_64-linux-gnu/libpthread.so.0 [0]: normal symbol `write' [GLIBC_2.2.5]
> 
> [foo:1263345] OPAL ERROR: Error in file 
> ../../../../../opal/mca/pmix/pmix3x/pmix3x_client.c at line 112
> 
> 
> again: PMIx library version used by SLURM is 3.2.3
> 
> thx
> Matthias
> 
> Am 25.01.22 um 11:04 schrieb Gilles Gouaillardet:
>> Matthias,
>> Thanks for the clarifications.
>> Unfortunately, I cannot connect the dots and I must be missing something.
>> If I recap correctly:
>>  - SLURM has builtin PMIx support
>>  - Open MPI has builtin PMIx support
>>  - srun explicitly requires PMIx (srun --mpi=pmix_v3 ...)
>>  - and yet Open MPI issues an error message stating missing support for PMI 
>> (aka SLURM provided PMI1/PMI2)
>> So it seems Open PMI builtin PMIx client is unable to find/communicate with 
>> SLURM PMIx server
>> PMIx has cross version compatibility (e.g. client and server can have some 
>> different versions), but with some restrictions
>> Could this be the root cause?
>> What is the PMIx library version used by SLURM?
>> Ralph, do you see something wrong on why Open MPI and SLURM cannot 
>> communicate via PMIx?
>> Cheers,
>> Gilles
>> On Tue, Jan 25, 2022 at 5:47 PM Matthias Leopold <matthias.leop...@meduniwien.ac.at> wrote:
>>Hi Gilles,
>>I'm indeed using srun; I haven't had luck using mpirun yet.
>>Are options 2 and 3 of your list really different things? As far as I
>>understand now, I need "Open MPI with PMI support", THEN I can use srun
>>with PMIx. Right now using "srun --mpi=pmix(_v3)" gives the error
>>mentioned below.
>>Best,
>>Matthias
>>Am 25.01.22 um 07:17 schrieb Gilles Gouaillardet via users:
>> > Matthias,
>> >
>> > do you run the MPI application with mpirun or srun?
>> >
>> > The error log suggests you are using srun, and SLURM only
>>provides only
>> > PMI support.
>> > If this is the case, then you have three options:
>> >   - use mpirun
>> >   - rebuild Open MPI with PMI support as Ralph previously explained
>> >   - use SLURM PMIx:
>> >  srun --mpi=list
>> >  will list the PMI flavors provided by SLURM
>> > a) if PMIx is not supported, contact your sysadmin and ask for it
>> > b) if PMIx is supported but is not the default, ask for it, for
>> > example with
>> > srun --mpi=pmix_v3 ...
>> >
>> > Cheers,
>> >
>> > Gilles
>> >
>> > On Tue, Jan 25, 2022 at 12:30 AM Ralph Castain via users <users@lists.open-mpi.org> wrote:
>> >
>> > You should probably ask them - I see in the top one that they
>>used a
>> > platform file, which likely had the missing option in it. The
>>bottom
>> > one does not use that platform file, so it was probably missed.
>> >
>> >
>> >  > On Jan 24, 2022, at 7:17 AM, Matthias Leopold via users <users@lists.open-mpi.org> wrote:
>> >  >
>> >  > To be sure: both packages were provided by NVIDIA (I didn't
>> > compile them)
>> >  >
>> >  > Am 24.01.22 um 16:13 schrieb Matthias Leopold:
>> >  >> Thx, but I don't see this option in any of the two versions:
>> >  >> /usr/mpi/gcc/openmpi-4.1.2a1/bin/ompi_info (works with
>>slurm):
>> >  >>   Configure command line: '--build=x86_6

Re: [OMPI users] Open MPI + Slurm + lmod

2022-01-24 Thread Ralph Castain via users
You should probably ask them - I see in the top one that they used a platform 
file, which likely had the missing option in it. The bottom one does not use 
that platform file, so it was probably missed.


> On Jan 24, 2022, at 7:17 AM, Matthias Leopold via users 
>  wrote:
> 
> To be sure: both packages were provided by NVIDIA (I didn't compile them)
> 
> Am 24.01.22 um 16:13 schrieb Matthias Leopold:
>> Thx, but I don't see this option in any of the two versions:
>> /usr/mpi/gcc/openmpi-4.1.2a1/bin/ompi_info (works with slurm):
>>   Configure command line: '--build=x86_64-linux-gnu' '--prefix=/usr' 
>> '--includedir=${prefix}/include' '--mandir=${prefix}/share/man' 
>> '--infodir=${prefix}/share/info' '--sysconfdir=/etc' '--localstatedir=/var' 
>> '--disable-silent-rules' '--libexecdir=${prefix}/lib/openmpi' 
>> '--disable-maintainer-mode' '--disable-dependency-tracking' 
>> '--prefix=/usr/mpi/gcc/openmpi-4.1.2a1' 
>> '--with-platform=contrib/platform/mellanox/optimized'
>> lmod ompi (doesn't work with slurm)
>>   Configure command line: 
>> '--prefix=/proj/nv/libraries/Linux_x86_64/dev/openmpi4/205295-dev-clean-1' 
>> 'CC=nvc -nomp' 'CXX=nvc++ -nomp' 'FC=nvfortran -nomp' 'CFLAGS=-O1 -fPIC -c99 
>> -tp p7-64' 'CXXFLAGS=-O1 -fPIC -tp p7-64' 'FCFLAGS=-O1 -fPIC -tp p7-64' 
>> 'LD=ld' '--enable-shared' '--enable-static' '--without-tm' 
>> '--enable-mpi-cxx' '--disable-wrapper-runpath' 
>> '--enable-mpirun-prefix-by-default' '--with-libevent=internal' 
>> '--with-slurm' '--without-libnl' '--enable-mpi1-compatibility' 
>> '--enable-mca-no-build=btl-uct' '--without-verbs' 
>> '--with-cuda=/proj/cuda/11.0/Linux_x86_64' 
>> '--with-ucx=/proj/nv/libraries/Linux_x86_64/dev/openmpi4/205295-dev-clean-1' 
>> Matthias
>> Am 24.01.22 um 15:59 schrieb Ralph Castain via users:
>>> If you look at your configure line, you forgot to include 
>>> --with-pmi=<path to the Slurm PMI installation>. We don't build the Slurm PMI 
>>> support by default due to the GPL licensing issues - you have to point at it.
>>> 
>>> 
>>>> On Jan 24, 2022, at 6:41 AM, Matthias Leopold via users 
>>>>  wrote:
>>>> 
>>>> Hi,
>>>> 
>>>> we have 2 DGX A100 machines and I'm trying to run nccl-tests 
>>>> (https://github.com/NVIDIA/nccl-tests) in various ways to understand how 
>>>> things work.
>>>> 
>>>> I can successfully run nccl-tests on both nodes with Slurm (via srun) when 
>>>> built directly on a compute node against Open MPI 4.1.2 coming from a 
>>>> NVIDIA deb package.
>>>> 
>>>> I can also build nccl-tests in a lmod environment with NVIDIA HPC SDK 
>>>> 21.09 with Open MPI 4.0.5. When I run this with Slurm (via srun) I get the 
>>>> following message:
>>>> 
>>>> [foo:1140698] OPAL ERROR: Error in file 
>>>> ../../../../../opal/mca/pmix/pmix3x/pmix3x_client.c at line 112
>>>> 
>>>> -- 
>>>> 
>>>> The application appears to have been direct launched using "srun",
>>>> 
>>>> but OMPI was not built with SLURM's PMI support and therefore cannot
>>>> 
>>>> execute. There are several options for building PMI support under
>>>> 
>>>> SLURM, depending upon the SLURM version you are using:
>>>> 
>>>> 
>>>> 
>>>>   version 16.05 or later: you can use SLURM's PMIx support. This
>>>> 
>>>>   requires that you configure and build SLURM --with-pmix.
>>>> 
>>>> 
>>>> 
>>>>   Versions earlier than 16.05: you must use either SLURM's PMI-1 or
>>>> 
>>>>   PMI-2 support. SLURM builds PMI-1 by default, or you can manually
>>>> 
>>>>   install PMI-2. You must then build Open MPI using --with-pmi pointing
>>>> 
>>>>   to the SLURM PMI library location.
>>>> 
>>>> 
>>>> 
>>>> Please configure as appropriate and try again.
>>>> 
>>>> -- 
>>>> 
>>>&

Re: [OMPI users] Open MPI + Slurm + lmod

2022-01-24 Thread Ralph Castain via users
If you look at your configure line, you forgot to include 
--with-pmi=<path to the Slurm PMI installation>. We don't build the Slurm PMI 
support by default due to the GPL licensing issues - you have to point at it.
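
Illustrative only (the Slurm prefix below is an assumption for the example):

  ./configure --prefix=/opt/ompi --with-slurm --with-pmi=/opt/slurm

where /opt/slurm is wherever that system's Slurm PMI headers and libraries actually live.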


> On Jan 24, 2022, at 6:41 AM, Matthias Leopold via users 
>  wrote:
> 
> Hi,
> 
> we have 2 DGX A100 machines and I'm trying to run nccl-tests 
> (https://github.com/NVIDIA/nccl-tests) in various ways to understand how 
> things work.
> 
> I can successfully run nccl-tests on both nodes with Slurm (via srun) when 
> built directly on a compute node against Open MPI 4.1.2 coming from a NVIDIA 
> deb package.
> 
> I can also build nccl-tests in a lmod environment with NVIDIA HPC SDK 21.09 
> with Open MPI 4.0.5. When I run this with Slurm (via srun) I get the 
> following message:
> 
> [foo:1140698] OPAL ERROR: Error in file 
> ../../../../../opal/mca/pmix/pmix3x/pmix3x_client.c at line 112
> 
> --
> 
> The application appears to have been direct launched using "srun",
> 
> but OMPI was not built with SLURM's PMI support and therefore cannot
> 
> execute. There are several options for building PMI support under
> 
> SLURM, depending upon the SLURM version you are using:
> 
> 
> 
>  version 16.05 or later: you can use SLURM's PMIx support. This
> 
>  requires that you configure and build SLURM --with-pmix.
> 
> 
> 
>  Versions earlier than 16.05: you must use either SLURM's PMI-1 or
> 
>  PMI-2 support. SLURM builds PMI-1 by default, or you can manually
> 
>  install PMI-2. You must then build Open MPI using --with-pmi pointing
> 
>  to the SLURM PMI library location.
> 
> 
> 
> Please configure as appropriate and try again.
> 
> --
> 
> *** An error occurred in MPI_Init
> 
> *** on a NULL communicator
> 
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> 
> ***and potentially your MPI job)
> 
> 
> 
> When I look at PMI support in both Open MPI packages I don't see a lot of 
> difference:
> 
> “/usr/mpi/gcc/openmpi-4.1.2a1/bin/ompi_info --parsable | grep -i pmi”:
> 
> mca:pmix:isolated:version:“mca:2.1.0”
> mca:pmix:isolated:version:“api:2.0.0”
> mca:pmix:isolated:version:“component:4.1.2”
> mca:pmix:flux:version:“mca:2.1.0”
> mca:pmix:flux:version:“api:2.0.0”
> mca:pmix:flux:version:“component:4.1.2”
> mca:pmix:pmix3x:version:“mca:2.1.0”
> mca:pmix:pmix3x:version:“api:2.0.0”
> mca:pmix:pmix3x:version:“component:4.1.2”
> mca:ess:pmi:version:“mca:2.1.0”
> mca:ess:pmi:version:“api:3.0.0”
> mca:ess:pmi:version:“component:4.1.2”
> 
> “/msc/sw/hpc-sdk/Linux_x86_64/21.9/comm_libs/mpi/bin/ompi_info --parsable | 
> grep -i pmi”:
> 
> mca:pmix:isolated:version:“mca:2.1.0”
> mca:pmix:isolated:version:“api:2.0.0”
> mca:pmix:isolated:version:“component:4.0.5”
> mca:pmix:pmix3x:version:“mca:2.1.0”
> mca:pmix:pmix3x:version:“api:2.0.0”
> mca:pmix:pmix3x:version:“component:4.0.5”
> mca:ess:pmi:version:“mca:2.1.0”
> mca:ess:pmi:version:“api:3.0.0”
> mca:ess:pmi:version:“component:4.0.5”
> 
> I don't know if I'm looking in the right place, but to me this seems to be an 
> Open MPI topic, which is why I'm posting here. Please explain what's 
> missing in my case.
> 
> Slurm is 21.08.5. "MpiDefault" in slurm.conf is "pmix".
> Both Open MPI versions have Slurm support.
> 
> thx
> Matthias




Re: [OMPI users] OMPI_COMM_WORLD_LOCAL_SIZE problem between PBS and MLNX_OFED

2022-01-18 Thread Ralph Castain via users
Yeah, I'm surprised by that - they used to build --with-tm as it only activates 
if/when it finds itself in the appropriate environment. No harm in building the 
support, so the distros always built with all the RM components. No idea why 
this happened - you might mention it to them as I suspect it was an 
error/oversight.



> On Jan 18, 2022, at 3:05 PM, Crni Gorac via users  
> wrote:
> 
> Indeed, I realized in the meantime that changing the hostfile to:
> 
> node1 slots=1
> node2 slots=1
> 
> works as I expected.
> 
> Thanks once again for the clarification, got it now.  I'll see if we
> can live this way (the job submission scripts are mostly automatically
> generated from an auxiliary, site specific, shell script, and I can
> change this one to simply add "slots=1" to the hostfile generated by
> PBS, before passing it to mpirun), but it's a pity that tm support is
> not included in these pre-built OpenMPI installations.
> 
> On Tue, Jan 18, 2022 at 11:56 PM Ralph Castain via users
>  wrote:
>> 
>> Hostfile isn't being ignored - it is doing precisely what it is supposed to 
>> do (and is documented to do). The problem is that without tm support, we 
>> don't read the external allocation. So we use hostfile to identify the 
>> hosts, and then we discover the #slots on each host as being the #cores on 
>> that node.
>> 
>> In contrast, the -host option is doing what it is supposed to do - it 
>> assigns one slot for each mention of the hostname. You can increase the slot 
>> allocation using the colon qualifier - i.e., "-host node1:5" assigns 5 slots 
>> to node1.
>> 
>> If tm support is included, then we read the PBS allocation and see one slot 
>> on each node - and launch accordingly.
>> 
>> 
>>> On Jan 18, 2022, at 2:44 PM, Crni Gorac via users 
>>>  wrote:
>>> 
>>> OK, just checked and you're right: both processes get run on the first
>>> node.  So it seems that the "hostfile" option in mpirun, that in my
>>> case refers to a file properly listing two nodes, like:
>>> 
>>> node1
>>> node2
>>> 
>>> is ignored.
>>> 
>>> I also tried logging in to node1, and launching using mpirun directly,
>>> without PBS, and the same thing happens.  However, if I specify "host"
>>> options instead, then ranks get started on different nodes, and it all
>>> works properly.  Then I tried the same from within the PBS script, and
>>> it worked.
>>> 
>>> Thus, to summarize, instead of:
>>> mpirun -n 2 -hostfile $PBS_NODEFILE ./foo
>>> one should use:
>>> mpirun -n 2 --host node1,node2 ./foo
>>> 
>>> Rather strange, but it's important that it works somehow.  Thanks for your 
>>> help!
>>> 
>>> On Tue, Jan 18, 2022 at 10:54 PM Ralph Castain via users
>>>  wrote:
>>>> 
>>>> Are you launching the job with "mpirun"? I'm not familiar with that cmd 
>>>> line and don't know what it does.
>>>> 
>>>> Most likely explanation is that the mpirun from the prebuilt versions 
>>>> doesn't have TM support, and therefore doesn't understand the 1ppn 
>>>> directive in your cmd line. My guess is that you are using the ssh 
>>>> launcher - what is odd is that you should wind up with two procs on the 
>>>> first node, in which case those envars are correct. If you are seeing one 
>>>> proc on each node, then something is wrong.
>>>> 
>>>> 
>>>>> On Jan 18, 2022, at 1:33 PM, Crni Gorac via users 
>>>>>  wrote:
>>>>> 
>>>>> I have one process per node, here is corresponding line from my job
>>>>> submission script (with compute nodes named "node1" and "node2"):
>>>>> 
>>>>> #PBS -l 
>>>>> select=1:ncpus=1:mpiprocs=1:host=node1+1:ncpus=1:mpiprocs=1:host=node2
>>>>> 
>>>>> On Tue, Jan 18, 2022 at 10:20 PM Ralph Castain via users
>>>>>  wrote:
>>>>>> 
>>>>>> Afraid I can't understand your scenario - when you say you "submit a 
>>>>>> job" to run on two nodes, how many processes are you running on each 
>>>>>> node??
>>>>>> 
>>>>>> 
>>>>>>> On Jan 18, 2022, at 1:07 PM, Crni Gorac via users 
>>>&

Re: [OMPI users] OMPI_COMM_WORLD_LOCAL_SIZE problem between PBS and MLNX_OFED

2022-01-18 Thread Ralph Castain via users
Hostfile isn't being ignored - it is doing precisely what it is supposed to do 
(and is documented to do). The problem is that without tm support, we don't 
read the external allocation. So we use hostfile to identify the hosts, and 
then we discover the #slots on each host as being the #cores on that node.

In contrast, the -host option is doing what it is supposed to do - it assigns 
one slot for each mention of the hostname. You can increase the slot allocation 
using the colon qualifier - i.e., "-host node1:5" assigns 5 slots to node1.

If tm support is included, then we read the PBS allocation and see one slot on 
each node - and launch accordingly.


> On Jan 18, 2022, at 2:44 PM, Crni Gorac via users  
> wrote:
> 
> OK, just checked and you're right: both processes get run on the first
> node.  So it seems that the "hostfile" option in mpirun, that in my
> case refers to a file properly listing two nodes, like:
> 
> node1
> node2
> 
> is ignored.
> 
> I also tried logging in to node1, and launching using mpirun directly,
> without PBS, and the same thing happens.  However, if I specify "host"
> options instead, then ranks get started on different nodes, and it all
> works properly.  Then I tried the same from within the PBS script, and
> it worked.
> 
> Thus, to summarize, instead of:
> mpirun -n 2 -hostfile $PBS_NODEFILE ./foo
> one should use:
> mpirun -n 2 --host node1,node2 ./foo
> 
> Rather strange, but it's important that it works somehow.  Thanks for your 
> help!
> 
> On Tue, Jan 18, 2022 at 10:54 PM Ralph Castain via users
>  wrote:
>> 
>> Are you launching the job with "mpirun"? I'm not familiar with that cmd line 
>> and don't know what it does.
>> 
>> Most likely explanation is that the mpirun from the prebuilt versions 
>> doesn't have TM support, and therefore doesn't understand the 1ppn directive 
>> in your cmd line. My guess is that you are using the ssh launcher - what is 
>> odd is that you should wind up with two procs on the first node, in which 
>> case those envars are correct. If you are seeing one proc on each node, then 
>> something is wrong.
>> 
>> 
>>> On Jan 18, 2022, at 1:33 PM, Crni Gorac via users 
>>>  wrote:
>>> 
>>> I have one process per node, here is corresponding line from my job
>>> submission script (with compute nodes named "node1" and "node2"):
>>> 
>>> #PBS -l 
>>> select=1:ncpus=1:mpiprocs=1:host=node1+1:ncpus=1:mpiprocs=1:host=node2
>>> 
>>> On Tue, Jan 18, 2022 at 10:20 PM Ralph Castain via users
>>>  wrote:
>>>> 
>>>> Afraid I can't understand your scenario - when you say you "submit a job" 
>>>> to run on two nodes, how many processes are you running on each node??
>>>> 
>>>> 
>>>>> On Jan 18, 2022, at 1:07 PM, Crni Gorac via users 
>>>>>  wrote:
>>>>> 
>>>>> Using OpenMPI 4.1.2 from MLNX_OFED_LINUX-5.5-1.0.3.2 distribution, and
>>>>> have PBS 18.1.4 installed on my cluster (cluster nodes are running
>>>>> CentOS 7.9).  When I try to submit a job that will run on two nodes in
>>>>> the cluster, both ranks get OMPI_COMM_WORLD_LOCAL_SIZE set to 2,
>>>>> instead of 1, and OMPI_COMM_WORLD_LOCAL_RANK are set to 0 and 1,
>>>>> instead of both being 0.  At the same time, the hostfile generated by
>>>>> PBS ($PBS_NODEFILE) properly contains two nodes listed.
>>>>> 
>>>>> I've tried with OpenMPI 3 from HPC-X, and the same thing happens too.
>>>>> However, when I build OpenMPI myself (notable difference from above
>>>>> mentioned pre-built MPI versions is that I use "--with-tm" option to
>>>>> point to my PBS installation), then OMPI_COMM_WORLD_LOCAL_SIZE and
>>>>> OMPI_COMM_WORLD_LOCAL_RANK are set properly.
>>>>> 
>>>>> I'm not sure how to debug the problem, and whether it is possible to
>>>>> fix it at all with a pre-built OpenMPI version, so any suggestion is
>>>>> welcome.
>>>>> 
>>>>> Thanks.
>>>> 
>>>> 
>> 
>> 




Re: [OMPI users] OMPI_COMM_WORLD_LOCAL_SIZE problem between PBS and MLNX_OFED

2022-01-18 Thread Ralph Castain via users
Are you launching the job with "mpirun"? I'm not familiar with that cmd line 
and don't know what it does.

Most likely explanation is that the mpirun from the prebuilt versions doesn't 
have TM support, and therefore doesn't understand the 1ppn directive in your 
cmd line. My guess is that you are using the ssh launcher - what is odd is that 
you should wind up with two procs on the first node, in which case those envars 
are correct. If you are seeing one proc on each node, then something is wrong.


> On Jan 18, 2022, at 1:33 PM, Crni Gorac via users  
> wrote:
> 
> I have one process per node, here is corresponding line from my job
> submission script (with compute nodes named "node1" and "node2"):
> 
> #PBS -l select=1:ncpus=1:mpiprocs=1:host=node1+1:ncpus=1:mpiprocs=1:host=node2
> 
> On Tue, Jan 18, 2022 at 10:20 PM Ralph Castain via users
>  wrote:
>> 
>> Afraid I can't understand your scenario - when you say you "submit a job" to 
>> run on two nodes, how many processes are you running on each node??
>> 
>> 
>>> On Jan 18, 2022, at 1:07 PM, Crni Gorac via users 
>>>  wrote:
>>> 
>>> Using OpenMPI 4.1.2 from MLNX_OFED_LINUX-5.5-1.0.3.2 distribution, and
>>> have PBS 18.1.4 installed on my cluster (cluster nodes are running
>>> CentOS 7.9).  When I try to submit a job that will run on two nodes in
>>> the cluster, both ranks get OMPI_COMM_WORLD_LOCAL_SIZE set to 2,
>>> instead of 1, and OMPI_COMM_WORLD_LOCAL_RANK are set to 0 and 1,
>>> instead of both being 0.  At the same time, the hostfile generated by
>>> PBS ($PBS_NODEFILE) properly contains two nodes listed.
>>> 
>>> I've tried with OpenMPI 3 from HPC-X, and the same thing happens too.
>>> However, when I build OpenMPI myself (notable difference from above
>>> mentioned pre-built MPI versions is that I use "--with-tm" option to
>>> point to my PBS installation), then OMPI_COMM_WORLD_LOCAL_SIZE and
>>> OMPI_COMM_WORLD_LOCAL_RANK are set properly.
>>> 
>>> I'm not sure how to debug the problem, and whether it is possible to
>>> fix it at all with a pre-built OpenMPI version, so any suggestion is
>>> welcome.
>>> 
>>> Thanks.
>> 
>> 




Re: [OMPI users] OMPI_COMM_WORLD_LOCAL_SIZE problem between PBS and MLNX_OFED

2022-01-18 Thread Ralph Castain via users
Afraid I can't understand your scenario - when you say you "submit a job" to 
run on two nodes, how many processes are you running on each node??


> On Jan 18, 2022, at 1:07 PM, Crni Gorac via users  
> wrote:
> 
> Using OpenMPI 4.1.2 from MLNX_OFED_LINUX-5.5-1.0.3.2 distribution, and
> have PBS 18.1.4 installed on my cluster (cluster nodes are running
> CentOS 7.9).  When I try to submit a job that will run on two nodes in
> the cluster, both ranks get OMPI_COMM_WORLD_LOCAL_SIZE set to 2,
> instead of 1, and OMPI_COMM_WORLD_LOCAL_RANK are set to 0 and 1,
> instead of both being 0.  At the same time, the hostfile generated by
> PBS ($PBS_NODEFILE) properly contains two nodes listed.
> 
> I've tried with OpenMPI 3 from HPC-X, and the same thing happens too.
> However, when I build OpenMPI myself (notable difference from above
> mentioned pre-built MPI versions is that I use "--with-tm" option to
> point to my PBS installation), then OMPI_COMM_WORLD_LOCAL_SIZE and
> OMPI_COMM_WORLD_LOCAL_RANK are set properly.
> 
> I'm not sure how to debug the problem, and whether it is possible to
> fix it at all with a pre-built OpenMPI version, so any suggestion is
> welcome.
> 
> Thanks.
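
For reference, a minimal test program (a sketch, not from the original messages) makes it 
easy to see what the launcher actually exported to each rank; with working tm support and 
one process per node, every rank should report local_rank=0 and local_size=1:

/* Sketch only: print the MPI rank plus the OMPI_COMM_WORLD_LOCAL_*
 * environment variables discussed above. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char host[MPI_MAX_PROCESSOR_NAME];
    const char *lrank = getenv("OMPI_COMM_WORLD_LOCAL_RANK");
    const char *lsize = getenv("OMPI_COMM_WORLD_LOCAL_SIZE");

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &len);
    printf("rank %d/%d on %s: local_rank=%s local_size=%s\n",
           rank, size, host,
           lrank ? lrank : "(unset)", lsize ? lsize : "(unset)");
    MPI_Finalize();
    return 0;
}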




Re: [OMPI users] stdout scrambled in file

2021-12-18 Thread Ralph Castain via users
FWIW: this has been "fixed" in PMIx/PRRTE and should make it into OMPI v5 if 
the OMPI community accepts it. The default behavior has been changed to output 
a full line-at-a-time so that the output from different ranks doesn't get mixed 
together. The negative to this, of course, is that we now internally buffer 
output until we see a newline character or the process terminates.

Since some applications really do need immediate output (and/or may not have 
newline characters in the output), I added a "raw" option to the "output" 
directive that matches the old behavior - i.e., any output from a proc is 
immediately staged for writing out regardless of whether or not it has a 
newline.

Ralph


> On Dec 7, 2021, at 6:05 AM, Jeff Squyres (jsquyres) via users 
>  wrote:
> 
> Open MPI launches a single "helper" process on each node (in Open MPI <= 
> v4.x, that helper process is called "orted").  This process is responsible 
> for launching all the individual MPI processes, and it's also responsible for 
> capturing all the stdout/stderr from those processes and sending it back to 
> mpirun via an out-of-band network message protocol (using TCP sockets).  
> mpirun accepts those network messages and emits them to mpirun's 
> stdout/stderr.
> 
> There's multiple places in that pipeline where messages can get fragmented, 
> and therefore emitted as incomplete lines (OS stdout/stderr buffering, 
> network MTU size, TCP buffering, etc.).
> 
> This is mainly because we have always assumed that stdout/stderr is not the 
> primary work output of an MPI application.  We've seen many MPI applications 
> either write their results to stable files or send the results back to a 
> single MPI process, who then gathers and emits them (i.e., if there's only 
> stdout/stderr coming from a single MPI process, the output won't get 
> interleaved with anything else).
> 
> --
> Jeff Squyres
> jsquy...@cisco.com
> 
> 
> From: users  on behalf of Fisher (US), Mark 
> S via users 
> Sent: Monday, December 6, 2021 3:45 PM
> To: Joachim Protze; Open MPI Users
> Cc: Fisher (US), Mark S
> Subject: Re: [OMPI users] stdout scrambled in file
> 
> This usually happens if we get a number of warning messages from multiple 
> processes. Seems like unbuffered is what we want, but I'm not sure how this 
> interacts with MPI since stdout/stderr is pulled back from different hosts. 
> Not sure how you are doing that.
> 
> -Original Message-
> From: Joachim Protze 
> Sent: Monday, December 06, 2021 11:12 AM
> To: Fisher (US), Mark S ; Open MPI Users 
> 
> Subject: Re: [OMPI users] stdout scrambled in file
> 
> I would assume, that the buffering mode is compiler/runtime specific. At
> least for Intel compiler, the default seems to be/have been unbuffered
> for stdout, but there is a flag for buffered output:
> 
> https://community.intel.com/t5/Intel-Fortran-Compiler/Enabling-buffered-I-O-to-stdout-with-Intel-ifort-compiler/td-p/993203
> 
> In the worst case, each character might be written individually. If the
> scrambling only happens from time to time, I guess you really just see
> the buffer flush when the buffer filled up.
> 
> - Joachim
> 
> Am 06.12.21 um 16:42 schrieb Fisher (US), Mark S:
>> All strings are written as one output, so that is not the issue. Adding in 
>> some flushing is a good idea and we can try that. We do not open stdout, just 
>> write to unit 6, but we could open it if there is some unbuffered option 
>> that could help. I will look into that also.  Thanks!
>> 
>> -Original Message-
>> From: Joachim Protze 
>> Sent: Monday, December 6, 2021 9:24 AM
>> To: Open MPI Users 
>> Cc: Fisher (US), Mark S 
>> Subject: Re: [OMPI users] stdout scrambled in file
>> 
>> Hi Mark,
>> 
>> "[...] MPI makes neither requirements nor recommendations for the output
>> [...]" (MPI4.0, §2.9.1)
>> 
>>  From my experience, an application can avoid such scrambling (still no
>> guarantee), if the output of lines is written atomically. C++ streams
>> are worst for concurrent output, as every stream operator writes a
>> chunk. It can help to collect output into a stringstream and print out
>> at once. Using printf in C is typically least problematic. Flushing the
>> buffer (fflush) helps to prevent the output buffer from filling up and being
>> flushed in the middle of printing.
>> 
>> I'm not the Fortran expert. But, I think there are some options to
>> change to a buffered output mode (at least I found such options for file
>> I/O). Again, the goal should be that a write statement is printed at
>> once and the buffer doesn't fill up while printing.
>> 
>> In any case, it could help to write warnings to stderr and separate the
>> stdout and stderr streams.
>> 
>> Best
>> Joachim
>> 
>> Am 02.12.21 um 16:48 schrieb Fisher (US), Mark S via users:
>>> We are using Mellanox HPC-X MPI based on OpenMPI 4.1.1RC1 and having
>>> issues with lines scrambling together occasionally. This causes issues
>>> our converg

Re: [OMPI users] stdout scrambled in file

2021-12-05 Thread Ralph Castain via users
There are several output-controlling options - e.g., you could redirect the 
output from each process to its own file or directory.

However, it makes little sense to me for someone to write convergence data into 
a file and then parse it. Typically, convergence data results from all procs 
reaching the end of a computational epoch or cycle - i.e., you need all the 
procs to reach the same point. So why not just have the procs report their 
convergence data to rank=0 using an MPI_Gather collective, and then have that 
proc output whatever info you want to see?

You would then no longer be dependent on some implementation-specific "mpirun" 
cmd line option, so you could run the same code using srun, aprun, prun, or 
mpirun and get the exact same output.

Am I missing something?
Ralph
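
As an illustration of that suggestion, here is a minimal sketch (the function name and the
single-double payload per rank are assumptions, not code from the thread):

/* Sketch: gather one convergence value per rank to rank 0 and let
 * rank 0 do all the printing, so lines cannot interleave. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

void report_convergence(double local_residual, MPI_Comm comm)
{
    int rank, size;
    double *all = NULL;

    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    if (rank == 0)
        all = malloc(size * sizeof(double));

    MPI_Gather(&local_residual, 1, MPI_DOUBLE,
               all, 1, MPI_DOUBLE, 0, comm);

    if (rank == 0) {
        for (int i = 0; i < size; i++)
            printf("rank %d residual %e\n", i, all[i]);
        fflush(stdout);
        free(all);
    }
}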


On Dec 5, 2021, at 12:19 PM, Gus Correa via users mailto:users@lists.open-mpi.org> > wrote:

Hi Mark

Back in the days I liked the mpirun/mpiexec --tag-output option.
Jeff: Does it still exist?
It may not prevent 100% the splitting of output lines,
but tagging the lines with the process rank helps.
You can grep the stdout log for the rank that you want,
which helps a lot when several processes are talking.

I hope this helps,
Gus Correa


On Sun, Dec 5, 2021 at 1:12 PM Jeff Squyres (jsquyres) via users 
mailto:users@lists.open-mpi.org> > wrote:
FWIW: Open MPI 4.1.2 has been released -- you can probably stop using an RC 
release.

I think you're probably running into an issue that is just a fact of life.  
Especially when there's a lot of output simultaneously from multiple MPI 
processes (potentially on different nodes), the stdout/stderr lines can just 
get munged together.

Can you check for convergence a different way?

--
Jeff Squyres
jsquy...@cisco.com  


From: users mailto:users-boun...@lists.open-mpi.org> > on behalf of Fisher (US), Mark S 
via users mailto:users@lists.open-mpi.org> >
Sent: Thursday, December 2, 2021 10:48 AM
To: users@lists.open-mpi.org  
Cc: Fisher (US), Mark S
Subject: [OMPI users] stdout scrambled in file

We are using Mellanox HPC-X MPI based on OpenMPI 4.1.1RC1 and having issues 
with lines occasionally scrambling together. This causes problems for our 
convergence-checking code, since we put convergence data there. We are not using 
any mpirun options for stdout; we just redirect stdout/stderr to a file before 
we run the mpirun command, so all output goes there. We had a similar issue with 
Intel MPI in the past and used -ordered-output to fix it, but I do not see any 
similar option for OpenMPI. See the example below. Is there any way to ensure a 
line from a process comes out as one line in the output file?


The data in red below is scrambled up and should look like the cleaned-up 
version. You can see it put a line from a different process inside a line from 
another processes and the rest of the line ended up a couple of lines down.

ZONE   0 : Min/Max CFL= 5.000E-01 1.500E+01 Min/Max DT= 8.411E-10 1.004E-01 sec

*IGSTAB* 1626 7.392E-02 2.470E-01 -9.075E-04 8.607E-03 -5.911E-04 -4.945E-06  
aerosurfs
*IGMNTAERO* 1626 -6.120E-04 1.406E-02 6.395E-04 4.473E-08 3.112E-04 -2.785E-05  
aerosurfs
*IGSTAB* 1626 7.392E-02 2.470E-01 -9.075E-04 8.607E-03 -5.911E-04 -4.945E-06  
Aircraft-Total
*IGMNTAERO* 1626 -6.120E-04 1.406E-02 6.395E-04 4.473E-08 3.112E-04 -2.785E-05  
Aircr Warning: BCFD: US_UPDATEQ: izon, iter, nBadpmin:  699  1625     12
Warning: BCFD: US_UPDATEQ: izon, iter, nBadpmin:  111  1626      6
aft-Total
*IGSTAB* 1626 6.623E-02 2.137E-01 -9.063E-04 8.450E-03 -5.485E-04 -4.961E-06  
Aircraft-OML
*IGMNTAERO* 1626 -6.118E-04 -1.602E-02 6.404E-04 5.756E-08 3.341E-04 -2.791E-05 
 Aircraft-OML


Cleaned up version:

ZONE   0 : Min/Max CFL= 5.000E-01 1.500E+01 Min/Max DT= 8.411E-10 1.004E-01 sec

*IGSTAB* 1626 7.392E-02 2.470E-01 -9.075E-04 8.607E-03 -5.911E-04 -4.945E-06  
aerosurfs
*IGMNTAERO* 1626 -6.120E-04 1.406E-02 6.395E-04 4.473E-08 3.112E-04 -2.785E-05  
aerosurfs
*IGSTAB* 1626 7.392E-02 2.470E-01 -9.075E-04 8.607E-03 -5.911E-04 -4.945E-06  
Aircraft-Total
*IGMNTAERO* 1626 -6.120E-04 1.406E-02 6.395E-04 4.473E-08 3.112E-04 -2.785E-05  
Aircraft-Total
 Warning: BCFD: US_UPDATEQ: izon, iter, nBadpmin:  699  1625     12
Warning: BCFD: US_UPDATEQ: izon, iter, nBadpmin:  111  1626      6
*IGSTAB* 1626 6.623E-02 2.137E-01 -9.063E-04 8.450E-03 -5.485E-04 -4.961E-06  
Aircraft-OML
*IGMNTAERO* 1626 -6.118E-04 -1.602E-02 6.404E-04 5.756E-08 3.341E-04 -2.791E-05 
 Aircraft-OML

Thanks!
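
If per-rank output is unavoidable, a sketch of the line-at-a-time approach suggested 
earlier in the thread is below (format the whole line first, then emit and flush it in 
one call; this reduces, but cannot guarantee to prevent, interleaving):

/* Sketch only: build the complete line in a buffer and write it with a
 * single fputs() followed by fflush(). The tag and fields are just an
 * example modeled on the output above. */
#include <stdio.h>

void log_igstab_line(int iter, double v1, double v2, const char *surface)
{
    char line[256];
    snprintf(line, sizeof(line), "*IGSTAB* %d %e %e  %s\n",
             iter, v1, v2, surface);
    fputs(line, stdout);
    fflush(stdout);
}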



Re: [OMPI users] Reserving slots and filling them after job launch with MPI_Comm_spawn

2021-11-05 Thread Ralph Castain via users
Hmmm...yeah, there's a bug in there. I'm afraid you just need to give it a 
number for now - something large enough to meet your needs.

You could do it pretty much any way you like - what you have is fine (minus the 
host key problem) since you only specify one node. Since you are already 
telling us to spawn only one process in the MPI_Comm_spawn call itself, you 
don't need the "map-by" key at all - just tell us the host you want it on and 
we are good.

In the end, it doesn't really matter - will do the same thing.
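
Putting those pieces together, a sketch of the adjusted spawn call follows (variable names 
follow the snippet quoted below; the slot count of 9 matches the ppn=9 allocation mentioned 
elsewhere in the thread and is otherwise an assumption):

/* Sketch only: give the "host" info key an explicit slot count instead
 * of the bare "n022" (one slot) or "n022:*" (which hit the bug noted
 * above). The "map-by" key is dropped since only one process is spawned. */
#include <mpi.h>

void spawn_manager(const char *manager_cmd, char **argv, MPI_Comm *intercom)
{
    MPI_Info info;
    int error_codes[1];

    MPI_Info_create(&info);
    MPI_Info_set(info, "host", "n022:9");   /* 9 = slots reserved on the node */

    MPI_Comm_spawn(manager_cmd, argv, 1, info, 0,
                   MPI_COMM_SELF, intercom, error_codes);
    MPI_Info_free(&info);
}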


On Nov 5, 2021, at 8:45 AM, Mccall, Kurt E. (MSFC-EV41) mailto:kurt.e.mcc...@nasa.gov> > wrote:

Ralph,
 I changed the host name to n022:* and the problem persisted.   Here is my C++ 
code (modified slightly; the host name is not really hard-coded as it is 
below).   I thought I needed "ppr:1:node" to spawn a single process, but maybe 
that is wrong.
    char info_str[64];
    sprintf(info_str, "ppr:%d:node", 1);
    MPI_Info_create(&info);
    MPI_Info_set(info, "host", "n022:*");
    MPI_Info_set(info, "map-by", info_str);
    MPI_Comm_spawn(manager_cmd_.c_str(), argv_, 1, info, rank_,
                   MPI_COMM_SELF, &intercom, error_codes);
 From: users mailto:users-boun...@lists.open-mpi.org> > On Behalf Of Ralph Castain via users
Sent: Friday, November 5, 2021 9:50 AM
To: Open MPI Users mailto:users@lists.open-mpi.org> >
Cc: Ralph Castain mailto:r...@open-mpi.org> >
Subject: [EXTERNAL] Re: [OMPI users] Reserving slots and filling them after job 
launch with MPI_Comm_spawn
 Here is the problem:
 [n022.cluster.com:30045] [[36230,0],0] using dash_host n022
[n022.cluster.com:30045] [[36230,0],0] Removing node n022 slots 1 inuse 1
--
All nodes which are allocated for this job are already filled.
--
 Looks like your program is passing a "dash-host" MPI info key to the 
Comm_spawn request and listing host "n022". This translates into assigning only 
one slot to that host, which indeed has already been filled. If you want to 
tell OMPI to use that host with _all_ slots available, then you need to change 
that "dash-host" info to be "n022:*", or replace the asterisk with the number 
of procs you want to allow on that node.
 

On Nov 5, 2021, at 7:37 AM, Mccall, Kurt E. (MSFC-EV41) mailto:kurt.e.mcc...@nasa.gov> > wrote:
 Ralph,
 I configured my build with --enable-debug and added "--mca rmaps_base_verbose 
5" to the mpiexec command line.   I have attached the job output.   Thanks for 
being willing to look at this problem.
 My complete configure command is as follows:
 $ ./configure --enable-shared --enable-static --with-tm=/opt/torque 
--enable-mpi-cxx --enable-cxx-exceptions --disable-wrapper-runpath 
--prefix=/opt/openmpi_pgc_tm  CC=nvc CXX=nvc++ FC=pgfortran CPP=cpp CFLAGS="-O0 
-tp p7-64 -c99" CXXFLAGS="-O0 -tp p7-64" FCFLAGS="-O0 -tp p7-64" --enable-debug 
--enable-memchecker --with-valgrind=/home/kmccall/valgrind_install
 The nvc++ version is “nvc++ 20.9-0 LLVM 64-bit target on x86-64 Linux -tp 
haswell".
 Our OS is CentOS 7.
 Here is my mpiexec command, minus all of the trailing arguments that don’t 
affect mpiexec.
 mpiexec --enable-recovery \
    --mca rmaps_base_verbose 5 \
    --display-allocation \
    --merge-stderr-to-stdout \
    --mca mpi_param_check 1 \
    --v \
    --x DISPLAY \
    --map-by node \
    -np 21  \
    -wdir ${work_dir}  …
 Here is my qsub command for the program “Needles”.

 qsub -V -j oe -e $tmpdir_stdio -o $tmpdir_stdio -f -X -N Needles -l 
nodes=21:ppn=9  RunNeedles.bash;
  From: users mailto:users-boun...@lists.open-mpi.org> > On Behalf Of Ralph Castain via users
Sent: Wednesday, November 3, 2021 11:58 AM
To: Open MPI Users mailto:users@lists.open-mpi.org> >
Cc: Ralph Castain mailto:r...@open-mpi.org> >
Subject: [EXTERNAL] Re: [OMPI users] Reservin

Re: [OMPI users] Reserving slots and filling them after job launch with MPI_Comm_spawn

2021-11-05 Thread Ralph Castain via users
Here is the problem:

[n022.cluster.com:30045 <http://n022.cluster.com:30045> ] [[36230,0],0] using 
dash_host n022
[n022.cluster.com:30045 <http://n022.cluster.com:30045> ] [[36230,0],0] 
Removing node n022 slots 1 inuse 1
--
All nodes which are allocated for this job are already filled.
--

Looks like your program is passing a "dash-host" MPI info key to the Comm_spawn 
request and listing host "n022". This translates into assigning only one slot 
to that host, which indeed has already been filled. If you want to tell OMPI to 
use that host with _all_ slots available, then you need to change that 
"dash-host" info to be "n022:*", or replace the asterisk with the number of 
procs you want to allow on that node.


On Nov 5, 2021, at 7:37 AM, Mccall, Kurt E. (MSFC-EV41) mailto:kurt.e.mcc...@nasa.gov> > wrote:

Ralph,
 I configured my build with --enable-debug and added "--mca rmaps_base_verbose 
5" to the mpiexec command line.   I have attached the job output.   Thanks for 
being willing to look at this problem.
 My complete configure command is as follows:
 $ ./configure --enable-shared --enable-static --with-tm=/opt/torque 
--enable-mpi-cxx --enable-cxx-exceptions --disable-wrapper-runpath 
--prefix=/opt/openmpi_pgc_tm  CC=nvc CXX=nvc++ FC=pgfortran CPP=cpp CFLAGS="-O0 
-tp p7-64 -c99" CXXFLAGS="-O0 -tp p7-64" FCFLAGS="-O0 -tp p7-64" --enable-debug 
--enable-memchecker --with-valgrind=/home/kmccall/valgrind_install
 The nvc++ version is “nvc++ 20.9-0 LLVM 64-bit target on x86-64 Linux -tp 
haswell".
 Our OS is CentOS 7.
 Here is my mpiexec command, minus all of the trailing arguments that don’t 
affect mpiexec.
 mpiexec --enable-recovery \
    --mca rmaps_base_verbose 5 \
    --display-allocation \
    --merge-stderr-to-stdout \
    --mca mpi_param_check 1 \
    --v \
    --x DISPLAY \
    --map-by node \
    -np 21  \
    -wdir ${work_dir}  …
 Here is my qsub command for the program “Needles”.

 qsub -V -j oe -e $tmpdir_stdio -o $tmpdir_stdio -f -X -N Needles -l 
nodes=21:ppn=9  RunNeedles.bash;
  From: users mailto:users-boun...@lists.open-mpi.org> > On Behalf Of Ralph Castain via users
Sent: Wednesday, November 3, 2021 11:58 AM
To: Open MPI Users mailto:users@lists.open-mpi.org> >
Cc: Ralph Castain mailto:r...@open-mpi.org> >
Subject: [EXTERNAL] Re: [OMPI users] Reserving slots and filling them after job 
launch with MPI_Comm_spawn
 Could you please ensure it was configured with --enable-debug and then add 
"--mca rmaps_base_verbose 5" to the mpirun cmd line?
 

On Nov 3, 2021, at 9:10 AM, Mccall, Kurt E. (MSFC-EV41) via users 
mailto:users@lists.open-mpi.org> > wrote:
 Gilles and Ralph,
 I did build with -with-tm.   I tried Gilles workaround but the failure still 
occurred.    What do I need to provide you so that you can investigate this 
possible bug?
 Thanks,
Kurt
 From: users mailto:users-boun...@lists.open-mpi.org> > On Behalf Of Ralph Castain via users
Sent: Wednesday, November 3, 2021 8:45 AM
To: Open MPI Users mailto:users@lists.open-mpi.org> >
Cc: Ralph Castain mailto:r...@open-mpi.org> >
Subject: [EXTERNAL] Re: [OMPI users] Reserving slots and filling them after job 
launch with MPI_Comm_spawn
 Sounds like a bug to me - regardless of configuration, if the hostfile 
contains an entry for each slot on a node, OMPI should have added those up.
 


On Nov 3, 2021, at 2:49 AM, Gilles Gouaillardet via users 
mailto:users@lists.open-mpi.org> > wrote:
 Kurt,
 Assuming you built Open MPI with tm support (default if tm is detected at 
configure time, but you can configure --with-tm to have it abort if tm support 
is not found), you should not need to use a hostfile.
 As a workaround, I would suggest you try to
mpirun --map-by node -np 21 ...
  Cheers,
 Gilles
 On Wed, Nov 3, 2021 at 6:06 PM Mccall, Kurt E. (MSFC-EV41) via users 
mailto:users@lists.open-mpi.org> > wrote:
I’m using OpenMPI 4.1.1 compiled with Nvidia’s nvc++ 20.9, and compiled with 
Torque support.
 I want to reserve multiple slots on each node, and then launch a single 
manager process on each node.   The remaining slots would be filled up as the 
manager spawns new processes with MPI_Comm_spawn on its local node.
 Here is the abbreviated mpiexec command, which I assume is the source of the 
problem described below (?).   The hostfile was created by Torque and it 
contains many repeated node names, one for each slot that it reserved.   
 $ mpiexec --hostfile  MyHostFile  -np 21 -npernode 1  (etc.)
  When MPI_Comm_spawn is called, MPI is reporting that “All nodes which are 
allocated for this job are already filled."   They don’t appear to be filled as 
it also reports that only one slot 

Re: [OMPI users] Reserving slots and filling them after job launch with MPI_Comm_spawn

2021-11-03 Thread Ralph Castain via users
Could you please ensure it was configured with --enable-debug and then add 
"--mca rmaps_base_verbose 5" to the mpirun cmd line?


On Nov 3, 2021, at 9:10 AM, Mccall, Kurt E. (MSFC-EV41) via users 
mailto:users@lists.open-mpi.org> > wrote:

Gilles and Ralph,
 I did build with -with-tm.   I tried Gilles workaround but the failure still 
occurred.    What do I need to provide you so that you can investigate this 
possible bug?
 Thanks,
Kurt
 From: users mailto:users-boun...@lists.open-mpi.org> > On Behalf Of Ralph Castain via users
Sent: Wednesday, November 3, 2021 8:45 AM
To: Open MPI Users mailto:users@lists.open-mpi.org> >
Cc: Ralph Castain mailto:r...@open-mpi.org> >
Subject: [EXTERNAL] Re: [OMPI users] Reserving slots and filling them after job 
launch with MPI_Comm_spawn
 Sounds like a bug to me - regardless of configuration, if the hostfile 
contains an entry for each slot on a node, OMPI should have added those up.
 

On Nov 3, 2021, at 2:49 AM, Gilles Gouaillardet via users 
mailto:users@lists.open-mpi.org> > wrote:
 Kurt,
 Assuming you built Open MPI with tm support (default if tm is detected at 
configure time, but you can configure --with-tm to have it abort if tm support 
is not found), you should not need to use a hostfile.
 As a workaround, I would suggest you try to
mpirun --map-by node -np 21 ...
  Cheers,
 Gilles
 On Wed, Nov 3, 2021 at 6:06 PM Mccall, Kurt E. (MSFC-EV41) via users 
mailto:users@lists.open-mpi.org> > wrote:
I’m using OpenMPI 4.1.1 compiled with Nvidia’s nvc++ 20.9, and compiled with 
Torque support.
 I want to reserve multiple slots on each node, and then launch a single 
manager process on each node.   The remaining slots would be filled up as the 
manager spawns new processes with MPI_Comm_spawn on its local node.
 Here is the abbreviated mpiexec command, which I assume is the source of the 
problem described below (?).   The hostfile was created by Torque and it 
contains many repeated node names, one for each slot that it reserved.   
 $ mpiexec --hostfile  MyHostFile  -np 21 -npernode 1  (etc.)
  When MPI_Comm_spawn is called, MPI is reporting that “All nodes which are 
allocated for this job are already filled."   They don’t appear to be filled as 
it also reports that only one slot is in use for each node:
 ==   ALLOCATED NODES   ==
    n022: flags=0x11 slots=9 max_slots=0 slots_inuse=1 state=UP
    n021: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
    n020: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
    n018: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
    n017: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
    n016: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
    n015: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
    n014: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
    n013: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
    n012: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
    n011: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
    n010: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
    n009: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
    n008: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
    n007: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
    n006: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
    n005: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
    n004: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
    n003: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
    n002: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
    n001: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
 Do you have any idea what I am doing wrong?   My Torque qsub arguments are 
unchanged from when I successfully launched this kind of job structure under 
MPICH.   The relevant argument to qsub is the resource list, which is “-l  
nodes=21:ppn=9”.



Re: [OMPI users] Reserving slots and filling them after job launch with MPI_Comm_spawn

2021-11-03 Thread Ralph Castain via users
Sounds like a bug to me - regardless of configuration, if the hostfile contains 
an entry for each slot on a node, OMPI should have added those up.


On Nov 3, 2021, at 2:49 AM, Gilles Gouaillardet via users 
mailto:users@lists.open-mpi.org> > wrote:

Kurt,

Assuming you built Open MPI with tm support (default if tm is detected at 
configure time, but you can configure --with-tm to have it abort if tm support 
is not found), you should not need to use a hostfile.

As a workaround, I would suggest you try to
mpirun --map-by node -np 21 ...


Cheers,

Gilles

On Wed, Nov 3, 2021 at 6:06 PM Mccall, Kurt E. (MSFC-EV41) via users 
mailto:users@lists.open-mpi.org> > wrote:
I’m using OpenMPI 4.1.1 compiled with Nvidia’s nvc++ 20.9, and compiled with 
Torque support.

 
I want to reserve multiple slots on each node, and then launch a single manager 
process on each node.   The remaining slots would be filled up as the manager 
spawns new processes with MPI_Comm_spawn on its local node.

 
Here is the abbreviated mpiexec command, which I assume is the source of the 
problem described below (?).   The hostfile was created by Torque and it 
contains many repeated node names, one for each slot that it reserved.  

 
$ mpiexec --hostfile  MyHostFile  -np 21 -npernode 1  (etc.)

 
 
When MPI_Comm_spawn is called, MPI is reporting that “All nodes which are 
allocated for this job are already filled."   They don’t appear to be filled as 
it also reports that only one slot is in use for each node:

 
==   ALLOCATED NODES   ==

    n022: flags=0x11 slots=9 max_slots=0 slots_inuse=1 state=UP

    n021: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP

    n020: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP

    n018: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP

    n017: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP

    n016: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP

    n015: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP

    n014: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP

    n013: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP

    n012: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP

    n011: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP

    n010: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP

    n009: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP

    n008: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP

    n007: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP

    n006: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP

    n005: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP

    n004: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP

    n003: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP

    n002: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP

    n001: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP

 
Do you have any idea what I am doing wrong?   My Torque qsub arguments are 
unchanged from when I successfully launched this kind of job structure under 
MPICH.   The relevant argument to qsub is the resource list, which is “-l  
nodes=21:ppn=9”.

 



Re: [OMPI users] [External] Re: cpu binding of mpirun to follow slurm setting

2021-10-11 Thread Ralph Castain via users
mpirun reads the allocation from the environment - there is no need to create a 
hostfile.


> On Oct 11, 2021, at 11:12 AM, Sheppard, Raymond W  wrote:
> 
> Hi,
>  Personally, I have had trouble with Slurm not wanting to give mpirun a 
> hostfile to work with.  How do you get around that?  Thanks.
>Ray
> 
> 
> From: users  on behalf of Ralph Castain via 
> users 
> Sent: Monday, October 11, 2021 1:49 PM
> To: Open MPI Users
> Cc: Ralph Castain
> Subject: Re: [OMPI users] [External] Re: cpu binding of mpirun to follow 
> slurm setting
> 
> Oh my - that is a pretty strong statement. It depends on what you are trying 
> to do, and whether or not Slurm offers a mapping pattern that matches. mpirun 
> tends to have a broader range of options, which is why many people use it. It 
> also means that your job script is portable and not locked to a specific RM, 
> which is important to quite a few users.
> 
> However, if Slurm has something you can use/like and you don't need to worry 
> about portability, then by all means one should use it.
> 
> Just don't assume that everyone fits in that box :-)
> 
> 
> On Oct 11, 2021, at 10:40 AM, Chang Liu via users 
> mailto:users@lists.open-mpi.org>> wrote:
> 
> OK thank you. Seems that srun is a better option for normal users.
> 
> Chang
> 
> On 10/11/21 1:23 PM, Ralph Castain via users wrote:
> Sorry, your output wasn't clear about cores vs hwthreads. Apparently, your 
> Slurm config is setup to use hwthreads as independent cpus - what you are 
> calling "logical cores", which is a little confusing.
> No, mpirun has no knowledge of what mapping pattern you passed to salloc. We 
> don't have any good way of obtaining config information, for one thing - 
> e.g., that Slurm is treating hwthreads as cpus. So we can't really interpret 
> what they might have done.
> Given this clarification, you can probably get what you want with:
> mpirun --use-hwthread-cpus --map-by hwthread:pe=2 ..."
> On Oct 11, 2021, at 7:35 AM, Chang Liu via users 
> <users@lists.open-mpi.org> wrote:
> 
> This is not what I need. The cpu can run 4 threads per core, so "--bind-to 
> core" results in one process occupying 4 logical cores.
> 
> I want one process to occupy 2 logical cores, so two processes sharing a 
> physical core.
> 
> I guess there is a way to do that by playing with mapping. I just want to 
> know if this is a bug in mpirun, or this feature for interacting with slurm 
> was never implemented.
> 
> Chang
> 
> On 10/11/21 10:07 AM, Ralph Castain via users wrote:
> You just need to tell mpirun that you want your procs to be bound to cores, 
> not socket (which is the default).
> Add "--bind-to core" to your mpirun cmd line
> On Oct 10, 2021, at 11:17 PM, Chang Liu via users 
> <users@lists.open-mpi.org> wrote:
> 
> Yes they are. This is an interactive job from
> 
> salloc -N 1 --ntasks-per-node=64 --cpus-per-task=2 --gpus-per-node=4 
> --gpu-mps --time=24:00:00
> 
> Chang
> 
> On 10/11/21 2:09 AM, Åke Sandgren via users wrote:
> On 10/10/21 5:38 PM, Chang Liu via users wrote:
> OMPI v4.1.1-85-ga39a051fd8
> 
> % srun bash -c "cat /proc/self/status|grep Cpus_allowed_list"
> Cpus_allowed_list:  58-59
> Cpus_allowed_list:  106-107
> Cpus_allowed_list:  110-111
> Cpus_allowed_list:  114-115
> Cpus_allowed_list:  16-17
> Cpus_allowed_list:  36-37
> Cpus_allowed_list:  54-55
> ...
> 
> % mpirun bash -c "cat /proc/self/status|grep Cpus_allowed_list"
> Cpus_allowed_list:  0-127
> Cpus_allowed_list:  0-127
> Cpus_allowed_list:  0-127
> Cpus_allowed_list:  0-127
> Cpus_allowed_list:  0-127
> Cpus_allowed_list:  0-127
> Cpus_allowed_list:  0-127
> ...
> Was that run in the same batch job? If not, the data is useless.
> 
> --
> Chang Liu
> Staff Research Physicist
> +1 609 243 3438
> c...@pppl.gov
> Princeton Plasma Physics Laboratory
> 100 Stellarator Rd, Princeton NJ 08540, USA
> 
> --
> Chang Liu
> Staff Research Physicist
> +1 609 243 3438
> c...@pppl.gov<mailto:c...@pppl.gov> <mailto:c...@pppl.gov>
> Princeton Plasma Physics Laboratory
> 100 Stellarator Rd, Princeton NJ 08540, USA
> 
> --
> Chang Liu
> Staff Research Physicist
> +1 609 243 3438
> c...@pppl.gov<mailto:c...@pppl.gov>
> Princeton Plasma Physics Laboratory
> 100 Stellarator Rd, Princeton NJ 08540, USA
> 




Re: [OMPI users] [External] Re: cpu binding of mpirun to follow slurm setting

2021-10-11 Thread Ralph Castain via users
Oh my - that is a pretty strong statement. It depends on what you are trying to 
do, and whether or not Slurm offers a mapping pattern that matches. mpirun 
tends to have a broader range of options, which is why many people use it. It 
also means that your job script is portable and not locked to a specific RM, 
which is important to quite a few users.

However, if Slurm has something you can use/like and you don't need to worry 
about portability, then by all means one should use it.

Just don't assume that everyone fits in that box :-)


On Oct 11, 2021, at 10:40 AM, Chang Liu via users mailto:users@lists.open-mpi.org> > wrote:

OK thank you. Seems that srun is a better option for normal users.

Chang

On 10/11/21 1:23 PM, Ralph Castain via users wrote:
Sorry, your output wasn't clear about cores vs hwthreads. Apparently, your 
Slurm config is setup to use hwthreads as independent cpus - what you are 
calling "logical cores", which is a little confusing.
No, mpirun has no knowledge of what mapping pattern you passed to salloc. We 
don't have any good way of obtaining config information, for one thing - e.g., 
that Slurm is treating hwthreads as cpus. So we can't really interpret what 
they might have done.
Given this clarification, you can probably get what you want with:
mpirun --use-hwthread-cpus --map-by hwthread:pe=2 ..."
On Oct 11, 2021, at 7:35 AM, Chang Liu via users <users@lists.open-mpi.org> wrote:

This is not what I need. The cpu can run 4 threads per core, so "--bind-to 
core" results in one process occupying 4 logical cores.

I want one process to occupy 2 logical cores, so two processes sharing a 
physical core.

I guess there is a way to do that by playing with mapping. I just want to know 
if this is a bug in mpirun, or this feature for interacting with slurm was 
never implemented.

Chang

On 10/11/21 10:07 AM, Ralph Castain via users wrote:
You just need to tell mpirun that you want your procs to be bound to cores, not 
socket (which is the default).
Add "--bind-to core" to your mpirun cmd line
On Oct 10, 2021, at 11:17 PM, Chang Liu via users <users@lists.open-mpi.org> wrote:

Yes they are. This is an interactive job from

salloc -N 1 --ntasks-per-node=64 --cpus-per-task=2 --gpus-per-node=4 --gpu-mps 
--time=24:00:00

Chang

On 10/11/21 2:09 AM, Åke Sandgren via users wrote:
On 10/10/21 5:38 PM, Chang Liu via users wrote:
OMPI v4.1.1-85-ga39a051fd8

% srun bash -c "cat /proc/self/status|grep Cpus_allowed_list"
Cpus_allowed_list:  58-59
Cpus_allowed_list:  106-107
Cpus_allowed_list:  110-111
Cpus_allowed_list:  114-115
Cpus_allowed_list:  16-17
Cpus_allowed_list:  36-37
Cpus_allowed_list:  54-55
...

% mpirun bash -c "cat /proc/self/status|grep Cpus_allowed_list"
Cpus_allowed_list:  0-127
Cpus_allowed_list:  0-127
Cpus_allowed_list:  0-127
Cpus_allowed_list:  0-127
Cpus_allowed_list:  0-127
Cpus_allowed_list:  0-127
Cpus_allowed_list:  0-127
...
Was that run in the same batch job? If not, the data is useless.

--
Chang Liu
Staff Research Physicist
+1 609 243 3438
c...@pppl.gov
Princeton Plasma Physics Laboratory
100 Stellarator Rd, Princeton NJ 08540, USA

--
Chang Liu
Staff Research Physicist
+1 609 243 3438
c...@pppl.gov
Princeton Plasma Physics Laboratory
100 Stellarator Rd, Princeton NJ 08540, USA

-- 
Chang Liu
Staff Research Physicist
+1 609 243 3438
c...@pppl.gov <mailto:c...@pppl.gov> 
Princeton Plasma Physics Laboratory
100 Stellarator Rd, Princeton NJ 08540, USA



Re: [OMPI users] [External] Re: cpu binding of mpirun to follow slurm setting

2021-10-11 Thread Ralph Castain via users
Sorry, your output wasn't clear about cores vs hwthreads. Apparently, your 
Slurm config is setup to use hwthreads as independent cpus - what you are 
calling "logical cores", which is a little confusing.

No, mpirun has no knowledge of what mapping pattern you passed to salloc. We 
don't have any good way of obtaining config information, for one thing - e.g., 
that Slurm is treating hwthreads as cpus. So we can't really interpret what 
they might have done.

Given this clarification, you can probably get what you want with:

mpirun --use-hwthread-cpus --map-by hwthread:pe=2 ..."
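
To verify the result from inside the application, a small sketch (not from the thread, 
Linux-specific) that reports each rank's allowed CPUs, equivalent to the /proc/self/status 
check used above:

/* Sketch: print each rank's allowed CPU list, similar to
 * "cat /proc/self/status | grep Cpus_allowed_list". */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, pos = 0;
    cpu_set_t mask;
    char buf[2048] = "";

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    sched_getaffinity(0, sizeof(mask), &mask);
    for (int cpu = 0; cpu < CPU_SETSIZE && pos < (int)sizeof(buf) - 16; cpu++)
        if (CPU_ISSET(cpu, &mask))
            pos += snprintf(buf + pos, sizeof(buf) - pos, "%d ", cpu);

    printf("rank %d Cpus_allowed_list: %s\n", rank, buf);
    MPI_Finalize();
    return 0;
}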



On Oct 11, 2021, at 7:35 AM, Chang Liu via users mailto:users@lists.open-mpi.org> > wrote:

This is not what I need. The cpu can run 4 threads per core, so "--bind-to 
core" results in one process occupying 4 logical cores.

I want one process to occupy 2 logical cores, so two processes sharing a 
physical core.

I guess there is a way to do that by playing with mapping. I just want to know 
if this is a bug in mpirun, or this feature for interacting with slurm was 
never implemented.

Chang

On 10/11/21 10:07 AM, Ralph Castain via users wrote:
You just need to tell mpirun that you want your procs to be bound to cores, not 
socket (which is the default).
Add "--bind-to core" to your mpirun cmd line
On Oct 10, 2021, at 11:17 PM, Chang Liu via users <users@lists.open-mpi.org> wrote:

Yes they are. This is an interactive job from

salloc -N 1 --ntasks-per-node=64 --cpus-per-task=2 --gpus-per-node=4 --gpu-mps 
--time=24:00:00

Chang

On 10/11/21 2:09 AM, Åke Sandgren via users wrote:
On 10/10/21 5:38 PM, Chang Liu via users wrote:
OMPI v4.1.1-85-ga39a051fd8

% srun bash -c "cat /proc/self/status|grep Cpus_allowed_list"
Cpus_allowed_list:  58-59
Cpus_allowed_list:  106-107
Cpus_allowed_list:  110-111
Cpus_allowed_list:  114-115
Cpus_allowed_list:  16-17
Cpus_allowed_list:  36-37
Cpus_allowed_list:  54-55
...

% mpirun bash -c "cat /proc/self/status|grep Cpus_allowed_list"
Cpus_allowed_list:  0-127
Cpus_allowed_list:  0-127
Cpus_allowed_list:  0-127
Cpus_allowed_list:  0-127
Cpus_allowed_list:  0-127
Cpus_allowed_list:  0-127
Cpus_allowed_list:  0-127
...
Was that run in the same batch job? If not, the data is useless.

--
Chang Liu
Staff Research Physicist
+1 609 243 3438
c...@pppl.gov
Princeton Plasma Physics Laboratory
100 Stellarator Rd, Princeton NJ 08540, USA

-- 
Chang Liu
Staff Research Physicist
+1 609 243 3438
c...@pppl.gov <mailto:c...@pppl.gov> 
Princeton Plasma Physics Laboratory
100 Stellarator Rd, Princeton NJ 08540, USA



Re: [OMPI users] [External] Re: cpu binding of mpirun to follow slurm setting

2021-10-11 Thread Ralph Castain via users
You just need to tell mpirun that you want your procs to be bound to cores, not 
socket (which is the default).

Add "--bind-to core" to your mpirun cmd line


On Oct 10, 2021, at 11:17 PM, Chang Liu via users mailto:users@lists.open-mpi.org> > wrote:

Yes they are. This is an interactive job from

salloc -N 1 --ntasks-per-node=64 --cpus-per-task=2 --gpus-per-node=4 --gpu-mps 
--time=24:00:00

Chang

On 10/11/21 2:09 AM, Åke Sandgren via users wrote:
On 10/10/21 5:38 PM, Chang Liu via users wrote:
OMPI v4.1.1-85-ga39a051fd8

% srun bash -c "cat /proc/self/status|grep Cpus_allowed_list"
Cpus_allowed_list:  58-59
Cpus_allowed_list:  106-107
Cpus_allowed_list:  110-111
Cpus_allowed_list:  114-115
Cpus_allowed_list:  16-17
Cpus_allowed_list:  36-37
Cpus_allowed_list:  54-55
...

% mpirun bash -c "cat /proc/self/status|grep Cpus_allowed_list"
Cpus_allowed_list:  0-127
Cpus_allowed_list:  0-127
Cpus_allowed_list:  0-127
Cpus_allowed_list:  0-127
Cpus_allowed_list:  0-127
Cpus_allowed_list:  0-127
Cpus_allowed_list:  0-127
...
Was that run in the same batch job? If not, the data is useless.

-- 
Chang Liu
Staff Research Physicist
+1 609 243 3438
c...@pppl.gov  
Princeton Plasma Physics Laboratory
100 Stellarator Rd, Princeton NJ 08540, USA



Re: [OMPI users] cpu binding of mpirun to follow slurm setting

2021-10-10 Thread Ralph Castain via users
Could you please include (a) what version of OMPI you are talking about, and 
(b) the binding patterns you observed from both srun and mpirun?


> On Oct 9, 2021, at 6:41 PM, Chang Liu via users  
> wrote:
> 
> Hi,
> 
> I wonder if mpirun can follow the cpu binding settings from slurm, when 
> running under the slurm environment.
> 
> Currently, in my tests, when I run
> 
> srun bash -c "cat /proc/self/status|grep Cpus_allowed_list"
> 
> I saw the correct binding set by salloc. When running
> 
> mpirun bash -c "cat /proc/self/status|grep Cpus_allowed_list"
> 
> The bindings are not correct.
> 
> Thanks,
> 
> Chang
> -- 
> Chang Liu
> Staff Research Physicist
> +1 609 243 3438
> c...@pppl.gov
> Princeton Plasma Physics Laboratory
> 100 Stellarator Rd, Princeton NJ 08540, USA




Re: [OMPI users] OpenMPI 4.0.6 w/GCC 8.5 on CentOS 7.9; "WARNING: There was an error initializing an OpenFabrics device."

2021-08-11 Thread Ralph Castain via users
I'd suggest opening a ticket on the UCX repo itself. This looks to me like UCX 
not recognizing a Mellanox device, or at least not initializing it correctly.


> On Aug 11, 2021, at 8:21 AM, Ryan Novosielski  wrote:
> 
> Thanks. That /is/ one solution, and what I’ll do in the interim since this 
> has to work in at least some fashion, but I would actually like to use UCX if 
> OpenIB is going to be deprecated. How do I find out what’s actually wrong?
> 
> --
> #BlackLivesMatter
> 
> || \\UTGERS,   
> |---*O*---
> ||_// the State| Ryan Novosielski - novos...@rutgers.edu
> || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
> ||  \\of NJ| Office of Advanced Research Computing - MSB C630, 
> Newark
> `'
> 
>> On Jul 29, 2021, at 11:35 AM, Ralph Castain via users 
>>  wrote:
>> 
>> So it _is_ UCX that is the problem! Try using OMPI_MCA_pml=ob1 instead
>> 
>>> On Jul 29, 2021, at 8:33 AM, Ryan Novosielski  wrote:
>>> 
>>> Thanks, Ralph. This /does/ change things, but not very much. I was not 
>>> under the impression that I needed to do that, since when I ran without 
>>> having built against UCX, it warned me about the openib method being 
>>> deprecated. By default, does OpenMPI not use either anymore, and I need to 
>>> specifically call for UCX? Seems strange.
>>> 
>>> Anyhow, I’ve got some variables defined still, in addition to your 
>>> suggestion, for verbosity:
>>> 
>>> [novosirj@amarel-test2 ~]$ env | grep ^OMPI
>>> OMPI_MCA_pml=ucx
>>> OMPI_MCA_opal_common_ucx_opal_mem_hooks=1
>>> OMPI_MCA_pml_ucx_verbose=100
>>> 
>>> Here goes:
>>> 
>>> [novosirj@amarel-test2 ~]$ srun -n 2 --mpi=pmi2 -p oarc  --reservation=UCX 
>>> ./mpihello-gcc-8-openmpi-4.0.6
>>> srun: job 13995650 queued and waiting for resources
>>> srun: job 13995650 has been allocated resources
>>> --
>>> WARNING: There was an error initializing an OpenFabrics device.
>>> 
>>> Local host:   gpu004
>>> Local device: mlx4_0
>>> --
>>> --
>>> WARNING: There was an error initializing an OpenFabrics device.
>>> 
>>> Local host:   gpu004
>>> Local device: mlx4_0
>>> --
>>> [gpu004.amarel.rutgers.edu:29823] 
>>> ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:147 using 
>>> OPAL memory hooks as external events
>>> [gpu004.amarel.rutgers.edu:29824] 
>>> ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:147 using 
>>> OPAL memory hooks as external events
>>> [gpu004.amarel.rutgers.edu:29823] 
>>> ../../../../../openmpi-4.0.6/ompi/mca/pml/ucx/pml_ucx.c:197 
>>> mca_pml_ucx_open: UCX version 1.5.2
>>> [gpu004.amarel.rutgers.edu:29824] 
>>> ../../../../../openmpi-4.0.6/ompi/mca/pml/ucx/pml_ucx.c:197 
>>> mca_pml_ucx_open: UCX version 1.5.2
>>> [gpu004.amarel.rutgers.edu:29823] 
>>> ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 
>>> self/self: did not match transport list
>>> [gpu004.amarel.rutgers.edu:29823] 
>>> ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 tcp/eno1: 
>>> did not match transport list
>>> [gpu004.amarel.rutgers.edu:29823] 
>>> ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 tcp/ib0: 
>>> did not match transport list
>>> [gpu004.amarel.rutgers.edu:29824] 
>>> ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 
>>> self/self: did not match transport list
>>> [gpu004.amarel.rutgers.edu:29823] 
>>> ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 
>>> rc/mlx4_0:1: did not match transport list
>>> [gpu004.amarel.rutgers.edu:29823] 
>>> ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 
>>> ud/mlx4_0:1: did not match transport list
>>> [gpu004.amarel.rutgers.edu:29823] 
>>> ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 mm/sysv: 
>>> did not match transport list
>>> [gpu004.amarel.rutgers.edu:29823] 
>>> ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 mm/posix: 
>>> did n

Re: [OMPI users] OpenMPI 4.0.6 w/GCC 8.5 on CentOS 7.9; "WARNING: There was an error initializing an OpenFabrics device."

2021-07-29 Thread Ralph Castain via users
> [gpu004.amarel.rutgers.edu:29824] 
> ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 mm/sysv: 
> did not match transport list
> [gpu004.amarel.rutgers.edu:29824] 
> ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 mm/posix: 
> did not match transport list
> [gpu004.amarel.rutgers.edu:29824] 
> ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 cma/cma: 
> did not match transport list
> [gpu004.amarel.rutgers.edu:29824] 
> ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:311 support 
> level is none
> --
> No components were able to be opened in the pml framework.
> 
> This typically means that either no components of this type were
> installed, or none of the installed components can be loaded.
> Sometimes this means that shared libraries required by these
> components are unable to be found/loaded.
> 
>  Host:  gpu004
>  Framework: pml
> --
> [gpu004.amarel.rutgers.edu:29824] PML ucx cannot be selected
> slurmstepd: error: *** STEP 13995650.0 ON gpu004 CANCELLED AT 
> 2021-07-29T11:31:19 ***
> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> srun: error: gpu004: tasks 0-1: Exited with exit code 1
> 
> --
> #BlackLivesMatter
> ____
> || \\UTGERS,   
> |---*O*---
> ||_// the State| Ryan Novosielski - novos...@rutgers.edu
> || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
> ||  \\of NJ| Office of Advanced Research Computing - MSB C630, 
> Newark
> `'
> 
>> On Jul 29, 2021, at 8:34 AM, Ralph Castain via users 
>>  wrote:
>> 
>> Ryan - I suspect what Sergey was trying to say was that you need to ensure 
>> OMPI doesn't try to use the OpenIB driver, or at least that it doesn't 
>> attempt to initialize it. Try adding
>> 
>> OMPI_MCA_pml=ucx
>> 
>> to your environment.
>> 
>> 
>>> On Jul 29, 2021, at 1:56 AM, Sergey Oblomov via users 
>>>  wrote:
>>> 
>>> Hi
>>> 
>>> This issue arrives from BTL OpenIB, not related to UCX
>>> 
>>> From: users  on behalf of Ryan 
>>> Novosielski via users 
>>> Date: Thursday, 29 July 2021, 08:25
>>> To: users@lists.open-mpi.org 
>>> Cc: Ryan Novosielski 
>>> Subject: [OMPI users] OpenMPI 4.0.6 w/GCC 8.5 on CentOS 7.9; "WARNING: 
>>> There was an error initializing an OpenFabrics device."
>>> 
>>> Hi there,
>>> 
>>> New to using UCX, as a result of having built OpenMPI without it and 
>>> running tests and getting warned. Installed UCX from the distribution:
>>> 
>>> [novosirj@amarel-test2 ~]$ rpm -qa ucx
>>> ucx-1.5.2-1.el7.x86_64
>>> 
>>> …and rebuilt OpenMPI. Built fine. However, I’m getting some pretty 
>>> unhelpful messages about not using the IB card. I looked around the 
>>> internet some and set a couple of environment variables to get a little 
>>> more information:
>>> 
>>> OMPI_MCA_opal_common_ucx_opal_mem_hooks=1
>>> export OMPI_MCA_pml_ucx_verbose=100
>>> 
>>> Here’s what happens:
>>> 
>>> [novosirj@amarel-test2 ~]$ srun -n 2 --mpi=pmi2 -p oarc  --reservation=UCX 
>>> ./mpihello-gcc-8-openmpi-4.0.6 
>>> srun: job 13993927 queued and waiting for resources
>>> srun: job 13993927 has been allocated resources
>>> --
>>> WARNING: There was an error initializing an OpenFabrics device.
>>> 
>>> Local host:   gpu004
>>> Local device: mlx4_0
>>> --
>>> --
>>> WARNING: There was an error initializing an OpenFabrics device.
>>> 
>>> Local host:   gpu004
>>> Local device: mlx4_0
>>> --
>>> [gpu004.amarel.rutgers.edu:02327] 
>>> ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:147 using 
>>> OPAL memory hooks as external events
>>> [gpu004.amarel.rutgers.edu:02327] 
>>> ../../../../../openmpi-4.0.6/ompi/mca/pml/ucx/pml_ucx.c:197 
>>> mca_pml_ucx_open: UCX version 1.5.2
>>> [gpu004.amarel.rutgers.edu:02326] 
>>> ../../..

Re: [OMPI users] OpenMPI 4.0.6 w/GCC 8.5 on CentOS 7.9; "WARNING: There was an error initializing an OpenFabrics device."

2021-07-29 Thread Ralph Castain via users
Ryan - I suspect what Sergey was trying to say was that you need to ensure OMPI 
doesn't try to use the OpenIB driver, or at least that it doesn't attempt to 
initialize it. Try adding

OMPI_MCA_pml=ucx

to your environment.
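
For example, a minimal sketch (the binary name and srun options are just the ones from the earlier message and may differ on your system):

 $ export OMPI_MCA_pml=ucx
 $ srun -n 2 --mpi=pmi2 ./mpihello-gcc-8-openmpi-4.0.6

 # or, when launching with mpirun, the equivalent command-line form:
 $ mpirun --mca pml ucx -np 2 ./mpihello-gcc-8-openmpi-4.0.6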


On Jul 29, 2021, at 1:56 AM, Sergey Oblomov via users mailto:users@lists.open-mpi.org> > wrote:

Hi
 This issue arrives from BTL OpenIB, not related to UCX
 From: users <users-boun...@lists.open-mpi.org> on behalf of Ryan Novosielski via users <users@lists.open-mpi.org>
Date: Thursday, 29 July 2021, 08:25
To: users@lists.open-mpi.org
Cc: Ryan Novosielski <novos...@rutgers.edu>
Subject: [OMPI users] OpenMPI 4.0.6 w/GCC 8.5 on CentOS 7.9; "WARNING: There 
was an error initializing an OpenFabrics device."

Hi there,

New to using UCX, as a result of having built OpenMPI without it and running 
tests and getting warned. Installed UCX from the distribution:

[novosirj@amarel-test2 ~]$ rpm -qa ucx
ucx-1.5.2-1.el7.x86_64

…and rebuilt OpenMPI. Built fine. However, I’m getting some pretty unhelpful 
messages about not using the IB card. I looked around the internet some and set 
a couple of environment variables to get a little more information:

OMPI_MCA_opal_common_ucx_opal_mem_hooks=1
export OMPI_MCA_pml_ucx_verbose=100

Here’s what happens:

[novosirj@amarel-test2 ~]$ srun -n 2 --mpi=pmi2 -p oarc  --reservation=UCX 
./mpihello-gcc-8-openmpi-4.0.6 
srun: job 13993927 queued and waiting for resources
srun: job 13993927 has been allocated resources
--
WARNING: There was an error initializing an OpenFabrics device.

 Local host:   gpu004
 Local device: mlx4_0
--
--
WARNING: There was an error initializing an OpenFabrics device.

 Local host:   gpu004
 Local device: mlx4_0
--
[gpu004.amarel.rutgers.edu:02327  ] 
../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:147 using OPAL 
memory hooks as external events
[gpu004.amarel.rutgers.edu:02327  ] 
../../../../../openmpi-4.0.6/ompi/mca/pml/ucx/pml_ucx.c:197 mca_pml_ucx_open: 
UCX version 1.5.2
[gpu004.amarel.rutgers.edu:02326  ] 
../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:147 using OPAL 
memory hooks as external events
[gpu004.amarel.rutgers.edu:02326  ] 
../../../../../openmpi-4.0.6/ompi/mca/pml/ucx/pml_ucx.c:197 mca_pml_ucx_open: 
UCX version 1.5.2
[gpu004.amarel.rutgers.edu:02326  ] 
../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 self/self: 
did not match transport list
[gpu004.amarel.rutgers.edu:02326  ] 
../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 tcp/eno1: did 
not match transport list
[gpu004.amarel.rutgers.edu:02327  ] 
../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 self/self: 
did not match transport list
[gpu004.amarel.rutgers.edu:02326  ] 
../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 tcp/ib0: did 
not match transport list
[gpu004.amarel.rutgers.edu:02326  ] 
../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 rc/mlx4_0:1: 
did not match transport list
[gpu004.amarel.rutgers.edu:02326  ] 
../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 ud/mlx4_0:1: 
did not match transport list
[gpu004.amarel.rutgers.edu:02326  ] 
../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 mm/sysv: did 
not match transport list
[gpu004.amarel.rutgers.edu:02326  ] 
../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 mm/posix: did 
not match transport list
[gpu004.amarel.rutgers.edu:02326  ] 
../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 cma/cma: did 
not match transport list
[gpu004.amarel.rutgers.edu:02326  ] 
../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:311 support level 
is none
[gpu004.amarel.rutgers.edu:02326  ] 
../../../../../openmpi-4.0.6/ompi/mca/pml/ucx/pml_ucx.c:268 mca_pml_ucx_close
[gpu004.amarel.rutgers.edu:02327  ] 
../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 tcp/eno1: did 
not match transport list
[gpu004.amarel.rutg

Re: [OMPI users] How to set parameters to utilize multiple network interfaces?

2021-06-11 Thread Ralph Castain via users
You can still use "map-by" to get what you want since you know there are four 
interfaces per node - just do "--map-by ppr:8:node". Note that you definitely 
do NOT want to list those multiple IP addresses in your hostfile - all you are 
doing is causing extra work for mpirun as it has to DNS resolve those addresses 
back down to their common host. We then totally ignore the fact that you 
specified those addresses, so it is accomplishing nothing (other than creating 
extra work).
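
As a sketch of what that looks like (hostnames, slot counts and the benchmark path are placeholders, not taken from your setup):

 $ cat hostfile
 host1 slots=8
 host2 slots=8
 # 8 ranks per node, 16 ranks total across the two hosts
 $ mpirun -np 16 -hostfile hostfile --map-by ppr:8:node ./osu_mbw_mr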

You'll need to talk to AWS about how to drive striping across the interfaces. 
It sounds like they are automatically doing it, but perhaps not according to 
the algorithm you are seeking (i.e., they may not make such a linear assignment 
as you describe).


> On Jun 8, 2021, at 1:23 PM, John Moore via users  
> wrote:
> 
> Hello,
> 
> I am trying to run OpenMPI on AWSs new p4d instances. These instances have 4x 
> 100Gb/s network interfaces, each with their own ipv4 address.
> 
> I am primarily testing the bandwidth with the osu_micro_benchmarks test 
> suite. Specifically I am running the osu_bibw and osu_mbw_mr tests to 
> calculate the peak aggregate bandwidth I can achieve between two instances.
> 
> I have found that the osu_bibw test can only achieve the throughput of a 
> single network interface (100 Gb/s). This is the command I am using:
> /opt/amazon/openmpi/bin/mpirun -v -x FI_EFA_USE_DEVICE_RDMA=1 -x 
> FI_PROVIDER="efa" -np 2 -host host1,host2 --map-by node --mca 
> btl_baes_verbose 30 --mca btl tcp,self --mca btl_tcp_if_exclude lo,docker0  
> ./osu_bw -m 4000
> 
> As far as I understand it, openmpi should be detecting the four interfaces 
> and striping data across them, correct?
> 
> I have found that the osu_mbw_mr test can achieve 4x the bandwidth of a 
> single network interface, if the configuration is correct. For example, I am 
> using the following command:
> /opt/amazon/openmpi/bin/mpirun -v -x FI_EFA_USE_DEVICE_RDMA=1 -x 
> FI_PROVIDER="efa" -np 8 -hostfile hostfile5 --map-by node --mca 
> btl_baes_verbose 30 --mca btl tcp,self --mca btl_tcp_if_exclude lo,docker0  
> ./osu_mbw_mr
> This will run four pairs of send/recv calls across the different nodes. 
> hostfile5 contains all 8 local ipv4 addresses associated with the four nodes. 
> I believe this is why I am getting the expected performance.
> 
> So, now I want to run a real use case, but I can't use --map-by node. I want 
> to run two ranks per ipv4 address (interface) with the ranks ordered 
> sequentially according to the hostfile (the first 8 ranks will belong to the 
> first host, but the ranks will be divided among four ipv4 addresses to 
> utilize the full network bandwidth). But OpenMPI won't allow me to assign 
> slots=2 to each ipv4 address because they all belong to the same host. 
> 
> Any recommendation would be greatly appreciated.
> 
> Thanks,
> John




Re: [OMPI users] [Help] Must orted exit after all spawned processes exit

2021-05-19 Thread Ralph Castain via users
To answer your specific questions:

The backend daemons (orted) will not exit until all locally spawned procs exit. 
This is not configurable - for one thing, OMPI procs will suicide if they see 
the daemon depart, so it makes no sense to have the daemon fail if a proc 
terminates. The logic behind this behavior spans multiple parts of the code 
base, I'm afraid.

On May 17, 2021, at 7:03 AM, Jeff Squyres (jsquyres) via users 
mailto:users@lists.open-mpi.org> > wrote:

FYI: general Open MPI questions are better sent to the user's mailing list.

Up through the v4.1.x series, the "orted" is a general helper process that Open 
MPI uses on the back-end.  It will not quit until all of its children have 
died.  Open MPI's run time is designed with the intent that some external 
helper will be there for the entire duration of the job; there is no option to 
run without one.

Two caveats:

1. In Open MPI v5.0.x, from the user's perspective, "orted" has been renamed to 
be "prted".  Since this is 99.999% behind the scenes, most users won't notice 
the difference.

2. You can run without "orted" (or "prted") if you use a different run-time 
environment (e.g., SLURM).  In this case, you'll use that environment's 
launcher (e.g., srun or sbatch in SLURM environments) to directly launch MPI 
processes -- you won't use "mpirun" at all.  Fittingly, this is called "direct 
launch" in Open MPI parlance (i.e., using another run-time's daemons to launch 
processes instead of first launching orteds (or prteds).
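
A minimal sketch of direct launch under Slurm (assuming your Open MPI was built with the matching PMIx/PMI2 support, and "./my_app" is a placeholder for your MPI binary):

 # PMIx-based direct launch: no mpirun, no orted/prted
 $ srun --mpi=pmix -n 4 ./my_app
 # or, on older Slurm / Open MPI combinations:
 $ srun --mpi=pmi2 -n 4 ./my_app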



On May 16, 2021, at 8:34 AM, 叶安华 mailto:yean...@sensetime.com> > wrote:

Code snippet: 

 # sleep.sh
sleep 10001 &
/bin/sh son_sleep.sh
sleep 10002

 # son_sleep.sh
sleep 10003 &
sleep 10004 &

 thanks
Anhua
  From: 叶安华 mailto:yean...@sensetime.com> >
Date: Sunday, May 16, 2021 at 20:31
To: "jsquy...@cisco.com  " mailto:jsquy...@cisco.com> >
Subject: [Help] Must orted exit after all spawned processes exit
 Dear Jeff, 
 Sorry to bother you but I am really curious about the conditions on which 
orted exits in the below scenario, and I am looking forward to hearing from you.
 Scenario description:
· Step 1: start a remote process via "mpirun -np 1 -host 10.211.55.4 sh 
sleep.sh"
· Step 2: check pstree in the remote host:

· Step 3: the mpirun process in step 1 does not exit until I kill all 
the sleeping processes, which are 15479 15481 15482 15483
 To conclude, my questions are as follows:
1.  Must orted wait until all spawned processes exit?
2.  Is this behavior configurable? What if I want orted to exit immediately 
after any one of the spawned processes exits?
3.  I did not find the specific logic for orted waiting for spawned 
processes to exit; I hope I can get some hints from you.
 PS (scripts):

  thanks
Anhua
 

-- 
Jeff Squyres
jsquy...@cisco.com  






Re: [OMPI users] unable to launch a job on a system with OmniPath

2021-05-19 Thread Ralph Castain via users
The original configure line is correct ("--without-orte") - just a typo in the 
later text.

You may be running into some issues with Slurm's built-in support for OMPI. Try 
running it with OMPI's "mpirun" instead and see if you get better performance. 
You'll have to reconfigure to remove the "--without-orte" and 
"--with-ompi-pmix-rte" options. I would also recommend removing the 
"--with-pmix=external --with-libevent=external --with-hwloc=xxx 
--with-libevent=xxx" entries.

In other words, get down to a vanilla installation so we know what we are 
dealing with - otherwise, it gets very hard to help you.
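
Something along these lines, as a sketch only (paths are placeholders and only the fabric-related flags from the original line are kept; everything else is left at the defaults):

 $ ./configure --prefix=$INSTALL_PATH --with-psm2 --with-ofi=$EBROOTLIBFABRIC --with-slurm
 $ make -j && make install
 # then launch with OMPI's own mpirun rather than srun
 $ $INSTALL_PATH/bin/mpirun -np 2 ./osu_bw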


On May 19, 2021, at 7:09 AM, Jorge D'Elia via users mailto:users@lists.open-mpi.org> > wrote:

- Original message -
De: "Pavel Mezentsev via users" mailto:users@lists.open-mpi.org> >
To: users@lists.open-mpi.org

CC: "Pavel Mezentsev" mailto:pavel.mezent...@gmail.com> >
Sent: Wednesday, 19 May 2021 10:53:50
Subject: Re: [OMPI users] unable to launch a job on a system with OmniPath

It took some time but my colleague was able to build OpenMPI and get it
working with OmniPath, however the performance is quite disappointing.
The configuration line used was the following: ./configure
--prefix=$INSTALL_PATH  --build=x86_64-pc-linux-gnu
--host=x86_64-pc-linux-gnu --enable-shared --with-hwloc=$EBROOTHWLOC

--with-psm2 --with-ofi=$EBROOTLIBFABRIC --with-libevent=$EBROOTLIBEVENT
--without-orte --disable-oshmem --with-gpfs --with-slurm
--with-pmix=external --with-libevent=external --with-ompi-pmix-rte

/usr/bin/srun --cpu-bind=none --mpi=pspmix --ntasks-per-node 1 -n 2 xenv -L
Architecture/KNL -L GCC -L OpenMPI env OMPI_MCA_btl_base_verbose="99"
OMPI_MCA_mtl_base_verbose="99" numactl --physcpubind=1 ./osu_bw
...
[node:18318] select: init of component ofi returned success
[node:18318] mca: base: components_register: registering framework mtl
components
[node:18318] mca: base: components_register: found loaded component ofi

[node:18318] mca: base: components_register: component ofi register
function successful
[node:18318] mca: base: components_open: opening mtl components

[node:18318] mca: base: components_open: found loaded component ofi

[node:18318] mca: base: components_open: component ofi open function
successful
[node:18318] mca:base:select: Auto-selecting mtl components
[node:18318] mca:base:select:(  mtl) Querying component [ofi]

[node:18318] mca:base:select:(  mtl) Query of component [ofi] set priority
to 25
[node:18318] mca:base:select:(  mtl) Selected component [ofi]

[node:18318] select: initializing mtl component ofi
[node:18318] mtl_ofi_component.c:378: mtl:ofi:provider: hfi1_0
...
# OSU MPI Bandwidth Test v5.7
# Size  Bandwidth (MB/s)
1   0.05
2   0.10
4   0.20
8   0.41
16  0.77
32  1.54
64  3.10
128 6.09
256    12.39
512    24.23
1024   46.85
2048   87.99
4096  100.72
8192  139.91
16384 173.67
32768 197.82
65536 210.15
131072    215.76
262144    214.39
524288    219.23
1048576   223.53
2097152   226.93
4194304   227.62

If I test directly with `ib_write_bw` I get
#bytes #iterations    BW peak[MB/sec]    BW average[MB/sec]
MsgRate[Mpps]
Conflicting CPU frequency values detected: 1498.727000 != 1559.017000. CPU
Frequency is not max.
65536  5000 2421.04    2064.33    0.033029

I also tried adding `OMPI_MCA_mtl="psm2"` however the job crashes in that
case:
```
Error obtaining unique transport key from ORTE
(orte_precondition_transports not present in

the environment).
```
Which is a bit puzzling considering that OpenMPI was build with
`--witout-orte`

Dear Pavel, I can't help you but just in case in the text:

Which is a bit puzzling considering that OpenMPI was build with
`--witout-orte`

it should be `--without-orte` ??


Regards, Jorge D' Elia.
--
CIMEC (UNL-CONICET), http://www.cimec.org.ar/  
Predio CONICET-Santa Fe, Colec. Ruta Nac. 168, 
Paraje El Pozo, 3000, Santa Fe, ARGENTINA. 
Tel +54-342-4511594/95 ext 7062, fax: +54-342-4511169


What am I missing and how can I improve the performance?

Regards, Pavel Mezentsev.

On Mon, May 10, 2021 at 6:20 PM Heinz, Michael William 
<michael.william.he...@cornelisnetworks.com> wrote:

That warning is an annoying bit of cruft from the openib / verbs provider
that can be ignored. (Actually, I recommend using "--btl ^openib" to
suppress the warning.)



That said, there is a known issue with selecting PSM2 and OMPI 4.1.0. I’m
not sure that that’s the problem you’re hitting, though, because you really
haven’t provi

Re: [OMPI users] How do I launch workers by our private protocol?

2021-04-21 Thread Ralph Castain via users
I'm not sure we support what you are wanting to do.

You can direct mpiexec to use a specified script to launch its daemons on 
remote nodes. The daemons will need to connect back via TCP to mpiexec. The 
daemons are responsible for fork/exec'ing the local MPI application procs on 
each node. Those procs connect back to their daemon via TCP, but only locally 
on the node.

mpiexec cannot launch application procs directly on another node. It needs the 
daemon to support it.

If that fits into your work environment, then you should be okay.
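
As a hedged sketch of that mechanism ("my_launcher.sh" is a hypothetical wrapper you would provide; it is invoked roughly as "my_launcher.sh <hostname> <orted command line...>" and must run that command on the target node):

 $ mpiexec --mca plm_rsh_agent /path/to/my_launcher.sh \
           -np 8 --hostfile hosts ./my_app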

On Apr 20, 2021, at 12:22 AM, hihijo07 via users mailto:users@lists.open-mpi.org> > wrote:


Hello everyone,

In my working place, we have used a tool to launch user's job like schedulers.

Recently, we have encountered a requirement for our users to launch 
MPI jobs, since additional use cases from new users are coming into our 
computing environment. 

The solution we are looking for is: launch a script on a master host via mpiexec; 
mpiexec then launches worker processes on some other hosts via a script 
or binary that we made internally. The executable might use sockets to contact 
a process on each remote host.

In the case of LSF, mpiexec can launch processes via blaunch if we build OpenMPI 
with LSF support, so I think we need that sort of behavior.

Can I launch processes on remote hosts from a master host by calling mpiexec 
with our executable?

Thanks.

Sent from my Galaxy




Re: [OMPI users] Dynamic process allocation hangs

2021-03-25 Thread Ralph Castain via users
Hmmm...disturbing. The changes I made have somehow been lost. I'll have to redo 
it - will get back to you when it is restored.


On Mar 25, 2021, at 2:54 PM, L Lutret mailto:lu.lut...@gmail.com> > wrote:

Hi Ralph,

Thanks for your response. I tried with the master branch a very simple spawn 
from a singleton, in three ways: 

a) running with a worker host with the add-host key in MPI_Info_set()
b) running with a worker host with the new PMIX_ADD_HOSTFILE key in 
MPI_Info_set()
c) running just in localhost  -i.e. MPI_Comm_spawn( ... MPI_INFO_NULL ...)-

but the output is the same; this error message:

prte: Error: unknown option "--singleton"
Type 'prte --help' for usage.
[osboxes:06532] OPAL ERROR: Error in file dpm/dpm.c at line 2168
[osboxes:0] *** An error occurred in MPI_Comm_spawn
[osboxes:0] *** reported by process [1835008000,0]
[osboxes:0] *** on communicator MPI_COMM_SELF
[osboxes:0] *** MPI_ERR_UNKNOWN: unknown error
[osboxes:0] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will 
now abort,
[osboxes:0] ***    and MPI will try to terminate your MPI job as well)

Thank you very much for your help. 
Regards.

 
On Wed, Mar 24, 2021 at 10:07 AM Ralph Castain mailto:r...@open-mpi.org> > wrote:
Apologies for the very long delay in response. This has been verified fixed in 
OMPI's master branch that is to be released as v5.0 in the near future. 
Unfortunately, there are no plans to backport that fix to earlier release 
series. We therefore recommend that you upgrade to v5.0 if you retain interest 
in this feature.

Again, our apologies for the delayed response. You are welcome to use the 
nightly tarballs (https://www.open-mpi.org/nightly/v5.0.x/ or 
https://www.open-mpi.org/nightly/master/) in the interim, and please do let us 
know if the problem persists.
Ralph


On Dec 13, 2019, at 11:34 AM, L Lutret via users mailto:users@lists.open-mpi.org> > wrote:

Hello all. Does anyone have a clue about this issue? Thanks.

-- Forwarded message -
From: L Lutret mailto:lu.lut...@gmail.com> >
Date: Tue, Oct 8, 2019 at 12:59 PM
Subject: Dynamic process allocation hangs
To: mailto:users@lists.open-mpi.org> >


Hello all. I have started some tests with OpenMPI 4.0.1. I have two machines, one 
local, the other remote, connected via ssh. A basic test (hello.c) runs OK 
locally and remotely with mpirun. But I need to run a program without mpirun 
and create some processes with spawn. Here are some examples of what I get.



My hostfile:

cat hostfile



    localhost slots=4

    slave1 slots=4



If I set this: 



    MPI_Info_set( info, "add-hostfile", "hostfile" );

    MPI_Info_set( info, "npernode", "3" );



And I run 6 processes (i.e. MPI_Comm_spawn() receives 6 processes to run):



     ./dyamic.o



It runs OK: 4 processes local and 3 remote



Now, If I set (without add-hostfile and npernode):



      MPI_Info_set( info, "add-host", "slave1,slave1,slave1,slave1" );



And I run 4 processes... it hangs, but with top I can see one running process 
locally and 4 on the remote host (slave1), which I think is OK. After 
a while it throws this: 



“A request has timed out and will therefore fail:



 Operation: LOOKUP: orted/pmix/pmix_server_pub.c:345



Your job may terminate as a result of this problem. You may want to

adjust the MCA parameter pmix_server_max_wait and try again. If this

occurred during a connect/accept operation, you can adjust that time

using the pmix_base_exchange_timeout parameter.

--

[master:22881] *** An error occurred in MPI_Comm_spawn

[master:22881] *** reported by process [63766529,0]

[master:22881] *** on communicator MPI_COMM_WORLD

[master:22881] *** MPI_ERR_UNKNOWN: unknown error

[master:22881] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will 
now abort,

[master:22881] *** and potentially your MPI job)”



Watching with top now, there are no processes running.

I really need this type of allocation. Any help will be very much 
appreciated. Thanks in advance.





Re: [OMPI users] Dynamic process allocation hangs

2021-03-24 Thread Ralph Castain via users
Apologies for the very long delay in response. This has been verified fixed in 
OMPI's master branch that is to be released as v5.0 in the near future. 
Unfortunately, there are no plans to backport that fix to earlier release 
series. We therefore recommend that you upgrade to v5.0 if you retain interest 
in this feature.

Again, our apologies for the delayed response. You are welcome to use the 
nightly tarballs (https://www.open-mpi.org/nightly/v5.0.x/ or 
https://www.open-mpi.org/nightly/master/) in the interim, and please do let us 
know if the problem persists.
Ralph


On Dec 13, 2019, at 11:34 AM, L Lutret via users mailto:users@lists.open-mpi.org> > wrote:

Hello all. Does anyone have a clue about this issue? Thanks.

-- Forwarded message -
From: L Lutret mailto:lu.lut...@gmail.com> >
Date: Tue, Oct 8, 2019 at 12:59 PM
Subject: Dynamic process allocation hangs
To: mailto:users@lists.open-mpi.org> >


Hello all. I have started some tests with OpenMPI 4.0.1. I have two machines, one 
local, the other remote, connected via ssh. A basic test (hello.c) runs OK 
locally and remotely with mpirun. But I need to run a program without mpirun 
and create some processes with spawn. Here are some examples of what I get.



My hostfile:

cat hostfile



    localhost slots=4

    slave1 slots=4



If I set this: 



    MPI_Info_set( info, "add-hostfile", "hostfile" );

    MPI_Info_set( info, "npernode", "3" );



And I run 6 processes (i.e. MPI_Comm_spawn() receives 6 processes to run):



     ./dyamic.o



It runs OK: 4 processes local and 3 remote



Now, If I set (without add-hostfile and npernode):



      MPI_Info_set( info, "add-host", "slave1,slave1,slave1,slave1" );



And I run 4 processes... it hangs, but with top I can see one running process 
locally and 4 on the remote host (slave1), which I think is OK. After 
a while it throws this: 



“A request has timed out and will therefore fail:



 Operation: LOOKUP: orted/pmix/pmix_server_pub.c:345



Your job may terminate as a result of this problem. You may want to

adjust the MCA parameter pmix_server_max_wait and try again. If this

occurred during a connect/accept operation, you can adjust that time

using the pmix_base_exchange_timeout parameter.

--

[master:22881] *** An error occurred in MPI_Comm_spawn

[master:22881] *** reported by process [63766529,0]

[master:22881] *** on communicator MPI_COMM_WORLD

[master:22881] *** MPI_ERR_UNKNOWN: unknown error

[master:22881] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will 
now abort,

[master:22881] *** and potentially your MPI job)”



Watching with top now, there are no processes running.

I really need this type of allocation. Any help will be very much 
appreciated. Thanks in advance.




Re: [OMPI users] building openshem on opa

2021-03-22 Thread Ralph Castain via users
You did everything right - the OSHMEM implementation in OMPI only supports UCX 
as it is essentially a Mellanox offering. I think the main impediment to 
broadening it is simply interest and priority on the part of the non-UCX 
developers.

> On Mar 22, 2021, at 7:51 AM, Michael Di Domenico via users 
>  wrote:
> 
> i can build and run openmpi on an opa network just fine, but it turns
> out building openshmem fails.  the message is (no spml) found
> 
> looking at the config log it looks like it tries to build spml ikrit
> and ucx which fail.  i turn ucx off because it doesn't support opa and
> isn't needed.
> 
> so this message is really just a confirmation that openshmem and opa
> are not capable of being built or did i do something wrong
> 
> and a curiosity if anyone knows what kind of effort would be involved
> in getting it to work




Re: [OMPI users] How do you change ports used? [EXT]

2021-03-19 Thread Ralph Castain via users
Let me briefly explain how MPI jobs start. mpirun launches a set of daemons, 
one per node. Each daemon has a "phone home" address passed to it on its cmd 
line. It opens a port (obtained from its local OS) and connects back to the 
port provided on its cmd line. This establishes a connection back to mpirun. 
This set of ports is the "oob" set.

mpirun then sends out a "launch msg" to every daemon telling them what ranks to 
start. When a daemon starts a rank, it provides (in the environment) its local 
port so that the rank can connect back to it. Once that connection is 
established, each rank opens its own ports (obtained independently from the 
local OS) for use by MPI - these are the "btl" ports. The rank sends that port 
information to its local daemon, and then the daemons do a global exchange so 
that every daemon winds up with a complete map of ranks to ports. This map is 
provided back to each rank.

So at no point does anyone need to know what ports are available on other hosts 
- they simply receive info on what port each rank is using. The problem here is 
that one or more of those ranks was unable to get a port in the range you 
specified because they were all apparently occupied.

If you don't have firewalls, then why are you trying to restrict the port range?
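
If you really do need to pin both layers to fixed ranges (for example, through a firewall), a sketch would be the following; the ranges are placeholders and should be comfortably larger than the number of processes per node:

 $ mpirun --mca oob_tcp_dynamic_ipv4_ports 46200-46299 \
          --mca btl_tcp_port_min_v4 46300 --mca btl_tcp_port_range_v4 100 \
          -np 16 ./my_app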


On Mar 19, 2021, at 5:47 AM, Sendu Bala mailto:s...@sanger.ac.uk> > wrote:

No firewall between nodes in the cluster.

OMPI may be asking localhost for available ports, but is it checking those 
ports are also available on all the other hosts it’s going to run on?


On 18 Mar 2021, at 15:57, Ralph Castain via users mailto:users@lists.open-mpi.org> > wrote:

Hmmm...then you have something else going on. By default, OMPI will ask the 
local OS for an available port and use it. You only need to specify ports when 
working thru a firewall.

Do you have firewalls on this cluster?


On Mar 18, 2021, at 8:55 AM, Sendu Bala mailto:s...@sanger.ac.uk> > wrote:

Yes, that’s the trick. I’m going to have to check port usage on all hosts and 
pick suitable ranges just-in-time - and hope I don’t hit a race condition with 
other users of the cluster.

Does mpiexec not have this kind of functionality built in? When I use it with 
no port options set (pure default), it just doesn’t function (I’m guessing 
because it chose “bad” or in-use ports).



On 18 Mar 2021, at 14:11, Ralph Castain via users mailto:users@lists.open-mpi.org> > wrote:

Hard to say - unless there is some reason, why not make it large enough to not 
be an issue? You may have to experiment a bit as there is nothing to guarantee 
that other processes aren't occupying those regions.



On Mar 18, 2021, at 2:13 AM, Sendu Bala mailto:s...@sanger.ac.uk> > wrote:

Thanks, it made it work when I was running “true” as a test, but then my real 
MPI app failed with:

[node-5-8-2][[48139,1],0][btl_tcp_component.c:966:mca_btl_tcp_component_create_listen]
 bind() failed: no port available in the range [46107..46139]
--
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[48139,1],1]) is on host: node-12-6-2
  Process 2 ([[48139,1],0]) is on host: node-5-8-2
  BTLs attempted: self tcp

Your MPI job is now going to abort; sorry.


This was when running with 16 cores, so I thought  a 32 port range would be 
fine. Is this telling me I have to make it a 33 port range, have different 
ranges for oob and btl, or that some other unrelated software is using some 
ports in my range?


(I changed my range from my previous post, because using that range resulted in 
the issue I posted about here before, where mpirun just does nothing for 5mins 
and then terminates itself, without any error messages.)


Cheers,
Sendu.


On 17 Mar 2021, at 13:25, Ralph Castain via users mailto:users@lists.open-mpi.org> > wrote:

What you are missing is that there are _two_ messaging layers in the system. 
You told the btl/tcp layer to use the specified ports, but left the oob/tcp one 
unspecified. You need to add

oob_tcp_dynamic_ipv4_ports = 46207-46239

or whatever range you want to specify

Note that if you want the btl/tcp layer to use those other settings (e.g., 
keepalive_time), then you'll need to set those as well. The names of the 
variables may not match between the layers - you'll need to use ompi_info to 
find the names and params available for each layer.


On Mar 16, 2021, at 2:43 AM, Vincent via users mailto:users@lists.open-mpi.org> > wrote:

On 09/03/2021 11:23, Sendu Bala via users wrote:
When u

Re: [OMPI users] How do you change ports used? [EXT]

2021-03-18 Thread Ralph Castain via users
Hmmm...then you have something else going on. By default, OMPI will ask the 
local OS for an available port and use it. You only need to specify ports when 
working thru a firewall.

Do you have firewalls on this cluster?


On Mar 18, 2021, at 8:55 AM, Sendu Bala mailto:s...@sanger.ac.uk> > wrote:

Yes, that’s the trick. I’m going to have to check port usage on all hosts and 
pick suitable ranges just-in-time - and hope I don’t hit a race condition with 
other users of the cluster.

Does mpiexec not have this kind of functionality built in? When I use it with 
no port options set (pure default), it just doesn’t function (I’m guessing 
because it chose “bad” or in-use ports).



On 18 Mar 2021, at 14:11, Ralph Castain via users mailto:users@lists.open-mpi.org> > wrote:

Hard to say - unless there is some reason, why not make it large enough to not 
be an issue? You may have to experiment a bit as there is nothing to guarantee 
that other processes aren't occupying those regions.



On Mar 18, 2021, at 2:13 AM, Sendu Bala mailto:s...@sanger.ac.uk> > wrote:

Thanks, it made it work when I was running “true” as a test, but then my real 
MPI app failed with:

[node-5-8-2][[48139,1],0][btl_tcp_component.c:966:mca_btl_tcp_component_create_listen]
 bind() failed: no port available in the range [46107..46139]
--
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[48139,1],1]) is on host: node-12-6-2
  Process 2 ([[48139,1],0]) is on host: node-5-8-2
  BTLs attempted: self tcp

Your MPI job is now going to abort; sorry.


This was when running with 16 cores, so I thought  a 32 port range would be 
fine. Is this telling me I have to make it a 33 port range, have different 
ranges for oob and btl, or that some other unrelated software is using some 
ports in my range?


(I changed my range from my previous post, because using that range resulted in 
the issue I posted about here before, where mpirun just does nothing for 5mins 
and then terminates itself, without any error messages.)


Cheers,
Sendu.


On 17 Mar 2021, at 13:25, Ralph Castain via users mailto:users@lists.open-mpi.org> > wrote:

What you are missing is that there are _two_ messaging layers in the system. 
You told the btl/tcp layer to use the specified ports, but left the oob/tcp one 
unspecified. You need to add

oob_tcp_dynamic_ipv4_ports = 46207-46239

or whatever range you want to specify

Note that if you want the btl/tcp layer to use those other settings (e.g., 
keepalive_time), then you'll need to set those as well. The names of the 
variables may not match between the layers - you'll need to use ompi_info to 
find the names and params available for each layer.


On Mar 16, 2021, at 2:43 AM, Vincent via users mailto:users@lists.open-mpi.org> > wrote:

On 09/03/2021 11:23, Sendu Bala via users wrote:
When using mpirun, how do you pick which ports are used?

I've tried:

mpirun --mca btl_tcp_port_min_v4 46207  --mca btl_tcp_port_range_v4 32 --mca 
oob_tcp_keepalive_time 45 --mca oob_tcp_max_recon_attempts 20 --mca 
oob_tcp_retry_delay  1 --mca oob_tcp_keepalive_probes 20 --mca 
oob_tcp_keepalive_intvl 10 true

And also setting similar things in openmpi/etc/openmpi-mca-params.conf :

btl_tcp_port_min_v4 = 46207
btl_tcp_port_range_v4 = 32
oob_tcp_keepalive_time = 45
oob_tcp_max_recon_attempts = 20
oob_tcp_retry_delay = 1
oob_tcp_keepalive_probes = 20
oob_tcp_keepalive_intvl = 10

But when the process is running:

ss -l -p -n | grep "pid=57642,"
tcp  LISTEN 0  128    
127.0.0.1:58439 0.0.0.0:* users:(("mpirun",pid=57642,fd=14))
tcp  LISTEN 0  128  
0.0.0.0:36253 0.0.0.0:*   users:(("mpirun",pid=57642,fd=17))

What am I doing wrong, and how do I get it to use my desired ports (and other 
settings above)?


Hello

Could this be related to some recently resolved bug ?
What version are you running ?
Having a look on https://github.com/open-mpi/ompi/issues/8304 could 
possibly be useful?


Regards

Vincent.

-- The Wellcome Sanger Institute is operated by Genome Research Limited, a 
charity registered in England with number 1021457 and a company registered in 
England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. 

Re: [OMPI users] How do you change ports used? [EXT]

2021-03-18 Thread Ralph Castain via users
Hard to say - unless there is some reason, why not make it large enough to not 
be an issue? You may have to experiment a bit as there is nothing to guarantee 
that other processes aren't occupying those regions.



On Mar 18, 2021, at 2:13 AM, Sendu Bala mailto:s...@sanger.ac.uk> > wrote:

Thanks, it made it work when I was running “true” as a test, but then my real 
MPI app failed with:

[node-5-8-2][[48139,1],0][btl_tcp_component.c:966:mca_btl_tcp_component_create_listen]
 bind() failed: no port available in the range [46107..46139]
--
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[48139,1],1]) is on host: node-12-6-2
  Process 2 ([[48139,1],0]) is on host: node-5-8-2
  BTLs attempted: self tcp

Your MPI job is now going to abort; sorry.


This was when running with 16 cores, so I thought  a 32 port range would be 
fine. Is this telling me I have to make it a 33 port range, have different 
ranges for oob and btl, or that some other unrelated software is using some 
ports in my range?


(I changed my range from my previous post, because using that range resulted in 
the issue I posted about here before, where mpirun just does nothing for 5mins 
and then terminates itself, without any error messages.)


Cheers,
Sendu.


On 17 Mar 2021, at 13:25, Ralph Castain via users mailto:users@lists.open-mpi.org> > wrote:

What you are missing is that there are _two_ messaging layers in the system. 
You told the btl/tcp layer to use the specified ports, but left the oob/tcp one 
unspecified. You need to add

oob_tcp_dynamic_ipv4_ports = 46207-46239

or whatever range you want to specify

Note that if you want the btl/tcp layer to use those other settings (e.g., 
keepalive_time), then you'll need to set those as well. The names of the 
variables may not match between the layers - you'll need to use ompi_info to 
find the names and params available for each layer.


On Mar 16, 2021, at 2:43 AM, Vincent via users mailto:users@lists.open-mpi.org> > wrote:

On 09/03/2021 11:23, Sendu Bala via users wrote:
When using mpirun, how do you pick which ports are used?

I've tried:

mpirun --mca btl_tcp_port_min_v4 46207  --mca btl_tcp_port_range_v4 32 --mca 
oob_tcp_keepalive_time 45 --mca oob_tcp_max_recon_attempts 20 --mca 
oob_tcp_retry_delay  1 --mca oob_tcp_keepalive_probes 20 --mca 
oob_tcp_keepalive_intvl 10 true

And also setting similar things in openmpi/etc/openmpi-mca-params.conf :

btl_tcp_port_min_v4 = 46207
btl_tcp_port_range_v4 = 32
oob_tcp_keepalive_time = 45
oob_tcp_max_recon_attempts = 20
oob_tcp_retry_delay = 1
oob_tcp_keepalive_probes = 20
oob_tcp_keepalive_intvl = 10

But when the process is running:

ss -l -p -n | grep "pid=57642,"
tcp  LISTEN 0  128    
127.0.0.1:58439 0.0.0.0:* users:(("mpirun",pid=57642,fd=14))
tcp  LISTEN 0  128  
0.0.0.0:36253 0.0.0.0:*   users:(("mpirun",pid=57642,fd=17))

What am I doing wrong, and how do I get it to use my desired ports (and other 
settings above)?


Hello

Could this be related to some recently resolved bug ?
What version are you running ?
Having a look on https://github.com/open-mpi/ompi/issues/8304 could 
possibly be useful?


Regards

Vincent.

-- The Wellcome Sanger Institute is operated by Genome Research Limited, a 
charity registered in England with number 1021457 and a company registered in 
England with number 2742969, whose registered office is 215 Euston Road, 
London, NW1 2BE. 



Re: [OMPI users] How do you change ports used?

2021-03-17 Thread Ralph Castain via users
What you are missing is that there are _two_ messaging layers in the system. 
You told the btl/tcp layer to use the specified ports, but left the oob/tcp one 
unspecified. You need to add

oob_tcp_dynamic_ipv4_ports = 46207-46239

or whatever range you want to specify

Note that if you want the btl/tcp layer to use those other settings (e.g., 
keepalive_time), then you'll need to set those as well. The names of the 
variables may not match between the layers - you'll need to use ompi_info to 
find the names and params available for each layer.
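
For example, to see the port-related parameters of each layer (the exact output varies by Open MPI version):

 $ ompi_info --param oob tcp --level 9 | grep port
 $ ompi_info --param btl tcp --level 9 | grep port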


> On Mar 16, 2021, at 2:43 AM, Vincent via users  
> wrote:
> 
> On 09/03/2021 11:23, Sendu Bala via users wrote:
>> When using mpirun, how do you pick which ports are used?
>> 
>> I've tried:
>> 
>> mpirun --mca btl_tcp_port_min_v4 46207  --mca btl_tcp_port_range_v4 32 --mca 
>> oob_tcp_keepalive_time 45 --mca oob_tcp_max_recon_attempts 20 --mca 
>> oob_tcp_retry_delay  1 --mca oob_tcp_keepalive_probes 20 --mca 
>> oob_tcp_keepalive_intvl 10 true
>> 
>> And also setting similar things in openmpi/etc/openmpi-mca-params.conf :
>> btl_tcp_port_min_v4 = 46207
>> btl_tcp_port_range_v4 = 32
>> oob_tcp_keepalive_time = 45
>> oob_tcp_max_recon_attempts = 20
>> oob_tcp_retry_delay = 1
>> oob_tcp_keepalive_probes = 20
>> oob_tcp_keepalive_intvl = 10
>> 
>> But when the process is running:
>> 
>> ss -l -p -n | grep "pid=57642,"
>> tcp  LISTEN 0  128
>> 127.0.0.1:58439 0.0.0.0:*   
>> users:(("mpirun",pid=57642,fd=14))
>> tcp  LISTEN 0  128  
>> 0.0.0.0:36253 0.0.0.0:*   
>> users:(("mpirun",pid=57642,fd=17))
>> 
>> What am I doing wrong, and how do I get it to use my desired ports (and 
>> other settings above)?
>> 
>> 
> Hello
> 
> Could this be related to some recently resolved bug ?
> What version are you running ?
> Having a look on https://github.com/open-mpi/ompi/issues/8304 could be 
> possibly useful?
> 
> Regards
> 
> Vincent.
> 



Re: [OMPI users] Stable and performant openMPI version for Ubuntu20.04 ?

2021-03-04 Thread Ralph Castain via users
Excuse me, but would you please ensure that you do not send mail to a mailing 
list containing this label:

[AMD Official Use Only - Internal Distribution Only]

Thank you
Ralph


On Mar 4, 2021, at 4:55 AM, Raut, S Biplab via users mailto:users@lists.open-mpi.org> > wrote:

[AMD Official Use Only - Internal Distribution Only]
 It is a single node execution, so it should be using shared memory (vader).
 With Regards,
S. Biplab Raut
 From: Heinz, Michael William mailto:michael.william.he...@cornelisnetworks.com> > 
Sent: Thursday, March 4, 2021 5:17 PM
To: Open MPI Users mailto:users@lists.open-mpi.org> >
Cc: Raut, S Biplab mailto:biplab.r...@amd.com> >
Subject: Re: [OMPI users] Stable and performant openMPI version for Ubuntu20.04 
?
 [CAUTION: External Email] 
What interconnect are you using at run time? That is, are you using Ethernet or 
InfiniBand or Omnipath?

Sent from my iPad
 
On Mar 4, 2021, at 5:05 AM, Raut, S Biplab via users mailto:users@lists.open-mpi.org> > wrote:

 
[AMD Official Use Only - Internal Distribution Only]
 After downloading a particular openMPI version, let’s say v3.1.1 from 
https://download.open-mpi.org/release/open-mpi/v3.1/openmpi-3.1.1.tar.gz, I 
follow the below steps.
./configure --prefix="$INSTALL_DIR" --enable-mpi-fortran --enable-mpi-cxx 
--enable-shared=yes --enable-static=yes --enable-mpi1-compatibility
  make -j
  make install
  export PATH=$INSTALL_DIR/bin:$PATH
  export LD_LIBRARY_PATH=$INSTALL_DIR/lib:$LD_LIBRARY_PATH
Additionally, I also install libnuma-dev on the machine.
 For all the machines having Ubuntu 18.04 and 19.04, it works correctly and 
results in expected performance/GFLOPS.
But, when OS is changed to Ubuntu 20.04, then I start getting the issues as 
mentioned in my original/previous mail below.
 With Regards,
S. Biplab Raut
 From: users mailto:users-boun...@lists.open-mpi.org> > On Behalf Of John Hearns via users

Sent: Thursday, March 4, 2021 1:53 PM
To: Open MPI Users mailto:users@lists.open-mpi.org> >
Cc: John Hearns mailto:hear...@gmail.com> >
Subject: Re: [OMPI users] Stable and performant openMPI version for Ubuntu20.04 
?
 [CAUTION: External Email] 
How are you installing the OpenMPI versions? Are you using packages which are 
distributed by the OS?
 It might be worth looking at using Easybuid or Spack
https://docs.easybuild.io/en/latest/Introduction.html
https://spack.readthedocs.io/en/latest/
  On Thu, 4 Mar 2021 at 07:35, Raut, S Biplab via users 
mailto:users@lists.open-mpi.org> > wrote:
[AMD Official Use Only - Internal Distribution Only]
 Dear Experts,
    Until recently, I was using OpenMPI 3.1.1 to run a single-node, 
128-rank MPI application on Ubuntu 18.04 and Ubuntu 19.04.
But now the OS on these machines has been upgraded to Ubuntu 20.04, and I have been 
observing program hangs with OpenMPI 3.1.1.
So I tried OpenMPI 4.0.5: the program ran properly without any 
issues, but there is a performance regression in my application.
 Could you tell me the stable OpenMPI version recommended for Ubuntu 20.04 that has no 
known regressions compared to v3.1.1?
 With Regards,
S. Biplab Raut



Re: [OMPI users] Mapping, binding and ranking

2021-03-01 Thread Ralph Castain via users
Sounds like a bug in that release - you may have to wait for OMPI v5.0 for a 
fix.

On Mar 1, 2021, at 7:43 AM, Luis Cebamanos via users mailto:users@lists.open-mpi.org> > wrote:

 
 I am afraid --map-by ppr:32:socket --bind-to core --cpu-list 0,2,4,6... 
somehow conflicts internally with other policies. I have also tried with 
--cpu-set with identical results. Probably rankfile is my only option too. 
 
 
On 28/02/2021 22:44, Ralph Castain via users wrote:
 
 The only way I know of to do what you want is 

 
 
--map-by ppr:32:socket --bind-to core --cpu-list 0,2,4,6,...
 

 
 
where you list out the exact cpus you want to use.
 

 

 
On Feb 28, 2021, at 9:58 AM, Luis Cebamanos via users mailto:users@lists.open-mpi.org> > wrote:
 
 
 
 I could do --map-by ppr:32:socket:PE=1 --bind-to core (output below) but I 
cannot see the way of mapping every 2 cores 0,2,4,
 
  [epsilon110:1489563] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: 
[BB/../../..
/../../../../../../../../../../../../../../../../../../../../../../../../../../.
./../../../../../../../../../../../../../../../../../../../../../../../../../../
../../../../../../..][../../../../../../../../../../../../../../../../../../../.
./../../../../../../../../../../../../../../../../../../../../../../../../../../
 ../../../../../../../../../../../../../../../../../..]
 [epsilon110:1489563] MCW rank 1 bound to socket 0[core 1[hwt 0-1]]: 
[../BB/../..
/../../../../../../../../../../../../../../../../../../../../../../../../../../.
./../../../../../../../../../../../../../../../../../../../../../../../../../../
../../../../../../..][../../../../../../../../../../../../../../../../../../../.
./../../../../../../../../../../../../../../../../../../../../../../../../../../
 ../../../../../../../../../../../../../../../../../..]
 
 
On 28/02/2021 16:24, Ralph Castain via users wrote:
 
 Did you read the documentation on rankfile? The "slot=N" directive says to 
"put this proc on core N". In your file, you stipulate that 

 
 
rank 0 is to be placed solely on core 0
 
rank 1 is to be placed solely on core 2
 
etc.
 

 
 
That is not what you asked for in your mpirun cmd. You asked that each proc be 
mapped to TWO cores (PE=2) or FOUR threads (PE=4 with bind-to HWT). If you 
wanted that same thing in a rankfile, it should have said
 

 
 
rank 0 slots=0-1
 
rank 1 slots=2-3
 
etc.
 

 
 
Hence the difference. I was simply correcting your mpirun cmd line as you said 
you wanted two CORES, and that isn't guaranteed if you are stipulating things 
in terms of HWTs as not every machine has two HWTs/core.
 

 
 

 

 
On Feb 28, 2021, at 7:43 AM, Luis Cebamanos via users mailto:users@lists.open-mpi.org> > wrote:
 
 
 
 Hi Ralph,
 
 Thanks for this, however --map-by ppr:32:socket:PE=2 --bind-to core reports 
the same binding than --map-by ppr:32:socket:PE=4 --bind-to hwthread:
 
 [epsilon104:2861230] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket 
0[core 1[hwt 0-1]]: [BB/BB/../../../../
../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../
../../../../../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../..
/../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..
 /../../../../../../../..]
 [epsilon104:2861230] MCW rank 1 bound to socket 0[core 2[hwt 0-1]], socket 
0[core 3[hwt 0-1]]: [../../BB/BB/../../
../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../
../../../../../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../..
/../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..
 /../../../../../../../..]
 [epsilon104:2861230] MCW rank 2 bound to socket 0[core 4[hwt 0-1]], socket 
0[core 5[hwt 0-1]]: [../../../../BB/BB/
../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../
../../../../../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../..
/../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..
 /../../../../../../../..]
 
 And this is still different from the output produce using the rankfile.

 
 Cheers,
 Luis
 
 
On 28/02/2021 14:06, Ralph Castain via users wrote:
 
 Your command line is incorrect: 

 
 
--map-by ppr:32:socket:PE=4 --bind-to hwthread
 

 
 
should be
 

 
 
--map-by ppr:32:socket:PE=2 --bind-to core
 

 
 

 

 
On Feb 28, 2021, at 5:57 AM, Luis Cebamanos via users mailto:users@lists.open-mpi.org> > wrote:
 
 
 
 
I should have said, "I would like to run 128 MPI processes on 2 nodes" and not 
64 like I initially said...
 
 
 
On Sat, 27 Feb 2021, 15:03 Luis Cebamanos, mailto:luic...@gmail

Re: [OMPI users] Mapping, binding and ranking

2021-03-01 Thread Ralph Castain via users
Listing each rank - yes, I'm afraid so. Someone might someday choose to extend 
cpu-list to do what you seek, but I don't see any alternative to rankfile at 
this time.
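
For the case described above (2 procs per node, the first bound to one core and the second to all the rest), a hedged rankfile sketch would be something like the following, assuming hostnames node01/node02 and 32 cores per node; adjust the core ranges to your topology:

 $ cat myrankfile
 rank 0=node01 slot=0
 rank 1=node01 slot=1-31
 rank 2=node02 slot=0
 rank 3=node02 slot=1-31
 $ mpirun -np 4 --rankfile myrankfile ./my_app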

On Mar 1, 2021, at 7:32 AM, John R Cary via users mailto:users@lists.open-mpi.org> > wrote:

 
 Thanks, Ralph.  So then I need a rankfile listing all the hosts?

 
 John
 
 
On 3/1/21 10:26 AM, Ralph Castain via users wrote:
 
 I'm afraid not - you have simply told us that all cpus are available. I don't 
know of any way to accomplish what John wants other than with a rankfile.
 

 
On Mar 1, 2021, at 7:13 AM, Luis Cebamanos via users mailto:users@lists.open-mpi.org> > wrote:
 
 
 
 Hi John,
 
 I would be interested to know if that does what you are expecting...
 
 
On 01/03/2021 00:02, John R Cary via users wrote:
 
 I've been watching this exchange with interest, because it is the
 closest I have seen to what I want, but I want something slightly
 different: 2 processes per node, with the first one bound to one core,
 and the second bound to all the rest, with no use of hyperthreads.
 
 Would this be
 
 --map-by ppr:2:node --bind-to core --cpu-list 0,1-31
 
 ?
 
 Thx
 
 
 
On 2/28/21 5:44 PM, Ralph Castain via users wrote:
 
 The only way I know of to do what you want is 

 
 
--map-by ppr:32:socket --bind-to core --cpu-list 0,2,4,6,...
 

 
 
where you list out the exact cpus you want to use.
 

 

 
On Feb 28, 2021, at 9:58 AM, Luis Cebamanos via users mailto:users@lists.open-mpi.org> > wrote:
 
 
 
 I could do --map-by ppr:32:socket:PE=1 --bind-to core (output below) but I 
cannot see the way of mapping every 2 cores 0,2,4,
 
  [epsilon110:1489563] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: 
[BB/../../..
/../../../../../../../../../../../../../../../../../../../../../../../../../../.
./../../../../../../../../../../../../../../../../../../../../../../../../../../
../../../../../../..][../../../../../../../../../../../../../../../../../../../.
./../../../../../../../../../../../../../../../../../../../../../../../../../../
../../../../../../../../../../../../../../../../../..]
 [epsilon110:1489563] MCW rank 1 bound to socket 0[core 1[hwt 0-1]]: 
[../BB/../..
/../../../../../../../../../../../../../../../../../../../../../../../../../../.
./../../../../../../../../../../../../../../../../../../../../../../../../../../
../../../../../../..][../../../../../../../../../../../../../../../../../../../.
./../../../../../../../../../../../../../../../../../../../../../../../../../../
../../../../../../../../../../../../../../../../../..]
 
 
On 28/02/2021 16:24, Ralph Castain via users wrote:
 
 Did you read the documentation on rankfile? The "slot=N" directive says to 
"put this proc on core N". In your file, you stipulate that 

 
 
rank 0 is to be placed solely on core 0
 
rank 1 is to be placed solely on core 2
 
etc.
 

 
 
That is not what you asked for in your mpirun cmd. You asked that each proc be 
mapped to TWO cores (PE=2) or FOUR threads (PE=4 with bind-to HWT). If you 
wanted that same thing in a rankfile, it should have said
 

 
 
rank 0 slots=0-1
 
rank 1 slots=2-3
 
etc.
 

 
 
Hence the difference. I was simply correcting your mpirun cmd line as you said 
you wanted two CORES, and that isn't guaranteed if you are stipulating things 
in terms of HWTs as not every machine has two HWTs/core.
 

 
 

 

 
On Feb 28, 2021, at 7:43 AM, Luis Cebamanos via users mailto:users@lists.open-mpi.org> > wrote:
 
 
 
 Hi Ralph,
 
 Thanks for this, however --map-by ppr:32:socket:PE=2 --bind-to core reports 
the same binding than --map-by ppr:32:socket:PE=4 --bind-to hwthread:
 
 [epsilon104:2861230] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket 
0[core 1[hwt 0-1]]: [BB/BB/../../../../
../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../
../../../../../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../..
/../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..
 /../../../../../../../..]
 [epsilon104:2861230] MCW rank 1 bound to socket 0[core 2[hwt 0-1]], socket 
0[core 3[hwt 0-1]]: [../../BB/BB/../../
../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../
../../../../../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../..
/../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..
 /../../../../../../../..]
 [epsilon104:2861230] MCW rank 2 bound to

Re: [OMPI users] Mapping, binding and ranking

2021-03-01 Thread Ralph Castain via users
I'm afraid not - you have simply told us that all cpus are available. I don't 
know of any way to accomplish what John wants other than with a rankfile.

On Mar 1, 2021, at 7:13 AM, Luis Cebamanos via users mailto:users@lists.open-mpi.org> > wrote:

 
 Hi John,
 
 I would be interested to know if that does what you are expecting...
 
 
On 01/03/2021 00:02, John R Cary via users wrote:
 
 I've been watching this exchange with interest, because it is the
 closest I have seen to what I want, but I want something slightly
 different: 2 processes per node, with the first one bound to one core,
 and the second bound to all the rest, with no use of hyperthreads.
 
 Would this be
 
 --map-by ppr:2:node --bind-to core --cpu-list 0,1-31
 
 ?
 
 Thx
 
 
 
On 2/28/21 5:44 PM, Ralph Castain via users wrote:
 
 The only way I know of to do what you want is 

 
 
--map-by ppr:32:socket --bind-to core --cpu-list 0,2,4,6,...
 

 
 
where you list out the exact cpus you want to use.
 

 

 
On Feb 28, 2021, at 9:58 AM, Luis Cebamanos via users mailto:users@lists.open-mpi.org> > wrote:
 
 
 
 I could do --map-by ppr:32:socket:PE=1 --bind-to core (output below) but I 
cannot see the way of mapping every 2 cores 0,2,4,
 
  [epsilon110:1489563] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: 
[BB/../../..
/../../../../../../../../../../../../../../../../../../../../../../../../../../.
./../../../../../../../../../../../../../../../../../../../../../../../../../../
../../../../../../..][../../../../../../../../../../../../../../../../../../../.
./../../../../../../../../../../../../../../../../../../../../../../../../../../
 ../../../../../../../../../../../../../../../../../..]
 [epsilon110:1489563] MCW rank 1 bound to socket 0[core 1[hwt 0-1]]: 
[../BB/../..
/../../../../../../../../../../../../../../../../../../../../../../../../../../.
./../../../../../../../../../../../../../../../../../../../../../../../../../../
../../../../../../..][../../../../../../../../../../../../../../../../../../../.
./../../../../../../../../../../../../../../../../../../../../../../../../../../
 ../../../../../../../../../../../../../../../../../..]
 
 
On 28/02/2021 16:24, Ralph Castain via users wrote:
 
 Did you read the documentation on rankfile? The "slot=N" directive says to 
"put this proc on core N". In your file, you stipulate that 

 
 
rank 0 is to be placed solely on core 0
 
rank 1 is to be placed solely on core 2
 
etc.
 

 
 
That is not what you asked for in your mpirun cmd. You asked that each proc be 
mapped to TWO cores (PE=2) or FOUR threads (PE=4 with bind-to HWT). If you 
wanted that same thing in a rankfile, it should have said
 

 
 
rank 0 slots=0-1
 
rank 1 slots=2-3
 
etc.
 

 
 
Hence the difference. I was simply correcting your mpirun cmd line as you said 
you wanted two CORES, and that isn't guaranteed if you are stipulating things 
in terms of HWTs as not every machine has two HWTs/core.
 

 
 

 

 
On Feb 28, 2021, at 7:43 AM, Luis Cebamanos via users mailto:users@lists.open-mpi.org> > wrote:
 
 
 
 Hi Ralph,
 
 Thanks for this; however, --map-by ppr:32:socket:PE=2 --bind-to core reports 
the same binding as --map-by ppr:32:socket:PE=4 --bind-to hwthread:
 
 [epsilon104:2861230] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket 
0[core 1[hwt 0-1]]: [BB/BB/../../../../
../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../
../../../../../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../..
/../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..
 /../../../../../../../..]
 [epsilon104:2861230] MCW rank 1 bound to socket 0[core 2[hwt 0-1]], socket 
0[core 3[hwt 0-1]]: [../../BB/BB/../../
../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../
../../../../../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../..
/../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..
 /../../../../../../../..]
 [epsilon104:2861230] MCW rank 2 bound to socket 0[core 4[hwt 0-1]], socket 
0[core 5[hwt 0-1]]: [../../../../BB/BB/
../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../
../../../../../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../..
/../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..
 /../../../../../../../..]
 
 And this is still different from the output produced using the rankfile.

 
 Cheers,
 Luis
 
 
On 28/02/2021 14:06, Ralph Castain via users wrote:
 
 Your command line is incorrect: 

 
 
--map-by ppr:32:socket:PE=4 --bind-to hwthread
 

 
 
should be

--map-by ppr:32:socket:PE=2 --bind-to core

Re: [OMPI users] Mapping, binding and ranking

2021-02-28 Thread Ralph Castain via users
The only way I know of to do what you want is

--map-by ppr:32:socket --bind-to core --cpu-list 0,2,4,6,...

where you list out the exact cpus you want to use.
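(As an untested aside, the "..." can also be generated rather than typed out; for
the 128-core nodes discussed in this thread, something along the lines of

mpirun -np 128 --map-by ppr:32:socket --bind-to core --cpu-list "$(seq -s, 0 2 126)" --report-bindings ./myapp

where ./myapp is a placeholder and --report-bindings is there only to verify the
result.)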


On Feb 28, 2021, at 9:58 AM, Luis Cebamanos via users mailto:users@lists.open-mpi.org> > wrote:

 
 I could do --map-by ppr:32:socket:PE=1 --bind-to core (output below) but I 
cannot see the way of mapping every 2 cores 0,2,4,
 
  [epsilon110:1489563] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: 
[BB/../../..
/../../../../../../../../../../../../../../../../../../../../../../../../../../.
./../../../../../../../../../../../../../../../../../../../../../../../../../../
../../../../../../..][../../../../../../../../../../../../../../../../../../../.
./../../../../../../../../../../../../../../../../../../../../../../../../../../
 ../../../../../../../../../../../../../../../../../..]
 [epsilon110:1489563] MCW rank 1 bound to socket 0[core 1[hwt 0-1]]: 
[../BB/../..
/../../../../../../../../../../../../../../../../../../../../../../../../../../.
./../../../../../../../../../../../../../../../../../../../../../../../../../../
../../../../../../..][../../../../../../../../../../../../../../../../../../../.
./../../../../../../../../../../../../../../../../../../../../../../../../../../
 ../../../../../../../../../../../../../../../../../..]
 
 
On 28/02/2021 16:24, Ralph Castain via users wrote:
 
 Did you read the documentation on rankfile? The "slot=N" directive says to 
"put this proc on core N". In your file, you stipulate that 

 
 
rank 0 is to be placed solely on core 0
 
rank 1 is to be placed solely on core 2
 
etc.
 

 
 
That is not what you asked for in your mpirun cmd. You asked that each proc be 
mapped to TWO cores (PE=2) or FOUR threads (PE=4 with bind-to HWT). If you 
wanted that same thing in a rankfile, it should have said
 

 
 
rank 0 slots=0-1
 
rank 1 slots=2-3
 
etc.
 

 
 
Hence the difference. I was simply correcting your mpirun cmd line as you said 
you wanted two CORES, and that isn't guaranteed if you are stipulating things 
in terms of HWTs as not every machine has two HWTs/core.
 

 
 

 

 
On Feb 28, 2021, at 7:43 AM, Luis Cebamanos via users mailto:users@lists.open-mpi.org> > wrote:
 
 
 
 Hi Ralph,
 
 Thanks for this; however, --map-by ppr:32:socket:PE=2 --bind-to core reports 
the same binding as --map-by ppr:32:socket:PE=4 --bind-to hwthread:
 
 [epsilon104:2861230] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket 
0[core 1[hwt 0-1]]: [BB/BB/../../../../
../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../
../../../../../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../..
/../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..
 /../../../../../../../..]
 [epsilon104:2861230] MCW rank 1 bound to socket 0[core 2[hwt 0-1]], socket 
0[core 3[hwt 0-1]]: [../../BB/BB/../../
../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../
../../../../../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../..
/../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..
 /../../../../../../../..]
 [epsilon104:2861230] MCW rank 2 bound to socket 0[core 4[hwt 0-1]], socket 
0[core 5[hwt 0-1]]: [../../../../BB/BB/
../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../
../../../../../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../..
/../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..
 /../../../../../../../..]
 
 And this is still different from the output produced using the rankfile.

 
 Cheers,
 Luis
 
 
On 28/02/2021 14:06, Ralph Castain via users wrote:
 
 Your command line is incorrect: 

 
 
--map-by ppr:32:socket:PE=4 --bind-to hwthread
 

 
 
should be
 

 
 
--map-by ppr:32:socket:PE=2 --bind-to core
 

 
 

 

 
On Feb 28, 2021, at 5:57 AM, Luis Cebamanos via users mailto:users@lists.open-mpi.org> > wrote:
 
 
 
 
I should have said, "I would like to run 128 MPI processes on 2 nodes" and not 
64 like I initially said...
 
 
 
On Sat, 27 Feb 2021, 15:03 Luis Cebamanos, mailto:luic...@gmail.com> > wrote:
 
 Hello OMPI users,
 
 On 128 core nodes, 2 sockets x 64 cores/socket (2 hwthreads/core) , I am 
 trying to match the behavior of running with a rankfile with manual 
 mapping/ranking/binding.
 
 I would like to run 64 MPI processes on 2 nodes, 1 MPI process every 2 
 cores. This is, I want to run 32 MPI processes per socket on 2 128-core 
 nodes. My mapping should be something like:
 
 Node 0
 =
 rank 0  -  core 0
 rank 1  -  core 2
 rank 2 -   core 4
 ...
 rank 6

Re: [OMPI users] Mapping, binding and ranking

2021-02-28 Thread Ralph Castain via users
Did you read the documentation on rankfile? The "slot=N" directive says to 
"put this proc on core N". In your file, you stipulate that

rank 0 is to be placed solely on core 0
rank 1 is to be placed solely on core 2
etc.

That is not what you asked for in your mpirun cmd. You asked that each proc be 
mapped to TWO cores (PE=2) or FOUR threads (PE=4 with bind-to HWT). If you 
wanted that same thing in a rankfile, it should have said

rank 0 slots=0-1
rank 1 slots=2-3
etc.

Hence the difference. I was simply correcting your mpirun cmd line as you said 
you wanted two CORES, and that isn't guaranteed if you are stipulating things 
in terms of HWTs as not every machine has two HWTs/core.
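Spelled out for the two 128-core hosts used earlier in this thread, an untested
sketch of such a rankfile would be (each line still names the host, and the
keyword is "slot" as in the original file):

rank 0=epsilon102 slot=0-1
rank 1=epsilon102 slot=2-3
rank 2=epsilon102 slot=4-5
...
rank 64=epsilon103 slot=0-1
rank 65=epsilon103 slot=2-3
...
rank 127=epsilon103 slot=126-127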



On Feb 28, 2021, at 7:43 AM, Luis Cebamanos via users mailto:users@lists.open-mpi.org> > wrote:

 
 Hi Ralph,
 
 Thanks for this; however, --map-by ppr:32:socket:PE=2 --bind-to core reports 
the same binding as --map-by ppr:32:socket:PE=4 --bind-to hwthread:
 
 [epsilon104:2861230] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket 
0[core 1[hwt 0-1]]: [BB/BB/../../../../
../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../
../../../../../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../..
/../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..
 /../../../../../../../..]
 [epsilon104:2861230] MCW rank 1 bound to socket 0[core 2[hwt 0-1]], socket 
0[core 3[hwt 0-1]]: [../../BB/BB/../../
../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../
../../../../../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../..
/../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..
 /../../../../../../../..]
 [epsilon104:2861230] MCW rank 2 bound to socket 0[core 4[hwt 0-1]], socket 
0[core 5[hwt 0-1]]: [../../../../BB/BB/
../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../
../../../../../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../..
/../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..
 /../../../../../../../..]
 
 And this is still different from the output produced using the rankfile.

 
 Cheers,
 Luis
 
 
On 28/02/2021 14:06, Ralph Castain via users wrote:
 
 Your command line is incorrect: 

 
 
--map-by ppr:32:socket:PE=4 --bind-to hwthread
 

 
 
should be
 

 
 
--map-by ppr:32:socket:PE=2 --bind-to core
 

 
 

 

 
On Feb 28, 2021, at 5:57 AM, Luis Cebamanos via users mailto:users@lists.open-mpi.org> > wrote:
 
 
 
 
I should have said, "I would like to run 128 MPI processes on 2 nodes" and not 
64 like I initially said...
 
 
 
On Sat, 27 Feb 2021, 15:03 Luis Cebamanos, mailto:luic...@gmail.com> > wrote:
 
 Hello OMPI users,
 
 On 128 core nodes, 2 sockets x 64 cores/socket (2 hwthreads/core) , I am 
 trying to match the behavior of running with a rankfile with manual 
 mapping/ranking/binding.
 
 I would like to run 64 MPI processes on 2 nodes, 1 MPI process every 2 
 cores. This is, I want to run 32 MPI processes per socket on 2 128-core 
 nodes. My mapping should be something like:
 
 Node 0
 =
 rank 0  -  core 0
 rank 1  -  core 2
 rank 2 -   core 4
 ...
 rank 63 - core 126
 
 
 Node 1
 
 rank 64  -  core 0
 rank 65  -  core 2
 rank 66 -   core 4
 ...
 rank 127- core 126
 
 If I use a rankfile:
 rank 0=epsilon102 slot=0
 rank 1=epsilon102 slot=2
 rank 2=epsilon102 slot=4
 rank 3=epsilon102 slot=6
 rank 4=epsilon102 slot=8
 rank 5=epsilon102 slot=10
 
 rank 123=epsilon103 slot=118
 rank 124=epsilon103 slot=120
 rank 125=epsilon103 slot=122
 rank 126=epsilon103 slot=124
 rank 127=epsilon103 slot=126
 
 My --report-binding looks like:
 
 [epsilon102:2635370] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: 
 [BB/../../..
/../../../../../../../../../../../../../../../../../../../../../../../../../../.
./../../../../../../../../../../../../../../../../../../../../../../../../../../
../../../../../../..][../../../../../../../../../../../../../../../../../../../.
./../../../../../../../../../../../../../../../../../../../../../../../../../../
../../../../../../../../../../../../../../../../../..]
 [epsilon102:2635370] MCW rank 1 bound to socket 0[core 2[hwt 0-1]]: 
 [../../BB/..
/../../../../../../../../../../../../../../../../../../../../../../../../../../.
./../../../../../../../../../../../../../../../../../../../../../../../../../../
../../../../../../..][../../../../../../../../../../../../../../../../../../../.
./../../../../../../../../../../../../../../../../../../../../../../../../../../
../../../../../../../../../../../../../../../../../..]
 [epsilon102

Re: [OMPI users] Mapping, binding and ranking

2021-02-28 Thread Ralph Castain via users
Your command line is incorrect:

--map-by ppr:32:socket:PE=4 --bind-to hwthread

should be

--map-by ppr:32:socket:PE=2 --bind-to core



On Feb 28, 2021, at 5:57 AM, Luis Cebamanos via users mailto:users@lists.open-mpi.org> > wrote:

I should have said, "I would like to run 128 MPI processes on 2 nodes" and not 
64 like I initially said...

On Sat, 27 Feb 2021, 15:03 Luis Cebamanos, mailto:luic...@gmail.com> > wrote:
Hello OMPI users,

On 128 core nodes, 2 sockets x 64 cores/socket (2 hwthreads/core) , I am 
trying to match the behavior of running with a rankfile with manual 
mapping/ranking/binding.

I would like to run 64 MPI processes on 2 nodes, 1 MPI process every 2 
cores. This is, I want to run 32 MPI processes per socket on 2 128-core 
nodes. My mapping should be something like:

Node 0
=
rank 0  -  core 0
rank 1  -  core 2
rank 2 -   core 4
...
rank 63 - core 126


Node 1

rank 64  -  core 0
rank 65  -  core 2
rank 66 -   core 4
...
rank 127- core 126

If I use a rankfile:
rank 0=epsilon102 slot=0
rank 1=epsilon102 slot=2
rank 2=epsilon102 slot=4
rank 3=epsilon102 slot=6
rank 4=epsilon102 slot=8
rank 5=epsilon102 slot=10

rank 123=epsilon103 slot=118
rank 124=epsilon103 slot=120
rank 125=epsilon103 slot=122
rank 126=epsilon103 slot=124
rank 127=epsilon103 slot=126

My --report-binding looks like:

[epsilon102:2635370] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: 
[BB/../../..
/../../../../../../../../../../../../../../../../../../../../../../../../../../.
./../../../../../../../../../../../../../../../../../../../../../../../../../../
../../../../../../..][../../../../../../../../../../../../../../../../../../../.
./../../../../../../../../../../../../../../../../../../../../../../../../../../
../../../../../../../../../../../../../../../../../..]
[epsilon102:2635370] MCW rank 1 bound to socket 0[core 2[hwt 0-1]]: 
[../../BB/..
/../../../../../../../../../../../../../../../../../../../../../../../../../../.
./../../../../../../../../../../../../../../../../../../../../../../../../../../
../../../../../../..][../../../../../../../../../../../../../../../../../../../.
./../../../../../../../../../../../../../../../../../../../../../../../../../../
../../../../../../../../../../../../../../../../../..]
[epsilon102:2635370] MCW rank 2 bound to socket 0[core 4[hwt 0-1]]: 
[../../../..
/BB/../../../../../../../../../../../../../../../../../../../../../../../../../.
./../../../../../../../../../../../../../../../../../../../../../../../../../../
../../../../../../..][../../../../../../../../../../../../../../../../../../../.
./../../../../../../../../../../../../../../../../../../../../../../../../../../
../../../../../../../../../../../../../../../../../..]


However, I cannot match this report-binding output by manually using 
--map-by and --bind-to. I had the impression that this will be the same:


mpirun -np $SLURM_NTASKS  --report-bindings --map-by ppr:32:socket:PE=4 
--bind-to hwthread

But this output is not quite the same:

[epsilon102:2631529] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], 
socket 0[cor
e 1[hwt 0-1]]: 
[BB/BB/../../../../../../../../../../../../../../../../../../../.
./../../../../../../../../../../../../../../../../../../../../../../../../../../
../../../../../../../../../../../../../../../..][../../../../../../../../../../.
./../../../../../../../../../../../../../../../../../../../../../../../../../../
../../../../../../../../../../../../../../../../../../../../../../../../../../..]
[epsilon102:2631529] MCW rank 1 bound to socket 0[core 2[hwt 0-1]], 
socket 0[cor
e 3[hwt 0-1]]: 
[../../BB/BB/../../../../../../../../../../../../../../../../../.
./../../../../../../../../../../../../../../../../../../../../../../../../../../
../../../../../../../../../../../../../../../..][../../../../../../../../../../.
./../../../../../../../../../../../../../../../../../../../../../../../../../../
../../../../../../../../../../../../../../../../../../../../../../../../../../..]

What am I missing to match the rankfile behavior? Regarding performance, 
what difference does it make between the first and the second outputs?


Thanks for your help!
Luis



Re: [OMPI users] Binding blocks of processes in round-robin fashion

2021-01-29 Thread Ralph Castain via users
Okay, I can't promise when I'll get to it, but I'll try to have it in time for 
OMPI v5.


On Jan 29, 2021, at 1:30 AM, Luis Cebamanos via users mailto:users@lists.open-mpi.org> > wrote:

 
 Hi Ralph,
 
 It would be great to have it for load balancing issues. Ideally one could do 
something like --bind-to:N where N is the block size, 4 in this case.
 
 mpirun -np 40  --map-by ppr:40:node  --bind-to core:4  
 
 I think it would be interesting to have it. Of course, I can always use srun 
but not all system run Slurm.
 
 
 
Of course, you could fake it out even today by breaking it into multiple 
app-contexts on the cmd line. Something like this (again, shortening it to just 
two nodes):
 

 
 
mpirun --map-by node --rank-by slot --bind-to core --np 8 myapp : --np 8 myapp 
: --np 8 myapp : --np 8 myapp : --np 8 myapp
 

 
 It is a valid option, though tedious for a large number of nodes. 
 
 Thanks!
 



Re: [OMPI users] Binding blocks of processes in round-robin fashion

2021-01-28 Thread Ralph Castain via users
Hmmm...well, the proc distribution is easy as you would just --map-by node. The 
tricky thing is assigning the ranks in the pattern you desire. We definitely 
don't have that pattern in our ranking algo today, though it wouldn't be hard 
to add.

However, that wouldn't be available until OMPI v5 was released later this year. 
Are you going to be using this long-term enough to warrant a dedicated option?

Of course, you could fake it out even today by breaking it into multiple 
app-contexts on the cmd line. Something like this (again, shortening it to just 
two nodes):

mpirun --map-by node --rank-by slot --bind-to core --np 8 myapp : --np 8 myapp 
: --np 8 myapp : --np 8 myapp : --np 8 myapp

The way our mapper currently works, it will process the app-contexts in order. 
So I _think_ this will get what you want - might be worth a try. Kinda ugly, I 
know - but it might work, and all the app-contexts wind up in MPI_COMM_WORLD.
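For a larger number of nodes the command line could also be generated rather than
typed by hand — a rough, untested sketch assuming bash, 10 app-contexts of 8
ranks each, and a placeholder executable ./myapp:

ctx=""
for i in $(seq 1 10); do ctx="$ctx : -np 8 ./myapp"; done
mpirun --map-by node --rank-by slot --bind-to core ${ctx# : }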


On Jan 28, 2021, at 3:18 PM, Luis Cebamanos via users mailto:users@lists.open-mpi.org> > wrote:

That's right Ralph!

On 28/01/2021 23:13, Ralph Castain via users wrote:
Trying to wrap my head around this, so let me try a 2-node example. You want 
(each rank bound to a single core):

ranks 0-3 to be mapped onto node1
ranks 4-7 to be mapped onto node2
ranks 8-11 to be mapped onto node1
ranks 12-15 to be mapped onto node2
etc.etc.

Correct?

On Jan 28, 2021, at 3:00 PM, Luis Cebamanos via users mailto:users@lists.open-mpi.org> > wrote:

Hello all,

What are the options for binding MPI tasks in blocks of cores per 
node/socket/numa in a round-robin fashion? Say I want to fully populate 40-core 
sockets on dual-socket nodes but in a round-robin fashion, binding 4 cores on 
the first node, then 4 cores on the next, and so on.  Would that be ``--bind-to 
core``?

srun can do this with ``distribution=plane`` so one could do ``srun 
--distribution=plane=4``.

cheers







Re: [OMPI users] Binding blocks of processes in round-robin fashion

2021-01-28 Thread Ralph Castain via users
Trying to wrap my head around this, so let me try a 2-node example. You want 
(each rank bound to a single core):

ranks 0-3 to be mapped onto node1
ranks 4-7 to be mapped onto node2
ranks 8-11 to be mapped onto node1
ranks 12-15 to be mapped onto node2
etc.etc.

Correct?

> On Jan 28, 2021, at 3:00 PM, Luis Cebamanos via users 
>  wrote:
> 
> Hello all,
> 
> What are the options for binding MPI tasks in blocks of cores per 
> node/socket/numa in a round-robin fashion? Say I want to fully populate 40 
> core sockets on dual-socket nodes but in a round-robin fashion binding 4 
> cores on the first node, then 4 cores on the next, and so on.  Would that be 
> ``--bind-to core``?
> 
> srun can do this with ``distribution=plane`` so one could do ``srun 
> --distribution=plane=4``.
> 
> cheers
> 
> 




Re: [OMPI users] MCA parameter "orte_base_help_aggregate"

2021-01-25 Thread Ralph Castain via users
There should have been an error message right above that - all this is saying 
is that the same error message was output by 7 more processes besides the one 
that was output. It then indicates that process 3 (which has pid 0?) was killed.

Looking at the help message tag, it looks like no NICs were found on the host. 
You might want to post the full error output.
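(For completeness: the parameter can be set either on the mpirun command line,
e.g.

mpirun --mca orte_base_help_aggregate 0 -np 4 ./your_app

where ./your_app is a placeholder, or in the environment via
"export OMPI_MCA_orte_base_help_aggregate=0" before launching.)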

> On Jan 25, 2021, at 10:25 AM, Paul Cizmas via users 
>  wrote:
> 
> Hello:
> 
> I am testing a rather large code on several computers.  It works fine on all 
> except for a Linux Pop!_OS machine.  I tried both OpenMPI 2.1.1 and 4.0.5.  I 
> fear there is an issue because of the Pop!_OS but before I contact System76 I 
> would like to explore things further.
> 
> I get the following message while running the code on a box called jp1:
> 
> [jp1:3331418] 7 more processes have sent help message help-mpi-btl-base.txt / 
> btl:no-nics
> [jp1:3331418] Set MCA parameter "orte_base_help_aggregate" to 0 to see all 
> help / error messages
> 
> and then
> 
> mpirun noticed that process rank 3 with PID 0 on node jp1 exited on signal 9 
> (Killed).
> 
> It seems I should set this MCA parameter "orte_base_help_aggregate" to 0 in 
> order to see the error messages.
> 
> How can I do this?  I suppose I should do it before running the code.  Is 
> this correct?
> 
> Thank you,
> Paul




Re: [OMPI users] OpenMPI 4.0.5 error with Omni-path

2021-01-25 Thread Ralph Castain via users
I think you mean add "--mca mtl ofi" to the mpirun cmd line
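i.e., something along the lines of (borrowing the test program name from the
quoted message below):

mpirun --mca mtl ofi -np 4 ./test_layout_array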


> On Jan 25, 2021, at 10:18 AM, Heinz, Michael William via users 
>  wrote:
> 
> What happens if you specify -mtl ofi ?
> 
> -Original Message-
> From: users  On Behalf Of Patrick Begou via 
> users
> Sent: Monday, January 25, 2021 12:54 PM
> To: users@lists.open-mpi.org
> Cc: Patrick Begou 
> Subject: Re: [OMPI users] OpenMPI 4.0.5 error with Omni-path
> 
> Hi Howard and Michael,
> 
> thanks for your feedback. I did not want to write a too-long mail with 
> non-pertinent information, so I just show how the two different builds give 
> different result. I'm using a small test case based on my large code, the 
> same used to show the memory leak with mpi_Alltoallv calls, but just running 
> 2 iterations. It is a 2D case and data storage is moved from distributions 
> "along X axis" to "along Y axis" with mpi_Alltoallv and subarrays types. 
> Datas initialization is based on the location in the array to allow checking 
> for correct exchanges.
> 
> When the program runs (on 4 processes in my test) it must only show the max 
> rss size of the processes. When it fails it shows the invalid locations. I've 
> drastically reduced the size of the problem with nx=5 and ny=7.
> 
> Launching the non working setup with more details show:
> 
> dahu138 : mpirun -np 4 -mca mtl_base_verbose 99 ./test_layout_array 
> [dahu138:115761] mca: base: components_register: registering framework mtl 
> components [dahu138:115763] mca: base: components_register: registering 
> framework mtl components [dahu138:115763] mca: base: components_register: 
> found loaded component psm2 [dahu138:115763] mca: base: components_register: 
> component psm2 register function successful [dahu138:115763] mca: base: 
> components_open: opening mtl components [dahu138:115763] mca: base: 
> components_open: found loaded component psm2 [dahu138:115761] mca: base: 
> components_register: found loaded component psm2 [dahu138:115763] mca: base: 
> components_open: component psm2 open function successful [dahu138:115761] 
> mca: base: components_register: component psm2 register function successful 
> [dahu138:115761] mca: base: components_open: opening mtl components 
> [dahu138:115761] mca: base: components_open: found loaded component psm2 
> [dahu138:115761] mca: base: components_open: component psm2 open function 
> successful [dahu138:115760] mca: base: components_register: registering 
> framework mtl components [dahu138:115760] mca: base: components_register: 
> found loaded component psm2 [dahu138:115760] mca: base: components_register: 
> component psm2 register function successful [dahu138:115760] mca: base: 
> components_open: opening mtl components [dahu138:115760] mca: base: 
> components_open: found loaded component psm2 [dahu138:115762] mca: base: 
> components_register: registering framework mtl components [dahu138:115762] 
> mca: base: components_register: found loaded component psm2 [dahu138:115760] 
> mca: base: components_open: component psm2 open function successful 
> [dahu138:115762] mca: base: components_register: component psm2 register 
> function successful [dahu138:115762] mca: base: components_open: opening mtl 
> components [dahu138:115762] mca: base: components_open: found loaded 
> component psm2 [dahu138:115762] mca: base: components_open: component psm2 
> open function successful [dahu138:115760] mca:base:select: Auto-selecting mtl 
> components [dahu138:115760] mca:base:select:(  mtl) Querying component [psm2] 
> [dahu138:115760] mca:base:select:(  mtl) Query of component [psm2] set 
> priority to 40 [dahu138:115761] mca:base:select: Auto-selecting mtl 
> components [dahu138:115762] mca:base:select: Auto-selecting mtl components 
> [dahu138:115762] mca:base:select:(  mtl) Querying component [psm2] 
> [dahu138:115762] mca:base:select:(  mtl) Query of component [psm2] set 
> priority to 40 [dahu138:115762] mca:base:select:(  mtl) Selected component 
> [psm2] [dahu138:115762] select: initializing mtl component psm2 
> [dahu138:115761] mca:base:select:(  mtl) Querying component [psm2] 
> [dahu138:115761] mca:base:select:(  mtl) Query of component [psm2] set 
> priority to 40 [dahu138:115761] mca:base:select:(  mtl) Selected component 
> [psm2] [dahu138:115761] select: initializing mtl component psm2 
> [dahu138:115760] mca:base:select:(  mtl) Selected component [psm2] 
> [dahu138:115760] select: initializing mtl component psm2 [dahu138:115763] 
> mca:base:select: Auto-selecting mtl components [dahu138:115763] 
> mca:base:select:(  mtl) Querying component [psm2] [dahu138:115763] 
> mca:base:select:(  mtl) Query of component [psm2] set priority to 40 
> [dahu138:115763] mca:base:select:(  mtl) Selected component [psm2] 
> [dahu138:115763] select: initializing mtl component psm2 [dahu138:115761] 
> select: init returned success [dahu138:115761] select: component psm2 
> selected [dahu138:115762] select: init returned success [dahu138:115762] 
> select: com

Re: [OMPI users] MPMD hostfile: executables on same hosts

2020-12-21 Thread Ralph Castain via users
You want to use the "sequential" mapper and then specify each proc's location, 
like this for your hostfile:

host1
host1
host2
host2
host3
host3
host1
host2
host3

and then add "--mca rmaps seq" to your mpirun cmd line.
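For example, combined with the MPMD command from the original question (an
untested sketch; "seq-hosts" is just a placeholder name for the hostfile above):

mpirun --mca rmaps seq -hostfile seq-hosts -np 6 ./EXE1 : -np 3 ./EXE2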

Ralph


On Dec 21, 2020, at 5:22 AM, Vineet Soni via users mailto:users@lists.open-mpi.org> > wrote:

Hello,

I'm having trouble using the MPMD hostfile in which I want to place 2 
executables on the same nodes.

For example, I can do this using Intel MPI by:
$ mpirun -machine host-file -n 6 ./EXE1 : -n 3 ./EXE2
$ cat host-file
host1:2
host2:2
host3:2
host1:1
host2:1
host3:1

This would place 2 MPI processes of EXE1 and 1 MPI process of EXE2 on host1.

However, I get an error if I define the same hostname twice in the hostfile of 
OpenMPI:
$ mpirun -hostfile host-file -np 6 ./EXE1 : -np 3 ./EXE2
$ cat host-file
host1 slots=2 max_slots=3
host2 slots=2 max_slots=3
host3 slots=2 max_slots=3
host1 slots=1 max_slots=3
host2 slots=1 max_slots=3
host3 slots=1 max_slots=3

Is there a way to place both executables on the same hosts using a hostfile?

Thanks in advance.

Best,
Vineet



Re: [OMPI users] pmi.h/pmi2.h found but libpmi/libpmi missing

2020-12-20 Thread Ralph Castain via users
Did you remember to build the Slurm pmi and pmi2 libraries?  They aren't built 
by default - IIRC, you have to manually go into a subdirectory and do a "make 
install" to have them built and installed. You might check the Slurm 
documentation for details.

You also might need to add a --with-pmi-libdir=... to the OMPI configure line - 
IIRC, Slurm puts those libraries in some subdirectory that may not be 
immediately obvious.
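For example, since the Slurm install described below lives under /usr/local,
something along these lines (paths are illustrative — point them at wherever
libpmi/libpmi2 actually end up after the extra "make install" step):

./configure --with-pmi=/usr/local --with-pmi-libdir=/usr/local/lib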


> On Dec 20, 2020, at 6:52 AM, Philipp Heckmann via users 
>  wrote:
> 
> Hello everyone,
> 
> 
> I'm new to this list and as well to OpenMPI. I hope this is not a stupid
> question:
> 
> I use following machine:
> Operating System: CentOS Linux 7 (Core)
> CPE OS Name: cpe:/o:centos:centos:7
> Kernel: Linux 3.10.0-1127.19.1.el7.x86_64
> Architecture: x86-64
> 
> In the FAQ I found this post:
> 
> Does Open MPI support "srun -n X my_mpi_application"?
> 
> This is exactly what I want to achieve, starting MPI-Jobs using slurms
> run or sbatch without the mpirun command. So I installed slurm 20.02.5
> and now I tried to configure OpenMPI 3.1.6 with ./configure ./configure
> —with-pmi=/usr
> 
> 
> and this is the output:
> 
> checking if user requested PMI support... yes
> checking for pmi.h in /usr... not found
> checking for pmi.h in /usr/include... found
> checking pmi.h usability... yes
> checking pmi.h presence... yes
> checking for pmi.h... yes
> checking for libpmi in /usr/lib... checking for libpmi in /usr/lib64...
> not found
> checking for pmi2.h in /usr... not found
> checking for pmi2.h in /usr/include... found
> checking pmi2.h usability... yes
> checking pmi2.h presence... yes
> checking for pmi2.h... yes
> checking for libpmi2 in /usr/lib... checking for libpmi2 in
> /usr/lib64... not found
> checking can PMI support be built... no
> configure: WARNING: PMI support requested (via --with-pmi) but neither
> pmi.h
> configure: WARNING: nor pmi2.h were found under locations:
> configure: WARNING: /usr
> configure: WARNING: /usr/slurm
> configure: WARNING: Specified path: /usr
> configure: WARNING: OR neither libpmi nor libpmi2 were found under:
> configure: WARNING: /lib
> configure: WARNING: /lib64
> configure: WARNING: Specified path: 
> configure: error: Aborting
> 
> So it finds the pmi.h and pmi2.h but not the libpmi and libpmi2. Is
> there a way to install pmi2 first? Where do I find this? Slurm is
> installed in /usr/local but I don't find the files libpmi or libpmi2
> anywhere on the machine. 
> Pmix is not necessary because the cluster is quite small with 5 nodes
> with 64 cpus in total.
> 
> Any help is appreciated. Thanks in advance and all the Best!
> 
> Philipp Heckmann
> 
> 
> 
> 
> 
> 
> 




Re: [OMPI users] MPI-IO on Lustre - OMPIO or ROMIO?

2020-12-02 Thread Ralph Castain via users
Just a point to consider. OMPI does _not_ want to get in the mode of modifying 
imported software packages. That is a blackhole of effort we simply cannot 
afford.

The correct thing to do would be to flag Rob Latham on that PR and ask that he 
upstream the fix into ROMIO so we can absorb it. We shouldn't be committing 
such things directly into OMPI itself.

It's called "working with the community" as opposed to taking a point-solution 
approach :-)


> On Dec 2, 2020, at 8:46 AM, Mark Dixon via users  
> wrote:
> 
> Hi Mark,
> 
> Thanks so much for this - yes, applying that pull request against ompi 4.0.5 
> allows hdf5 1.10.7's parallel tests to pass on our Lustre filesystem.
> 
> I'll certainly be applying it on our local clusters!
> 
> Best wishes,
> 
> Mark
> 
> On Tue, 1 Dec 2020, Mark Allen via users wrote:
> 
>> At least for the topic of why romio fails with HDF5, I believe this is the 
>> fix we need (has to do with how romio processes the MPI datatypes in its 
>> flatten routine).  I made a different fix a long time ago in SMPI for that, 
>> then somewhat more recently it was re-broken and I had to re-fix it.  So 
>> the below takes a little more aggressive approach, not totally redesigning 
>> the flatten function, but taking over how the array size counter is handled. 
>> https://github.com/open-mpi/ompi/pull/3975
>> Mark Allen
>>  




Re: [OMPI users] PRRTE DVM: how to tell prun to not share nodes among prun jobs?

2020-11-14 Thread Ralph Castain via users
That would be very kind of you and most welcome!

> On Nov 14, 2020, at 12:38 PM, Alexei Colin  wrote:
> 
> On Sat, Nov 14, 2020 at 08:07:47PM +0000, Ralph Castain via users wrote:
>> IIRC, the correct syntax is:
>> 
>> prun -host +e ...
>> 
>> This tells PRRTE that you want empty nodes for this application. You can 
>> even specify how many empty nodes you want:
>> 
>> prun -host +e:2 ...
>> 
>> I haven't tested that in a bit, so please let us know if it works or not so 
>> we can fix it if necessary.
> 
> Works! Thank you.
> 
> $ prun --map-by ppr:64:node --host +e -n 1  ./mpitest &
> $ prun --map-by ppr:64:node --host +e -n 1  ./mpitest &
> 
>   MPI World size = 1 processes
>   Hello World from rank 0 running on nid03835 (hostname nid03835)!
> 
>   MPI World size = 1 processes
>   Hello World from rank 0 running on nid03834 (hostname nid03834)!
> 
> Should I PR a patch to prun manpage to change this:
> 
>   -H, -host, --host 
> List of hosts on which to invoke processes.
> 
> to something like this?:
> 
>   -H, -host, --host 
> List of hosts on which to invoke processes. Pass
>+e to allocate only onto empty nodes (i.e. none of
>whose cores have been allocated to other prun jobs) or
>+e:N to allocate to nodes at least N of which are empty.




Re: [OMPI users] PRRTE DVM: how to tell prun to not share nodes among prun jobs?

2020-11-14 Thread Ralph Castain via users
IIRC, the correct syntax is:

prun -host +e ...

This tells PRRTE that you want empty nodes for this application. You can even 
specify how many empty nodes you want:

prun -host +e:2 ...

I haven't tested that in a bit, so please let us know if it works or not so we 
can fix it if necessary.

As for the queue - we do plan to add a queue to PRRTE in 1st quarter next year. 
Wasn't really thinking of a true scheduler - just a FIFO queue for now.


> On Nov 14, 2020, at 11:52 AM, Alexei Colin via users 
>  wrote:
> 
> Hi, in context of the PRRTE Distributed Virtual Machine, is there a way
> to tell the task mapper inside prun to not share a node across separate
> prun jobs?
> 
> For example, inside a resource allocation from Cobalt/ALPS: 2 nodes with
> 64 cores each:
> 
> prte --daemonize
> prun ... &
> ...
> prun ... &
> pterm
> 
> Scenario A:
> 
> $ prun --map-by ppr:64:node -n 64 ./mpitest &
> $ prun --map-by ppr:64:node -n 64 ./mpitest &
> 
>   MPI World size = 64 processes
>   Hello World from rank 0 running on nid03834 (hostname nid03834)!
>   ...
>   Hello World from rank 63 running on nid03834 (hostname nid03834)!
> 
>   MPI World size = 64 processes
>   Hello World from rank 0 running on nid03835 (hostname nid03835)!
>   ...
>   Hello World from rank 63 running on nid03835 (hostname nid03835)!
> 
> Scenario B:
> 
> $ prun --map-by ppr:64:node -n 1 ./mpitest &
> $ prun --map-by ppr:64:node -n 1 ./mpitest &
> 
>   MPI World size = 1 processes
>   Hello World from rank 0 running on nid03834 (hostname nid03834)!
> 
>   MPI World size = 1 processes
>   Hello World from rank 0 running on nid03834 (hostname nid03834)!
> 
> The question is: in Scneario B, how to tell prun that node nid03834
> should not be used for the second prun job, because this node is already
> (partially) occupied by a different prun instance job?
> 
> Scenario A implies that the DVM already tracks occupancy, so the
> question is just how to tell the mapper to treat a free core on a free
> node differently from a free core on a partially occupied node. The
> --map-by :NOOVERSUBSCRIBE does not look like the answer since there's
> no oversubscription of cores, right? Would need something like --map-by
> :exclusive:node? If not supported, how hard would it be for me to patch?
> 
> Potential workarounds I can think of is to fill the unoccupied cores on
> partially occupied nodes with dummy jobs with --host pointing to the
> partially occupied nodes and a -n count matching the number of
> unoccupied cores, but is this even doable? also requires dumping the
> mapping from each prun which I am unable to achive with --map-by
> :DISPLAY (works with mpirun but not with prun).
> 
> Or, run a Flux instance [1] instead of the PRRTE DVM on the resource
> allocation, which seems similar but features a scheduler with a queue (a
> feature proposed for the PRRTE DVM on the list earlier [1]). I am
> guessing that Flux has the flexibility to this exclusive node mapping,
> but not sure.
> 
> The DVM is proving to be very useful to deal with restrictions on
> minimum nodecount per job on some HPC clusters, by batching many small
> jobs into one job. A queue would be even more useful, but even without a
> queue it is still useful for batching sets of jobs which are known to
> fit on an allocation in parallel (i.e. without having to wait at all).
> 
> [1] https://flux-framework.readthedocs.io/en/latest/quickstart.html
> [2] https://www.mail-archive.com/users@lists.open-mpi.org/msg30692.html
> 
> OpenMPI: commit 7a922c8774b184ecb3aa1cd06720390bd9200b50
> Fri Nov 6 08:48:29 2020 -0800
> PRRTE: commit 37dd45c4d9fe973df1000f1a1421c2718fd80050
> Fri Nov 6 12:45:38 2020 -0600
> 
> Thank you.




Re: [OMPI users] [External] Re: mpi/pmix: ERROR: Error handler invoked: status = -25: No such file or directory (2)

2020-11-12 Thread Ralph Castain via users
Yeah - this can be safely ignored. Basically, what's happening is an async 
cleanup of a tmp directory and the code is barking that it wasn't found 
(because it was already deleted).


> On Nov 12, 2020, at 8:16 AM, Prentice Bisbal via users 
>  wrote:
> 
> I should give more background. In the slurm error log for this job, there was 
> another error about a memcpy operation failing listed first, so that caused 
> the job to fail. I suspect these errors below are the result of the other MPI 
> ranks being killed in a not exactly simultaneous manner, which is to be 
> expected. I just want to make sure that this was the case, and the error 
> below wasn't a sign of another issue with the job.
> 
> Prentice
> 
> On 11/11/20 5:47 PM, Ralph Castain via users wrote:
>> Looks like it is coming from the Slurm PMIx plugin, not OMPI.
>> 
>> Artem - any ideas?
>> Ralph
>> 
>> 
>>> On Nov 11, 2020, at 10:03 AM, Prentice Bisbal via users 
>>>  wrote:
>>> 
>>> One of my users recently reported a failed job that was using OpenMPI 4.0.4 
>>> compiled with PGI 20.4. There  two different errors reported. One was 
>>> reported once, and I think had nothing to do with OpenMPI or PMIX, and then 
>>> this error was repeated multiple times in the Slurm error output for the 
>>> job:
>>> 
>>> pmixp_client_v2.c:210 [_errhandler] mpi/pmix: ERROR: Error handler invoked: 
>>> status = -25: No such file or directory (2)
>>> 
>>> Anyone else see this before? Any idea what would cause this error? I did a 
>>> google search but couldn't find any discussion of this error anywhere.
>>> 
>>> -- 
>>> Prentice
>>> 
>> 
> -- 
> Prentice Bisbal
> Lead Software Engineer
> Research Computing
> Princeton Plasma Physics Laboratory
> http://www.pppl.gov
> 




Re: [OMPI users] mpi/pmix: ERROR: Error handler invoked: status = -25: No such file or directory (2)

2020-11-11 Thread Ralph Castain via users
Looks like it is coming from the Slurm PMIx plugin, not OMPI.

Artem - any ideas?
Ralph


> On Nov 11, 2020, at 10:03 AM, Prentice Bisbal via users 
>  wrote:
> 
> One of my users recently reported a failed job that was using OpenMPI 4.0.4 
> compiled with PGI 20.4. There  two different errors reported. One was 
> reported once, and I think had nothing to do with OpenMPI or PMIX, and then 
> this error was repeated multiple times in the Slurm error output for the job:
> 
> pmixp_client_v2.c:210 [_errhandler] mpi/pmix: ERROR: Error handler invoked: 
> status = -25: No such file or directory (2)
> 
> Anyone else see this before? Any idea what would cause this error? I did a 
> google search but couldn't find any discussion of this error anywhere.
> 
> -- 
> Prentice
> 




Re: [OMPI users] Starting a mixed fortran python MPMD application

2020-11-04 Thread Ralph Castain via users
Afraid I would have no idea - all I could tell them is that there was a bug and 
it has been fixed

On Nov 2, 2020, at 12:18 AM, Andrea Piacentini via users 
mailto:users@lists.open-mpi.org> > wrote:

 
 I installed version 4.0.5 and the problem appears to be fixed.
 Can you please help us explaining it to our users pointing to the commit (at 
least to the first release tag) solving this issue?
 
 Thank you very much
 Andrea Piacentini
 
-- 
 
 Andrea Piacentini 
 piacentini.p...@gmail.com 
 Tel. +39/3356007579 
 skype: andrea3.14 
 



Re: [OMPI users] Starting a mixed fortran python MPMD application

2020-10-28 Thread Ralph Castain via users
Could you please tell us what version of OMPI you are using?


On Oct 28, 2020, at 11:16 AM, Andrea Piacentini via users 
mailto:users@lists.open-mpi.org> > wrote:

 
 Good morning we need to launch a MPMD application with two fortran excutables 
and one interpreted python (mpi4py) application.
 
 The natural ordering in the resulting comm_world would come from
 
 mpirun -np 4 fortranexe_1 : -np 4 python3 pyapp.py : -np 2 fortranexe_2

 
 but openmpi seems to neglect (or lose at start time) the second fortran 
executable.
 Notice that making pyapp.py executable with a shebang and avoiding 
mentioning python3 does not change the behaviour.
 Intelmpi can run the application.
 
 We tried (but we would prefer not for some assumptions on the position in the 
comm world) to change the program order and indeed
 
 mpirun -np 4 fortranexe_1 : -np 2 fortranexe_2 : -np 4 python3 pyapp.py

 
 runs to the end.
 
 I did not find any mention in the documentation or in the forums about a 
prescribed order for multilanguage applications, nor an alternative suggested 
syntax.
 
 Thank you for your help
 
 Andrea Piacentini
 (CERFACS Toulouse)
 
 
-- 
 
 Andrea Piacentini 
 piacentini.p...@gmail.com 
 Tel. +39/3356007579 
 skype: andrea3.14 
 



Re: [OMPI users] Limiting IP addresses used by OpenMPI

2020-09-30 Thread Ralph Castain via users
I'm not sure where you are looking, but those params are indeed present in the 
opal/mca/btl/tcp component:

/*
 *  Called by MCA framework to open the component, registers
 *  component parameters.
 */

static int mca_btl_tcp_component_register(void)
{
    char* message;

    /* register TCP component parameters */
    mca_btl_tcp_param_register_string("if_include", "Comma-delimited list of 
devices and/or CIDR notation of networks to use for MPI communication (e.g., 
\"eth0,192.168.0.0/16\").  Mutually exclusive with btl_tcp_if_exclude.", "", 
OPAL_INFO_LVL_1, &mca_btl_tcp_component.tcp_if_include);

    mca_btl_tcp_param_register_string("if_exclude", "Comma-delimited list of 
devices and/or CIDR notation of networks to NOT use for MPI communication -- 
all devices not matching these specifications will be used (e.g., 
\"eth0,192.168.0.0/16\").  If set to a non-default value, it is mutually 
exclusive with btl_tcp_if_include.",
                                      "127.0.0.1/8,sppp",
                                      OPAL_INFO_LVL_1, 
&mca_btl_tcp_component.tcp_if_exclude);


I added a little padding to make them clearer. This was from the v3.1.x branch, 
but those params have been there for a very long time. The 
"mca_btl_tcp_param_register_string" function adds the "btl_tcp_" prefix to the 
param.


On Sep 4, 2020, at 5:39 PM, Charles Doland via users mailto:users@lists.open-mpi.org> > wrote:

Joseph,

There is no specific case. We are working on supporting the use of OpenMPI with 
our software, in addition to Intel MPI. With Intel MPI, we find that using the 
I_MPI_TCP_NETMASK or I_MPI_NETMASK environment variables is useful in many 
cases in which the job hosts have multiple network interfaces.

I tried to use btl_tcp_if_include and btl_tcp_if_exclude, but neither seemed to 
have any effect. I also noticed that these options do not appear to be present 
in the source code. Although there were similar options for ptl in the source, 
my understanding is that ptl has been replaced by btl. I tested using version 
3.1.2. The source that I examined was also version 3.1.2.

Charles Doland
charles.dol...@ansys.com  
(408) 627-6621  [x6621]


From: users mailto:users-boun...@lists.open-mpi.org> > on behalf of Joseph Schuchart via 
users mailto:users@lists.open-mpi.org> >
Sent: Tuesday, September 1, 2020 1:50 PM
To: users@lists.open-mpi.org   
mailto:users@lists.open-mpi.org> >
Cc: Joseph Schuchart mailto:schuch...@hlrs.de> >
Subject: Re: [OMPI users] Limiting IP addresses used by OpenMPI
 [External Sender]

Charles,

What is the machine configuration you're running on? It seems that there
are two MCA parameter for the tcp btl: btl_tcp_if_include and
btl_tcp_if_exclude (see ompi_info for details). There may be other knobs

I'm not aware of. If you're using UCX then my guess is that UCX has its
own way to choose the network interface to be used...

Cheers
Joseph

On 9/1/20 9:35 PM, Charles Doland via users wrote:
> Yes. It is not unusual to have multiple network interfaces on each host
> of a cluster. Usually there is a preference to use only one network
> interface on each host due to higher speed or throughput, or other
> considerations. It would be useful to be able to explicitly specify the
> interface to use for cases in which the MPI code does not select the
> preferred interface.
>
> Charles Doland
> charles.dol...@ansys.com  
>  >
> (408) 627-6621  [x6621]
> 
> *From:* users   > on behalf of John
> Hearns via users mailto:users@lists.open-mpi.org> >
> *Sent:* Tuesday, September 1, 2020 12:22 PM
> *To:* Open MPI Users   >
> *Cc:* John Hearns mailto:hear...@gmail.com> >
> *Subject:* Re: [OMPI users] Limiting IP addresses used by OpenMPI
>
> *[External Sender]*
>
> Charles, I recall using the I_MPI_NETMASK to choose which interface for
> MPI to use.
> I guess you are asking the same question for OpenMPI?
>
> On Tue, 1 Sep 2020 at 17:03, Charles Doland via users
> mailto:users@lists.open-mpi.org> 
>  >> wrote:
>
> Is there a way to limit the IP addresses or network interfaces used
> for communication by OpenMPI? I am looking for something similar to
> the I_MPI_TCP_NETMASK or I_MPI_NETMASK environment variables for
> Intel MPI.
>
> The OpenMPI documentation mentions the btl_tcp_if_include
> and btl_tcp_if_exclude MCA options. These do not  appear to be
> present, at least in OpenMPI v3.1.2. Is there another way to do
> this? Or are these options supported in a different version?
>
> Charles Doland
> charles.dol...@ansys.com  
>

Re: [OMPI users] OMPI 4.0.4 how to use mpirun properly in numa architecture

2020-08-20 Thread Ralph Castain via users
Feel free to holler if you run into trouble - it should be relatively easy to 
build and use PRRTE if you have done so for OMPI

On Aug 20, 2020, at 10:49 AM, Carlo Nervi mailto:carlo.ne...@unito.it> > wrote:

Thank you Ralph for the suggestion!
I will carefully consider it, although I'm a chemist and not a sysadmin (I miss 
a lot a specialized sysadmin in our Department!).
Carlo


Il giorno gio 20 ago 2020 alle ore 18:45 Ralph Castain via users 
mailto:users@lists.open-mpi.org> > ha scritto:
Your use-case sounds more like a workflow than an application - in which case, 
you probably should be using PRRTE to execute it instead of "mpirun" as PRRTE 
will "remember" the multiple jobs and avoid the overload scenario you describe.

This link will walk you thru how to get and build it: 
https://openpmix.github.io/code/getting-the-pmix-reference-server

This link will provide some directions on how to use it: 
https://openpmix.github.io/support/how-to/running-apps-under-psrvr

You would need to update your script to use "prun --personality ompi" instead 
of "mpirun", but things should otherwise be the same.


On Aug 20, 2020, at 9:01 AM, Carlo Nervi via users mailto:users@lists.open-mpi.org> > wrote:

Thank you, Christoph. I did not consider the --cpu-list.
However, this is okay if I have a single script that is launching several jobs 
(please note that each job may have a different number of CPUs). In my case I 
have the same script (that launches mpirun), which is called many times. The 
script is periodically called till all 48 CPUs are busy, and whenever other 
jobs are finished...

If the script will be the same, I'm afraid that all the jobs subsequent to the 
first will be bound to the same cpulist. Am I wrong?
mpirun -n xx --cpu-list "$(seq -s, 0 47)" --bind-to cpu-list:ordered $app
(with xx < 48) will bind all the processes to the first 6 CPUs?
I need a way to dynamically manage the cpu list, but I was hoping that mpirun 
could do that task.
If not, I'm afraid that the simplest solution is to use --bind-to none
Thank you,
Carlo


Il giorno gio 20 ago 2020 alle ore 17:24 Christoph Niethammer 
mailto:nietham...@hlrs.de> > ha scritto:
Hello Carlo,

If you execute multiple mpirun commands they will not know about each others 
resource bindings.
E.g. if you bind to cores each mpirun will start with the same core to assign 
with again.
This results then in over subscription of the cores, which slows down your 
programs - as you did realize.


You can use "--cpu-list" together with "--bind-to cpu-list:ordered"
So if you start all your simulations in a single script this would look like

mpirun -n 6 --cpu-list "$(seq -s, 0 5)" --bind-to cpu-list:ordered  $app
mpirun -n 6 --cpu-list "$(seq -s, 6 11)" --bind-to cpu-list:ordered  $app
...
mpirun -n 6 --cpu-list "$(seq -s, 42 47)" --bind-to cpu-list:ordered  $app


Best
Christoph


- Original Message -
From: "Open MPI Users" mailto:users@lists.open-mpi.org> >
To: "Open MPI Users" mailto:users@lists.open-mpi.org> >
Cc: "Carlo Nervi" mailto:carlo.ne...@unito.it> >
Sent: Thursday, 20 August, 2020 12:17:21
Subject: [OMPI users] OMPI 4.0.4 how to use mpirun properly in numa architecture

Dear OMPI community,
I'm a simple end-user with no particular experience.
I compile quantum chemical programs and use them in parallel.

My system is a 4 socket, 12 core per socket Opteron 6168 system for a total
of 48 cores and 64 Gb of RAM. It has 8 NUMA nodes:

openmpi $ hwloc-info
depth 0:           1 Machine (type #0)
 depth 1:          4 Package (type #1)
  depth 2:         8 L3Cache (type #6)
   depth 3:        48 L2Cache (type #5)
    depth 4:       48 L1dCache (type #4)
     depth 5:      48 L1iCache (type #9)
      depth 6:     48 Core (type #2)
       depth 7:    48 PU (type #3)
Special depth -3:  8 NUMANode (type #13)
Special depth -4:  3 Bridge (type #14)
Special depth -5:  5 PCIDev (type #15)
Special depth -6:  5 OSDev (type #16)

lstopo:

openmpi $ lstopo
Machine (63GB total)
  Package L#0
    L3 L#0 (5118KB)
      NUMANode L#0 (P#0 7971MB)
      L2 L#0 (512KB) + L1d L#0 (64KB) + L1i L#0 (64KB) + Core L#0 + PU L#0
(P#0)
      L2 L#1 (512KB) + L1d L#1 (64KB) + L1i L#1 (64KB) + Core L#1 + PU L#1
(P#1)
      L2 L#2 (512KB) + L1d L#2 (64KB) + L1i L#2 (64KB) + Core L#2 + PU L#2
(P#2)
      L2 L#3 (512KB) + L1d L#3 (64KB) + L1i L#3 (64KB) + Core L#3 + PU L#3
(P#3)
      L2 L#4 (512KB) + L1d L#4 (64KB) + L1i L#4 (64KB) + Core L#4 + PU L#4
(P#4)
      L2 L#5 (512KB) + L1d L#5 (64KB) + L1i L#5 (64KB) + Core L#5 + PU L#5
(P#5)
      HostBridge
        PCIBridge
          PCI 02:00.0 (Ethernet)
            Net "enp2s0f0"
          PCI 02:00.1 (Ethernet)
            Net "enp2s0f1"
        PCI 00:11.0 (RAID)
          Block(Disk) "sdb"
          Block(Disk

Re: [OMPI users] OMPI 4.0.4 how to use mpirun properly in numa architecture

2020-08-20 Thread Ralph Castain via users
Your use-case sounds more like a workflow than an application - in which case, 
you probably should be using PRRTE to execute it instead of "mpirun" as PRRTE 
will "remember" the multiple jobs and avoid the overload scenario you describe.

This link will walk you thru how to get and build it: 
https://openpmix.github.io/code/getting-the-pmix-reference-server

This link will provide some directions on how to use it: 
https://openpmix.github.io/support/how-to/running-apps-under-psrvr

You would need to update your script to use "prun --personality ompi" instead 
of "mpirun", but things should otherwise be the same.


On Aug 20, 2020, at 9:01 AM, Carlo Nervi via users mailto:users@lists.open-mpi.org> > wrote:

Thank you, Christoph. I did not consider the --cpu-list.
However, this is okay if I have a single script that is launching several jobs 
(please note that each job may have a different number of CPUs). In my case I 
have the same script (that launches mpirun), which is called many times. The 
script is periodically called till all 48 CPUs are busy, and whenever other 
jobs are finished...

If the script will be the same, I'm afraid that all the jobs subsequent to the 
first will be bound to the same cpulist. Am I wrong?
mpirun -n xx --cpu-list "$(seq -s, 0 47)" --bind-to cpu-list:ordered $app
(with xx < 48) will bind all the processes to the first 6 CPUs?
I need a way to dynamically manage the cpu list, but I was hoping that mpirun 
could do that task.
If not, I'm afraid that the simplest solution is to use --bind-to none
Thank you,
Carlo


Il giorno gio 20 ago 2020 alle ore 17:24 Christoph Niethammer 
mailto:nietham...@hlrs.de> > ha scritto:
Hello Carlo,

If you execute multiple mpirun commands they will not know about each others 
resource bindings.
E.g. if you bind to cores each mpirun will start with the same core to assign 
with again.
This results then in over subscription of the cores, which slows down your 
programs - as you did realize.


You can use "--cpu-list" together with "--bind-to cpu-list:ordered"
So if you start all your simulations in a single script this would look like

mpirun -n 6 --cpu-list "$(seq -s, 0 5)" --bind-to cpu-list:ordered  $app
mpirun -n 6 --cpu-list "$(seq -s, 6 11)" --bind-to cpu-list:ordered  $app
...
mpirun -n 6 --cpu-list "$(seq -s, 42 47)" --bind-to cpu-list:ordered  $app


Best
Christoph


- Original Message -
From: "Open MPI Users" mailto:users@lists.open-mpi.org> >
To: "Open MPI Users" mailto:users@lists.open-mpi.org> >
Cc: "Carlo Nervi" mailto:carlo.ne...@unito.it> >
Sent: Thursday, 20 August, 2020 12:17:21
Subject: [OMPI users] OMPI 4.0.4 how to use mpirun properly in numa architecture

Dear OMPI community,
I'm a simple end-user with no particular experience.
I compile quantum chemical programs and use them in parallel.

My system is a 4 socket, 12 core per socket Opteron 6168 system for a total
of 48 cores and 64 Gb of RAM. It has 8 NUMA nodes:

openmpi $ hwloc-info
depth 0:           1 Machine (type #0)
 depth 1:          4 Package (type #1)
  depth 2:         8 L3Cache (type #6)
   depth 3:        48 L2Cache (type #5)
    depth 4:       48 L1dCache (type #4)
     depth 5:      48 L1iCache (type #9)
      depth 6:     48 Core (type #2)
       depth 7:    48 PU (type #3)
Special depth -3:  8 NUMANode (type #13)
Special depth -4:  3 Bridge (type #14)
Special depth -5:  5 PCIDev (type #15)
Special depth -6:  5 OSDev (type #16)

lstopo:

openmpi $ lstopo
Machine (63GB total)
  Package L#0
    L3 L#0 (5118KB)
      NUMANode L#0 (P#0 7971MB)
      L2 L#0 (512KB) + L1d L#0 (64KB) + L1i L#0 (64KB) + Core L#0 + PU L#0
(P#0)
      L2 L#1 (512KB) + L1d L#1 (64KB) + L1i L#1 (64KB) + Core L#1 + PU L#1
(P#1)
      L2 L#2 (512KB) + L1d L#2 (64KB) + L1i L#2 (64KB) + Core L#2 + PU L#2
(P#2)
      L2 L#3 (512KB) + L1d L#3 (64KB) + L1i L#3 (64KB) + Core L#3 + PU L#3
(P#3)
      L2 L#4 (512KB) + L1d L#4 (64KB) + L1i L#4 (64KB) + Core L#4 + PU L#4
(P#4)
      L2 L#5 (512KB) + L1d L#5 (64KB) + L1i L#5 (64KB) + Core L#5 + PU L#5
(P#5)
      HostBridge
        PCIBridge
          PCI 02:00.0 (Ethernet)
            Net "enp2s0f0"
          PCI 02:00.1 (Ethernet)
            Net "enp2s0f1"
        PCI 00:11.0 (RAID)
          Block(Disk) "sdb"
          Block(Disk) "sdc"
          Block(Disk) "sda"
        PCI 00:14.1 (IDE)
        PCIBridge
          PCI 01:04.0 (VGA)
    L3 L#1 (5118KB)
      NUMANode L#1 (P#1 8063MB)
      L2 L#6 (512KB) + L1d L#6 (64KB) + L1i L#6 (64KB) + Core L#6 + PU L#6
(P#6)
      L2 L#7 (512KB) + L1d L#7 (64KB) + L1i L#7 (64KB) + Core L#7 + PU L#7
(P#7)
      L2 L#8 (512KB) + L1d L#8 (64KB) + L1i L#8 (64KB) + Core L#8 + PU L#8
(P#8)
      L2 L#9 (512KB) + L1d L#9 (64KB) + L1i L#9 (64KB) + Core L#9 + PU L#9
(P#9)
      L2 L#10 (512KB) + L1d L#10 (64KB) + L1i L#10 (64KB) + Core L#10 + PU
L#10 (P#10)
      L2 L#11 (512KB) + L1d L#11 (64KB) + L1i L#11 (64KB) + Core L#11 + PU
L#11 (P#11)
  Package L#1
    L3 L#2 (5118KB)
      NUMANode L#2 (P#2 8063MB)
  

Re: [OMPI users] Issues with MPI_Comm_Spawn

2020-08-12 Thread Ralph Castain via users
No, I'm afraid that won't do anything - all the info key does is tell the 
launcher (mpirun or whatever you used) that you would like it to collect and 
distribute the stdin. However, it will just be ignored if you don't have the 
launcher there to provide that support.

How are you planning on starting these processes?
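For comparison, a sketch of the mpirun-based route mentioned in the quoted post 
below, where the launcher collects the terminal's stdin and forwards it to every 
rank (the rank count here is arbitrary):

 mpirun -np 4 -stdin all ./Binaryname.bin < inputfilename.dat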


On Aug 12, 2020, at 9:59 AM, Alvaro Payero Pinto via users <users@lists.open-mpi.org> wrote:

I was planning to create an MPI_Info variable with "ompi_stdin_target" as key and 
"all" as value. Is that not the way to achieve this goal?

On Wed, 12 Aug 2020, 18:29, Ralph Castain via users <users@lists.open-mpi.org> wrote:
Setting aside the known issue with comm_spawn in v4.0.4, how are you planning 
to forward stdin without the use of "mpirun"? Something has to collect stdin of 
the terminal and distribute it to the stdin of the processes

> On Aug 12, 2020, at 9:20 AM, Alvaro Payero Pinto via users 
> mailto:users@lists.open-mpi.org> > wrote:
> 
> Hi,
> 
> I’m using OpenMPI 4.0.4, where the Fortran side has been compiled with Intel 
> Fortran suite v17.0.6 and the C/C++ side with GNU suite v4.8.5 due to 
> requirements I cannot modify.
> 
> I am trying to parallelise a Fortran application by dynamically creating 
> processes on the fly with “MPI_Comm_Spawn” subroutine. The application starts 
> with only one parent and it takes a file through standard input, but it 
> looks like children are not inheriting the access to such file. I would like 
> to have all children processes inheriting the standard input of the parent. 
> I’m aware that perhaps the “-stdin all” argument of the “mpirun“ binary might 
> do it, but I am attempting to execute the binary without mpirun unless 
> strictly necessary.
>  
> So far, I have already tried to pass a non-null “MPI_Info“ to 
> “MPI_Comm_Spawn” with a key of “ompi_stdin_target“ and a value of “all” but 
> it does not work. I have also tried other values (none, 0, 1, -1, etc.) 
> without success either.
> 
> Here is the subroutine provoking the error at the MPI_Comm_spawn call:

> 
> ===
> SUBROUTINE ADmn_createSpawn(iNumberChilds, iCommBigWorld, iIDMpi,    
> iNumberProcess)
>     IMPLICIT NONE
> 
>     !ID of the communicator that contains all the process
>     INTEGER:: iCommBigWorld
>     !Number of child process
>     INTEGER :: iNumberChilds
>     INTEGER:: iNumberProcess
>      CHARACTER(LEN=1)                         :: arguments(1)
>      INTEGER                                  :: bigWorld, iC, iInic, iFinal;
>      INTEGER                                  :: ierror
>       INTEGER                                  :: iIDFamiliar=0;
> 
>     CHARACTER(LEN=128)        :: command
>     INTEGER                               :: iInfoMPI
>     CHARACTER(LEN=*), Parameter  :: key=" ompi_stdin_target ",valueMPI= "all";
>     logical :: FLAG
> 
>     !Id number of the current process
>     INTEGER :: iIDMpi
> 
>     CALL GET_COMMAND_ARGUMENT(0, command)
>     CALL MPI_Comm_get_parent(iParent, ierror)
> 
>     IF (iParent .EQ. MPI_COMM_NULL) THEN
>         arguments(1) = ''
>         iIDFamiliar = 0;
> 
>         call MPI_INFO_CREATE(iInfoMPI, ierror)
>         call MPI_INFO_SET(iInfoMPI, key, valueMPI, ierror)
> 
>         CALL MPI_Comm_spawn(command, arguments, iNumberChilds, iInfoMPI, 0, 
>MPI_COMM_WORLD, iChild, iSpawn_error, ierror)
>        
>         CALL MPI_INTERCOMM_MERGE(iChild, .false., iCommBigWorld, ierror)
>     ELSE
>         call MPI_COMM_RANK(MPI_COMM_WORLD, iIDFamiliar, ierror)
> 
>         iIDFamiliar = iIDFamiliar + 1;
> 
>         CALL MPI_INTERCOMM_MERGE(iParent, .true., iCommBigWorld, ierror)
>     END IF
> 
>     CALL MPI_COMM_RANK(iCommBigWorld,iIDMpi,ierror)
>     call MPI_COMM_SIZE(iCommBigWorld, intasks, ierror)
>     iProcessIDInternal = iIDMpi
>     iNumberProcess = intasks
> 
> END SUBROUTINE ADmn_createSpawn
> ===
> 
> Binary is executed as:
> Binaryname.bin < inputfilename.dat
> And here is the segmentation fault produced when passing the MPI_Info 
> variable to MPI_Comm_Spawn:
> 
> ===
> [sles12sp3-srv:10384] *** Process received signal ***
> [sles12sp3-srv:10384] Signal: Segmentation fault (11)
> [sles12sp3-srv:10384] Signal code: Address not mapped (1)
> [sles12sp3-srv:10384] Failing at address: 0xfffe
> [sles12sp3-srv:10384] [ 0] /lib64/libpthread.so.0(+0x10c10)[0x7fc6a8dd5c10]
> [sles12sp3-srv:10384] [ 1] 
> /usr/local/lib64/libopen-rte.so.40(pmix_server_spawn_fn+0x1052)[

Re: [OMPI users] Issues with MPI_Comm_Spawn

2020-08-12 Thread Ralph Castain via users
Setting aside the known issue with comm_spawn in v4.0.4, how are you planning 
to forward stdin without the use of "mpirun"? Something has to collect stdin of 
the terminal and distribute it to the stdin of the processes

> On Aug 12, 2020, at 9:20 AM, Alvaro Payero Pinto via users 
>  wrote:
> 
> Hi,
> 
> I’m using OpenMPI 4.0.4, where the Fortran side has been compiled with Intel 
> Fortran suite v17.0.6 and the C/C++ side with GNU suite v4.8.5 due to 
> requirements I cannot modify.
> 
> I am trying to parallelise a Fortran application by dynamically creating 
> processes on the fly with “MPI_Comm_Spawn” subroutine. The application starts 
> with only one parent and it takes a file through standard input, but it 
> looks like children are not inheriting the access to such file. I would like 
> to have all children processes inheriting the standard input of the parent. 
> I’m aware that perhaps the “-stdin all” argument of the “mpirun“ binary might 
> do it, but I am attempting to execute the binary without mpirun unless 
> strictly necessary.
>  
> So far, I have already tried to pass a non-null “MPI_Info“ to 
> “MPI_Comm_Spawn” with a key of “ompi_stdin_target“ and a value of “all” but 
> it does not work. I have also tried other values (none, 0, 1, -1, etc.) 
> without success either.
> 
> Here is the subroutine provoking the error at the MPI_Comm_spawn call:
> 
> ===
> SUBROUTINE ADmn_createSpawn(iNumberChilds, iCommBigWorld, iIDMpi,
> iNumberProcess)
> IMPLICIT NONE
> 
> !ID of the communicator that contains all the process
> INTEGER:: iCommBigWorld
> !Number of child process
> INTEGER :: iNumberChilds
> INTEGER:: iNumberProcess
>  CHARACTER(LEN=1) :: arguments(1)
>  INTEGER  :: bigWorld, iC, iInic, iFinal;
>  INTEGER  :: ierror
>   INTEGER  :: iIDFamiliar=0;
> 
> CHARACTER(LEN=128):: command
> INTEGER   :: iInfoMPI
> CHARACTER(LEN=*), Parameter  :: key=" ompi_stdin_target ",valueMPI= "all";
> logical :: FLAG
> 
> !Id number of the current process
> INTEGER :: iIDMpi
> 
> CALL GET_COMMAND_ARGUMENT(0, command)
> CALL MPI_Comm_get_parent(iParent, ierror)
> 
> IF (iParent .EQ. MPI_COMM_NULL) THEN
> arguments(1) = ''
> iIDFamiliar = 0;
> 
> call MPI_INFO_CREATE(iInfoMPI, ierror)
> call MPI_INFO_SET(iInfoMPI, key, valueMPI, ierror)
> 
> CALL MPI_Comm_spawn(command, arguments, iNumberChilds, iInfoMPI, 0, 
> MPI_COMM_WORLD, iChild, iSpawn_error, ierror)
>
> CALL MPI_INTERCOMM_MERGE(iChild, .false., iCommBigWorld, ierror)
> ELSE
> call MPI_COMM_RANK(MPI_COMM_WORLD, iIDFamiliar, ierror)
> 
> iIDFamiliar = iIDFamiliar + 1;
> 
> CALL MPI_INTERCOMM_MERGE(iParent, .true., iCommBigWorld, ierror)
> END IF
> 
> CALL MPI_COMM_RANK(iCommBigWorld,iIDMpi,ierror)
> call MPI_COMM_SIZE(iCommBigWorld, intasks, ierror)
> iProcessIDInternal = iIDMpi
> iNumberProcess = intasks
> 
> END SUBROUTINE ADmn_createSpawn
> ===
> 
> Binary is executed as:
> Binaryname.bin < inputfilename.dat
> And here is the segmentation fault produced when passing the MPI_Info 
> variable to MPI_Comm_Spawn:
> 
> ===
> [sles12sp3-srv:10384] *** Process received signal ***
> [sles12sp3-srv:10384] Signal: Segmentation fault (11)
> [sles12sp3-srv:10384] Signal code: Address not mapped (1)
> [sles12sp3-srv:10384] Failing at address: 0xfffe
> [sles12sp3-srv:10384] [ 0] /lib64/libpthread.so.0(+0x10c10)[0x7fc6a8dd5c10]
> [sles12sp3-srv:10384] [ 1] 
> /usr/local/lib64/libopen-rte.so.40(pmix_server_spawn_fn+0x1052)[0x7fc6aa283232]
> [sles12sp3-srv:10384] [ 2] 
> /usr/local/lib64/openmpi/mca_pmix_pmix3x.so(+0x46210)[0x7fc6a602b210]
> [sles12sp3-srv:10384] [ 3] 
> /usr/local/lib64/openmpi/mca_pmix_pmix3x.so(pmix_server_spawn+0x7c6)[0x7fc6a60a5ab6]
> [sles12sp3-srv:10384] [ 4] 
> /usr/local/lib64/openmpi/mca_pmix_pmix3x.so(+0xb1a2f)[0x7fc6a6096a2f]
> [sles12sp3-srv:10384] [ 5] 
> /usr/local/lib64/openmpi/mca_pmix_pmix3x.so(pmix_server_message_handler+0x41)[0x7fc6a6097511]
> [sles12sp3-srv:10384] [ 6] 
> /usr/local/lib64/openmpi/mca_pmix_pmix3x.so(OPAL_MCA_PMIX3X_pmix_ptl_base_process_msg+0x1bf)[0x7fc6a610481f]
> [sles12sp3-srv:10384] [ 7] 
> /usr/local/lib64/libopen-pal.so.40(opal_libevent2022_event_base_loop+0x8fc)[0x7fc6a9facd6c]
> [sles12sp3-srv:10384] [ 8] 
> /usr/local/lib64/openmpi/mca_pmix_pmix3x.so(+0xcf7ce)[0x7fc6a60b47ce]
> [sles12sp3-srv:10384] [ 9] /lib64/libpthread.so.0(+0x8724)[0x7fc6a8dcd724]
> [sles12sp3-srv:10384] [10] /lib64/libc.so.6(clone+0x6d)[0x7fc6a8b0ce8d]
> [sles12sp3-srv:10384] *** End of error message ***
> 
> 

Re: [OMPI users] OMPI 4.0.4 crashes (or hangs) with dynamically processes allocation. OMPI 4.0.1 don't.

2020-08-11 Thread Ralph Castain via users
Howard - if there is a problem in PMIx that is causing this, then we really 
could use a report on it ASAP, as we are getting ready to release v3.1.6 and I 
doubt we have addressed anything relevant to what is being discussed here.



On Aug 11, 2020, at 4:35 PM, Martín Morales via users <users@lists.open-mpi.org> wrote:

Hi Howard.
Great! That works for the crashing problem with OMPI 4.0.4. However, it still 
hangs if I remove “master” (the host which launches the spawning processes) 
from my hostfile.
I need to spawn only on “worker”. Is there a way or workaround to do this 
without mpirun?
Thanks a lot for your assistance.
 Martín
From: Howard Pritchard  
Sent: lunes, 10 de agosto de 2020 19:13
To: Martín Morales  
Cc: Open MPI Users  
Subject: Re: [OMPI users] OMPI 4.0.4 crashes (or hangs) with dynamically 
processes allocation. OMPI 4.0.1 don't.
 Hi Martin,
 I was able to reproduce this with 4.0.x branch.  I'll open an issue.  
If you really want to use 4.0.4, then what you'll need to do is build an 
external PMIx 3.1.2 (the PMIx that was embedded in Open MPI 4.0.1), and then 
build Open MPI using --with-pmix=<where your PMIx is installed>.
You will also need to build both Open MPI and PMIx against the same libevent; 
there is a configure option in both packages to use an external libevent 
installation.
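A sketch of that combination, with placeholder install prefixes:

 cd pmix-3.1.2
 ./configure --prefix=/opt/pmix-3.1.2 --with-libevent=/opt/libevent
 make && make install
 cd ../openmpi-4.0.4
 ./configure --prefix=/opt/openmpi-4.0.4 --with-pmix=/opt/pmix-3.1.2 --with-libevent=/opt/libevent
 make && make install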
 Howard
On Mon, 10 Aug 2020 at 13:52, Martín Morales <martineduardomora...@hotmail.com> wrote:
Hi Howard. Unfortunately the issue persists in OMPI 4.0.5rc1. Do I have to post 
this on the bug section? Thanks and regards.

 
Martín

 
From: Howard Pritchard  
Sent: lunes, 10 de agosto de 2020 14:44
To: Open MPI Users  
Cc: Martín Morales  
Subject: Re: [OMPI users] OMPI 4.0.4 crashes (or hangs) with dynamically 
processes allocation. OMPI 4.0.1 don't.
 
Hello Martin,
 
Between Open MPI 4.0.1 and Open MPI 4.0.4 we upgraded the internal PMIx version, 
which introduced a problem with spawn for the 4.0.2-4.0.4 versions.
This is supposed to be fixed in the 4.0.5 release.  Could you try the 4.0.5rc1 
tarball and see if that addresses the problem you're seeing?
 
https://www.open-mpi.org/software/ompi/v4.0/
 
Howard
 
 
 
On Thu, 6 Aug 2020 at 09:50, Martín Morales via users <users@lists.open-mpi.org> wrote:
 
Hello people!
I'm using OMPI 4.0.4 in a very simple scenario: just 2 machines, one "master" 
and one "worker", on an Ethernet LAN, both with Ubuntu 18.04. I built OMPI just 
like this:
 
./configure --prefix=/usr/local/openmpi-4.0.4/bin/
 
My hostfile is this:
 
master slots=2
worker slots=2
 
I'm trying to dynamically allocate the processes with MPI_Comm_Spawn().
If I launch the processes only on the "master" machine It's ok. But if I use 
the hostfile crashes with this:
 
--
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[35155,2],1]) is on host: nos-GF7050VT-M
  Process 2 ([[35155,1],0]) is on host: unknown!
  BTLs attempted: tcp self

Your MPI job is now going to abort; sorry.
--
[nos-GF7050VT-M:22526] [[35155,2],1] ORTE_ERROR_LOG: Unreachable in file 
dpm/dpm.c at line 493
--
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_dpm_dyn_init() failed
  --> Returned "Unreachable" (-12) instead of "Success" (0)
--
[nos-GF7050VT-M:22526] *** An error occurred in MPI_Init
[nos-GF7050VT-M:22526] *** reported by process [2303918082,1]
[nos-GF7050VT-M:22526] *** on a NULL communicator
[nos-GF7050VT-M:22526] *** Unknown error
[nos-GF7050VT-M:22526] *** MPI_ERRORS_ARE_FATAL (processes in this communicator 
will now abort,
[nos-GF7050VT-M:22526] ***    and potentially your MPI job)
 
Note: host "nos-GF7050VT-M" is "worker"
 
But if I run without "master" in the hostfile, the processes are launched but it 
hangs: MPI_Init() doesn't return.
I launched the script (pasted below) in these 2 ways with the same result:
 
$ ./simple_spawn 2
$ 

Re: [OMPI users] ORTE HNP Daemon Error - Generated by Tweaking MTU

2020-08-10 Thread Ralph Castain via users
My apologies - I should have included "--debug-daemons" for the mpirun cmd line 
so that the stderr of the backend daemons would be output.


> On Aug 10, 2020, at 10:28 AM, John Duffy via users  
> wrote:
> 
> Thanks Ralph
> 
> I will do all of that. Much appreciated.




Re: [OMPI users] ORTE HNP Daemon Error - Generated by Tweaking MTU

2020-08-10 Thread Ralph Castain via users
Well, we aren't really that picky :-)  While I agree with Gilles that we are 
unlikely to be able to help you resolve the problem, we can give you a couple 
of ideas on how to chase it down

First, be sure to build OMPI with "--enable-debug" and then try adding "--mca 
oob_base_verbose 100" to your mpirun cmd line. That will dump a bunch of 
diagnostics from the daemon-to-daemon TCP code.

Second, you can look at that code and see if you can spot something that might 
be MTU sensitive - the code is in the orte/mca/oob/tcp directory.
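Combined with the "--debug-daemons" flag mentioned earlier in this thread, a 
diagnostic run could look like this (hosts and application are placeholders):

 mpirun --host node1,node2 --debug-daemons --mca oob_base_verbose 100 ./your_app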

HTH
Ralph


> On Aug 9, 2020, at 10:37 PM, John Duffy via users  
> wrote:
> 
> Thanks Gilles
> 
> I realise this is “off topic”. I was hoping the Open-MPI ORTE/HNP message 
> might give me a clue where to look for my driver problem.
> 
> Regarding P/Q ratios, P=2 & Q=16 does indeed give me better performance.
> 
> Kind regards



Re: [OMPI users] MPI is still dominant paradigm?

2020-08-07 Thread Ralph Castain via users
The Java bindings were added specifically to support the Spark/Hadoop 
communities, so I see no reason why you couldn't use them for Akka or whatever. 
Note that there are also Python wrappers for MPI at mpi4py that you could build 
upon.

There is plenty of evidence out there for a general migration to higher-level 
programming blocks built on top of MPI, so I wouldn't call it a "pipe dream" at 
all.


> On Aug 7, 2020, at 5:26 AM, Oddo Da via users  
> wrote:
> 
> Hello,
> 
> This may be a bit of a longer post and I am not sure if it is even 
> appropriate here but I figured I ask. There are no hidden agendas in it, so 
> please treat it as "asking for opinions/advice", as opposed to judging or 
> provoking.
> 
> For the period between 2010 to 2017 I used to work in (buzzword alert!) "big 
> data" (meaning Spark, HDFS, reactive stuff like Akka) but way before that in 
> the early 2000s I used to write basic multithreaded C and some MPI code. I 
> came back to HPC/academia two years ago and what struck me was that (for lack 
> of better word) the field is still "stuck" (again, for lack of better word) 
> on MPI. This itself may seem negative in this context, however, I am just 
> stating my observation, which may be wrong.
> 
> I like low level programming and I like being in control of what is going on 
> but having had the experience in Spark and Akka, I kind of got spoiled. Yes, 
> I understand that the latter has fault-tolerance (which is nice) and MPI 
> doesn't (or at least, didn't when I played with in 1999-2005) but I always 
> felt like MPI needed higher level abstractions as a CHOICE (not _only_ 
> choice) laid over the bare metal offerings. The whole world has moved onto 
> programming in patterns and higher level abstractions, why is the 
> academic/HPC world stuck on bare metal, still? Yes, I understand that 
> performance often matters and the higher up you go, the more performance loss 
> you incur, however, there is also something to be said about developer time 
> and ease of understanding/abstracting etc. etc.
> 
> Be that as it may, I am working on a project now in the HPC world and I 
> noticed that Open MPI has Java bindings (or should I say "interface"?). What 
> is the state of those? Which JDK do they support? Most importantly, would it 
> be a HUGE pipe dream to think about building patterns a-la Akka (or even 
> mixing actual Akka implementation) on top of OpenMPI via this Java bridge? 
> What would be involved on the OpenMPI side? I have time/interest in going 
> this route if there would be any hope of coming up with something that would 
> make my life (and future people coming into HPC/MPI) easier in terms of 
> building applications. I am not saying MPI in C/C++/Fortran should go away, 
> however, sometimes we don't need the low-level stuff to express a concept 
> :-). It may also open a whole new world for people on large clusters...
> 
> Thank you!




Re: [OMPI users] Correct mpirun Options for Hybrid OpenMPI/OpenMP

2020-08-03 Thread Ralph Castain via users
By default, OMPI will bind your procs to a single core. You probably want to at 
least bind to socket (for NUMA reasons), or not bind at all if you want to use 
all the cores on the node.

So either add "--bind-to socket" or "--bind-to none" to your cmd line.
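Applied to the ppr:1:node command quoted below, that would be, for example:

 mpirun --host node1 --map-by ppr:1:node --bind-to none -x OMP_NUM_THREADS=4 xhpl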


On Aug 3, 2020, at 1:33 AM, John Duffy via users <users@lists.open-mpi.org> wrote:

Hi

I’m experimenting with hybrid OpenMPI/OpenMP Linpack benchmarks on my small 
cluster, and I’m a bit confused as to how to invoke mpirun.

I have compiled/linked HPL-2.3 with OpenMPI and libopenblas-openmp using the 
GCC -fopenmp option on Ubuntu 20.04 64-bit.

With P=1 and Q=1 in HPL.dat, if I use…

mpirun -x OMP_NUM_THREADS=4 xhpl

top reports...
 top - 08:03:59 up 1 day, 0 min,  1 user,  load average: 2.25, 1.23, 0.88
Tasks: 138 total,   2 running, 136 sleeping,   0 stopped,   0 zombie
%Cpu(s): 77.1 us, 22.2 sy,  0.0 ni,  0.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :   3793.3 total,    434.0 free,   2814.1 used,    545.2 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.    919.9 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND  
                                                                          
   5787 john      20   0 2959408   2.6g   8128 R 354.0  69.1   2:10.43 xhpl     
                                                                          
   5789 john      20   0  263352   9960   7440 S  14.2   0.3   0:07.42 xhpl     
                                                                          
   5788 john      20   0  263352   9844   7320 S  13.9   0.3   0:07.19 xhpl     
                                                                          
   5790 john      20   0  263356   9896   7376 S  13.6   0.3   0:07.17 xhpl     
                                                                          

… which seems reasonable, but I don’t understand why there are 4 xhpl processes.


In anticipation of adding more nodes, if I use…

mpirun --host node1 --map-by ppr:1:node -x OMP_NUM_THREADS=4 xhpl

top reports...

top - 07:56:27 up 23:52,  1 user,  load average: 1.00, 0.98, 0.68
Tasks: 133 total,   2 running, 131 sleeping,   0 stopped,   0 zombie
%Cpu(s): 25.1 us,  0.0 sy,  0.0 ni, 74.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :   3793.3 total,    454.2 free,   2794.5 used,    544.7 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.    939.9 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND  
                                                                          
   5770 john      20   0 2868700   2.5g   7668 R  99.7  68.7   5:20.37 xhpl     
                                                                          

… a single xhpl process (as expected), but with only 25% CPU utilisation and no 
other processes running on the other 3 cores. It would appear OpenBLAS is not 
utilising the 4 cores as expected.


If I then scale it to 2 nodes, with P=1 and Q=2 in HPL.dat...

mpirun --host node1,node2 --map-by ppr:1:node -x OMP_NUM_THREADS=4 xhpl

… similarly, I get a single process on each node, with only 25% CPU utilisation.


Any advice/suggestions on how to involve mpirun in a hybrid OpenMPI/OpenMP 
setup would be appreciated.

Kind regards






Re: [OMPI users] Running with Intel Omni-Path

2020-08-01 Thread Ralph Castain via users
Add "--mca pml cm" to your cmd line


On Jul 31, 2020, at 9:54 PM, Supun Kamburugamuve via users <users@lists.open-mpi.org> wrote:

Hi all,

I'm trying to set up OpenMPI on a cluster with the Omni-Path network. When I try 
the following command, it gives an error.

mpirun -n 2 --hostfile nodes --mca mtl psm2 ./osu_bw
# OSU MPI Bandwidth Test v5.6.3
# Size      Bandwidth (MB/s)
--
WARNING: Open MPI failed to TCP connect to a peer MPI process.  This
should not happen.

Your Open MPI job may now hang or fail.

  Local host: t-001
  PID:        63725
  Message:    connect() to IP:1024 failed
  Error:      Operation now in progress (115)
--

I don't want it to use the network it tries to connect to. Any pointers to 
solve this?

Best,
-- 
Supun Kamburugamuve, PhD
Digital Science Center, Indiana University
Member, Apache Software Foundation; http://www.apache.org
E-mail: supun@apache.org; Mobile: +1 812 219 2563





Re: [OMPI users] Moving an installation

2020-07-24 Thread Ralph Castain via users
While possible, it is highly unlikely that your desktop version is going to be 
binary compatible with your cluster...
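For the relocation part of the question, Open MPI looks at the OPAL_PREFIX 
environment variable to find a moved installation tree; a sketch, assuming the 
tree was copied to /new/path/openmpi-4.0.4:

 export OPAL_PREFIX=/new/path/openmpi-4.0.4
 export PATH=$OPAL_PREFIX/bin:$PATH
 export LD_LIBRARY_PATH=$OPAL_PREFIX/lib:$LD_LIBRARY_PATH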

On Jul 24, 2020, at 9:55 AM, Lana Deere via users <users@lists.open-mpi.org> wrote:

I have open-mpi 4.0.4 installed on my desktop and my small test programs are 
working.

I would like to migrate the open-mpi to a cluster and run a larger program 
there.  When moved, the open-mpi installation is in a different pathname than 
it was on my desktop and it doesn't seem to work any longer.  I can make the 
libraries visible via LD_LIBRARY_PATH but this seems insufficient.  Is there an 
environment variable which can be used to tell the open-mpi where it is 
installed?

Is it mandatory to actually compile the release in the ultimate destination on 
each system where it will be used?

Thanks.

.. Lana (lana.de...@gmail.com  )





Re: [OMPI users] Any reason why I can't start an mpirun job from within an mpi process?

2020-07-11 Thread Ralph Castain via users
You cannot cascade mpirun cmds like that - the child mpirun picks up envars 
that cause it to break. You'd have to either use comm_spawn to start the child 
job, or do a fork/exec where you can set the environment to some pristine set 
of values.
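For the fork/exec route, the command string passed to system() could, for 
instance, scrub the inherited environment (a sketch; env -i drops everything, so 
PATH and any LD_LIBRARY_PATH the child mpirun itself needs must be re-added 
explicitly):

 env -i PATH=/usr/bin:/bin /...path.../mpirun -np 2 /...path.../prog2 args > file.log 2>&1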


> On Jul 11, 2020, at 1:12 PM, John Retterer via users 
>  wrote:
> 
> From the rank #0 process, I wish to start another mpi job (to create some 
> data to be used by the parent job).
> I'm trying to do this with the command
> 
>istat= system( "/...path.../mpirun -np 2 /...path.../prog2 args >file.log 
> 2>&1" )
> 
> within the code executed by the rank #0 process, where ...path...  is the 
> path to the executables, prog2 is the daughter program, and args its 
> arguments.
> 
> On return, the status istat = 256, and, although the log file file.log has 
> been created, it is empty.
> 
> Does anybody have an idea why this doesn't work?




Re: [OMPI users] slot number calculation when no config files?

2020-06-08 Thread Ralph Castain via users
Note that you can also resolve it by adding --use-hwthread-cpus to your cmd 
line - it instructs mpirun to treat the HWTs as independent cpus so you would 
have 4 slots in this case.
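For example (the application is a placeholder):

 mpirun --allow-run-as-root --use-hwthread-cpus -np 3 <application>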


> On Jun 8, 2020, at 11:28 AM, Collin Strassburger via users 
>  wrote:
> 
> Hello David,
> 
> The slot calculation is based on physical cores rather than logical cores.
> The 4 CPUs you are seeing there are logical CPUs. Since your processor has
> 2 threads per core, you have two physical cores, yielding a total of 4
> logical cores (which is what lscpu reports). On machines with only 1 thread
> per core, the cpu count is the number of physical cores.
> 
> Thanks,
> Collin
> 
> 
> -Original Message-
> From: users  On Behalf Of David Mathog via 
> users
> Sent: Monday, June 8, 2020 2:19 PM
> To: users@lists.open-mpi.org
> Cc: David Mathog 
> Subject: [OMPI users] slot number calculation when no config files?
> 
> Using OpenMPI 4.0.1 and no configuration files of any kind, on a Linux machine 
> which shows 4 CPUs and 2 threads per CPU. When this is run:
> 
> mpirun --allow-run-as-root --oversubscribe -np 3 
> /usr/common/modules/el8/x86_64/software/q6/6.0.1-CentOS-vanilla/bin/qdynp
> eq2.inp
> 
> it works and the expected 3 processes run.  However if  oversubscribe is 
> omitted then this happens:
> 
> There are not enough slots available in the system to satisfy the 3 slots 
> that were requested by the application:
> 
> It will run with 2 though even without -oversubscribe.  It looks like it is 
> using CPUs/2 to calculate the slot limit and ignoring threads.
> Where is this slot calculation documented?
> 
> More details:
> 
> 
> $ cat /etc/centos-release
> CentOS Linux release 8.1.1911 (Core)
> $ mpirun --version
> mpirun (Open MPI) 4.0.1
> $ lscpu | head -6
> Architecture:x86_64
> CPU op-mode(s):  32-bit, 64-bit
> Byte Order:  Little Endian
> CPU(s):  4
> On-line CPU(s) list: 0-3
> Thread(s) per core:  2
> $ gcc --version
> gcc (GCC) 8.3.1 20190507 (Red Hat 8.3.1-4)
> 
> Thanks,
> 
> David Mathog




Re: [OMPI users] Running mpirun with grid

2020-06-01 Thread Ralph Castain via users
Afraid I have no real ideas here. The best I can suggest is taking the qrsh cmd 
line from the prior debug output and trying to run it manually. This might give 
you a chance to manipulate it and see if you can identify what is causing the 
issue, if anything. Without mpirun executing, the daemons will bark about being 
unable to connect back, so you might need to use some other test program for 
this purpose.
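A sketch of such a manual test, reusing the qrsh prefix from the debug output in 
this thread and substituting hostname for the orted (the target host is a 
placeholder):

 /grid2/sge/bin/lx-amd64/qrsh -inherit -nostdin -V -verbose cod4.org.com hostname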

I agree with Jeff - you should check to see where these messages are coming 
from:


>> Server daemon successfully started with task id "1.cod4"
>> Server daemon successfully started with task id "1.cod5"
>> Server daemon successfully started with task id "1.cod6"
>> Server daemon successfully started with task id "1.has6"
>> Server daemon successfully started with task id "1.hpb12"
>> Server daemon successfully started with task id "1.has4"
> 
>> Unmatched ".
>> Unmatched ".
>> Unmatched ".
> 


Could be a clue as to what is actually happening.


> On Jun 1, 2020, at 1:57 PM, Kulshrestha, Vipul via users 
>  wrote:
> 
> Thank Jeff & Ralph for your responses.
> 
> I tried changing the verbose level to 5 using the option suggested by Ralph, 
> but there was no difference in the output (so no additional information in 
> the output).
> 
> I also tried to replace the grid submission script to a command line qsub job 
> submission, but got the same issue. Removing the use of job submission 
> script, the qsub command looks like below. This uses mpirun option "--N 1" to 
> ensure that only 1 process is launched by mpirun on one host.
> 
> Do you have some suggestion on how I can go about investigating the root 
> cause of the problem I am facing? I am able to run mpirun successfully, if I 
> specify the same set of hosts (as allocated by grid) using mpirun host file. 
> I have also pasted the verbose output with host file and the orted command 
> looks very similar to the one generated for grid submission (except that it 
> uses /usr/bin/ssh instead of /grid2/sge/bin/lx-amd64/qrsh.
> 
> Thanks,
> Vipul
> 
> 
> qsub -N velsyn -pe orte2 10 -V -b y -cwd -j y -o $cwd/a -l "os=redhat6.7*" -q 
> all /build/openmpi/openmpi-4.0.1/rhel6/bin/mpirun --N 1  -x 
> LD_LIBRARY_PATH=/build/openmpi/openmpi-4.0.1/rhel6/lib -x PATH=$PATH 
> --merge-stderr-to-stdout --output-filename 
> ./veloce.log/velsyn/dvelsyn:nojobid,nocopy -np 5 --mca 
> orte_base_help_aggregate 0 --mca plm_base_verbose 5 --mca 
> plm_rsh_no_tree_spawn 1 
> 
> 
> $ /build/openmpi/openmpi-4.0.1/rhel6/bin/mpirun --hostfile host.txt -x 
> VMW_HOME=$VMW_HOME -x VMW_BIN=$VMW_BIN -x 
> LD_LIBRARY_PATH=/build/openmpi/openmpi-4.0.1/rhel6/lib -x PATH=$PATH 
> --merge-stderr-to-stdout --output-filename 
> ./veloce.log/velsyn/dvelsyn:nojobid,nocopy -np 5 --mca 
> orte_base_help_aggregate 0 --mca plm_base_verbose 5 --mca 
> plm_rsh_no_tree_spawn 1 
> 
> [sox3:24416] [[26562,0],0] plm:rsh: final template argv:
>/usr/bin/ssh  set path = ( 
> /build/openmpi/openmpi-4.0.1/rhel6/bin $path ) ; if ( $?LD_LIBRARY_PATH == 1 
> ) set OMPI_have_llp ; if ( $?LD_LIBRARY_PATH == 0 ) setenv LD_LIBRARY_PATH 
> /build/openmpi/openmpi-4.0.1/rhel6/lib ; if ( $?OMPI_have_llp == 1 ) setenv 
> LD_LIBRARY_PATH /build/openmpi/openmpi-4.0.1/rhel6/lib:$LD_LIBRARY_PATH ; if 
> ( $?DYLD_LIBRARY_PATH == 1 ) set OMPI_have_dllp ; if ( $?DYLD_LIBRARY_PATH == 
> 0 ) setenv DYLD_LIBRARY_PATH /build/openmpi/openmpi-4.0.1/rhel6/lib ; if ( 
> $?OMPI_have_dllp == 1 ) setenv DYLD_LIBRARY_PATH 
> /build/openmpi/openmpi-4.0.1/rhel6/lib:$DYLD_LIBRARY_PATH ;   
> /build/openmpi/openmpi-4.0.1/rhel6/bin/orted -mca ess "env" -mca 
> ess_base_jobid "1740767232" -mca ess_base_vpid "" -mca 
> ess_base_num_procs "6" -mca orte_node_regex 
> "sox[1:3],bos[1:3],bos[2:15],bos[1:9],bos[2:12],bos[1:7]@0(6)" -mca 
> orte_hnp_uri "1740767232.0;tcp://147.34.216.21:54496" --mca 
> orte_base_help_aggregate "0" --mca plm_base_verbose "5" --mca 
> plm_rsh_no_tree_spawn "1" -mca plm "rsh" -mca orte_output_filename 
> "./veloce.log/velsyn/dvelsyn:nojobid,nocopy" -mca pmix "^s1,s2,cray,isolated" 
>  
> [sox3:24416] [[26562,0],0] complete_setup on job [26562,1]
> [sox3:24416] [[26562,0],0] plm:base:receive update proc state command from 
> [[26562,0],5]
> [sox3:24416] [[26562,0],0] plm:base:receive got update_proc_state for job 
> [26562,1]
> [sox3:24416] [[26562,0],0] plm:base:receive update proc state command from 
> [[26562,0],4]
> [sox3:24416] [[26562,0],0] plm:base:receive got update_proc_state for job 
> [26562,1]
> [sox3:24416] [[26562,0],0] plm:base:receive update proc state command from 
> [[26562,0],1]
> [sox3:24416] [[26562,0],0] plm:base:receive got update_proc_state for job 
> [26562,1]
> [sox3:24416] [[26562,0],0] plm:base:receive update proc state command from 
> [[26562,0],2]
> [sox3:24416] [[26562,0],0] plm:base:receive got update_proc_state for job 
> [26562,1]
> [sox3:24416] [[26562,0],0] plm:base:receive update proc state command from 
> [[26562,0],3]
> [sox3:24416] [[2

Re: [OMPI users] Running mpirun with grid

2020-06-01 Thread Ralph Castain via users
Afraid I don't have much to offer. I suspect the problem is here:

> Unmatched ".
> Unmatched ".
> Unmatched ".

Something may be eating a quote, or mistakenly adding one, to the cmd line. You 
might try upping the verbosity: --mca plm_base_verbose 5



> On May 31, 2020, at 2:49 PM, Kulshrestha, Vipul 
>  wrote:
> 
> Hi Ralph,
> 
> Thanks for your response.
> 
> I added the option "--mca plm_rsh_no_tree_spawn 1" to mpirun command line, 
> but I get a similar error. (pasted below).
> 
> Regards,
> Vipul
> 
> Got 14 slots.
> tmpdir is /tmp/194954128.1.all.q
> pe_hostfile is /var/spool/sge/has2/active_jobs/194954128.1/pe_hostfile
> has2.org.com 2 al...@has2.org.com 
> has6.org.com 2 al...@has6.org.com 
> cod4.org.com 2 al...@cod4.org.com 
> cod6.org.com 2 al...@cod6.org.com 
> cod5.org.com 2 al...@cod5.org.com 
> hpb12.org.com 2 al...@hpb12.org.com 
> has4.org.com 2 al...@has4.org.com 
> [has2:08703] [[24953,0],0] plm:rsh: using "/grid2/sge/bin/lx-amd64/qrsh 
> -inherit -nostdin -V -verbose" for launching
> [has2:08703] [[24953,0],0] plm:rsh: final template argv:
>/grid2/sge/bin/lx-amd64/qrsh -inherit -nostdin -V -verbose   
>set path = ( /build/openmpi/openmpi-4.0.1/rhel6/bin $path ) ; if ( 
> $?LD_LIBRARY_PATH == 1 ) set OMPI_have_llp ; if ( $?LD_LIBRARY_PATH == 0 ) 
> setenv LD_LIBRARY_PATH /build/openmpi/openmpi-4.0.1/rhel6/lib ; if ( 
> $?OMPI_have_llp == 1 ) setenv LD_LIBRARY_PATH 
> /build/openmpi/openmpi-4.0.1/rhel6/lib:$LD_LIBRARY_PATH ; if ( 
> $?DYLD_LIBRARY_PATH == 1 ) set OMPI_have_dllp ; if ( $?DYLD_LIBRARY_PATH == 0 
> ) setenv DYLD_LIBRARY_PATH /build/openmpi/openmpi-4.0.1/rhel6/lib ; if ( 
> $?OMPI_have_dllp == 1 ) setenv DYLD_LIBRARY_PATH 
> /build/openmpi/openmpi-4.0.1/rhel6/lib:$DYLD_LIBRARY_PATH ;   
> /build/openmpi/openmpi-4.0.1/rhel6/bin/orted -mca ess "env" -mca 
> ess_base_jobid "1635319808" -mca ess_base_vpid "" -mca 
> ess_base_num_procs "7" -mca orte_node_regex 
> "has[1:2,6],cod[1:4,6,5],hpb[2:12],has[1:4]@0(7)" -mca orte_hnp_uri 
> "1635319808.0;tcp://139.181.79.58:57879" --mca routed "direct" --mca 
> orte_base_help_aggregate "0" --mca plm_base_verbose "1" --mca 
> plm_rsh_no_tree_spawn "1" -mca plm "rsh" -mca orte_output_filename 
> "./veloce.log/velsyn/dvelsyn:nojobid,nocopy" -mca hwloc_base_binding_policy 
> "none" -mca pmix "^s1,s2,cray,isolated"
> Starting server daemon at host "cod5"Starting server daemon at host 
> "cod6"Starting server daemon at host "has4"Starting server daemon at host "co
> d4"
> 
> 
> 
> Starting server daemon at host "hpb12"Starting server daemon at host "has6"
> 
> Server daemon successfully started with task id "1.cod4"
> Server daemon successfully started with task id "1.cod5"
> Server daemon successfully started with task id "1.cod6"
> Server daemon successfully started with task id "1.has6"
> Server daemon successfully started with task id "1.hpb12"
> Server daemon successfully started with task id "1.has4"
> Unmatched ".
> Unmatched ".
> Unmatched ".
> --
> ORTE was unable to reliably start one or more daemons.
> This usually is caused by:
> 
> * not finding the required libraries and/or binaries on
>  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
>  settings, or configure OMPI with --enable-orterun-prefix-by-default
> 
> * lack of authority to execute on one or more specified nodes.
>  Please verify your allocation and authorities.
> 
> * the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
>  Please check with your sys admin to determine the correct location to use.
> 
> *  compilation of the orted with dynamic libraries when static are required
>  (e.g., on Cray). Please check your configure cmd line and consider using
>  one of the contrib/platform definitions for your system type.
> 
> * an inability to create a connection back to mpirun due to a
>  lack of common network interfaces and/or no route found between
>  them. Please check network connectivity (including firewalls
>  and network routing requirements).
> --
> --
> 
> 
> 
> 
> 
> 
> -Original Message-
> From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Ralph 
> Castain via users

Re: [OMPI users] Running mpirun with grid

2020-05-31 Thread Ralph Castain via users
The messages about the daemons are coming from two different sources. Grid is 
saying it was able to spawn the orted - then the orted is saying it doesn't 
know how to communicate and fails.

I think the root of the problem lies in the plm output that shows the qrsh it 
will use to start the job. For some reason, mpirun is still trying to "tree 
spawn", which (IIRC) isn't allowed on grid (all the daemons have to be launched 
in one shot by mpirun using qrsh). Try adding "--mca plm_rsh_no_tree_spawn 1" 
to your mpirun cmd line.


>> 
>> 
>> On Sat, 30 May 2020 at 00:41, Kulshrestha, Vipul via users 
>>  wrote:
>>> 
>>> Hi,
>>> 
>>> 
>>> 
>>> I need to launch my openmpi application on grid. My application is designed 
>>> to run N processes, where each process would have M threads. I am using 
>>> open MPI version 4.0.1
>>> 
>>> 
>>> 
>>> % /build/openmpi/openmpi-4.0.1/rhel6/bin/ompi_info | grep grid
>>> 
>>> MCA ras: gridengine (MCA v2.1.0, API v2.0.0, Component 
>>> v4.0.1)
>>> 
>>> 
>>> 
>>> To run it without grid, I run it as (say N = 7, M = 2)
>>> 
>>> % mpirun –np 7 
>>> 
>>> 
>>> 
>>> The above works well and runs N processes. Based on some earlier advice on 
>>> this forum, I have set up the grid submission using a grid job submission 
>>> script that modifies the grid slot allocation, so that mpirun launches only 
>>> 1 application process copy on each host allocated by grid. I have had some 
>>> partial success. I think grid is able to start the job and then mpirun also 
>>> starts to run, but then it errors out with the errors mentioned below. 
>>> Strangely, after reporting that all the daemons were started, it reports 
>>> that it was not able to start one or more daemons.
>>> 
>>> 
>>> 
>>> I have set up a grid submission script that modifies the pe_hostfile, and it 
>>> appears that mpirun is able to take it and then use the host information to 
>>> start launching the jobs. However, mpirun halts before it can start all the 
>>> child processes. I enabled some debug logs but am not able to figure out a 
>>> possible cause.
>>> 
>>> 
>>> 
>>> Could somebody look at this and guide how to resolve this issue?
>>> 
>>> 
>>> 
>>> I have pasted the detailed log as well as my job submission script below.
>>> 
>>> 
>>> 
>>> As a clarification, when I run the mpirun without grid, it (mpirun and my 
>>> application) works on the same set of hosts without any problems.
>>> 
>>> 
>>> 
>>> Thanks,
>>> 
>>> Vipul
>>> 
>>> 
>>> 
>>> Job submission script:
>>> 
>>> #!/bin/sh
>>> 
>>> #$ -N velsyn
>>> 
>>> #$ -pe orte2 14
>>> 
>>> #$ -V -cwd -j y
>>> 
>>> #$ -o out.txt
>>> 
>>> #
>>> 
>>> echo "Got $NSLOTS slots."
>>> 
>>> echo "tmpdir is $TMPDIR"
>>> 
>>> echo "pe_hostfile is $PE_HOSTFILE"
>>> 
>>> 
>>> 
>>> 
>>> 
>>> cat $PE_HOSTFILE
>>> 
>>> newhostfile=/testdir/tmp/pe_hostfile
>>> 
>>> 
>>> 
>>> awk '{$2 = $2/2; print}' $PE_HOSTFILE > $newhostfile
>>> 
>>> 
>>> 
>>> export PE_HOSTFILE=$newhostfile
>>> 
>>> export LD_LIBRARY_PATH=/build/openmpi/openmpi-4.0.1/rhel6/lib
>>> 
>>> 
>>> 
>>> mpirun --merge-stderr-to-stdout --output-filename ./output:nojobid,nocopy 
>>> --mca routed direct --mca orte_base_help_aggregate 0 --mca plm_base_verbose 
>>> 1 --bind-to none --report-bindings -np 7 
>>> 
>>> 
>>> 
>>> The out.txt content is:
>>> 
>>> Got 14 slots.
>>> 
>>> tmpdir is /tmp/182117160.1.all.q
>>> 
>>> pe_hostfile is /var/spool/sge/bos2/active_jobs/182117160.1/pe_hostfile
>>> 
>>> bos2.wv.org.com 2 al...@bos2.wv.org.com  art8.wv.org.com 2 
>>> al...@art8.wv.org.com  art10.wv.org.com 2 al...@art10.wv.org.com 
>>>  hpb7.wv.org.com 2 al...@hpb7.wv.org.com  bos15.wv.org.com 2 
>>> al...@bos15.wv.org.com  bos1.wv.org.com 2 al...@bos1.wv.org.com 
>>>  hpb11.wv.org.com 2 al...@hpb11.wv.org.com  [bos2:22657] 
>>> [[8251,0],0] plm:rsh: using "/wv/grid2/sge/bin/lx-amd64/qrsh -inherit 
>>> -nostdin -V -verbose" for launching [bos2:22657] [[8251,0],0] plm:rsh: 
>>> final template argv:
>>> 
>>>  /grid2/sge/bin/lx-amd64/qrsh -inherit -nostdin -V -verbose  
>>> set path = ( /build/openm
>>> 
>>> pi/openmpi-4.0.1/rhel6/bin $path ) ; if ( $?LD_LIBRARY_PATH == 1 ) set 
>>> OMPI_have_llp ; if ( $?LD_LIBR ARY_PATH == 0 ) setenv LD_LIBRARY_PATH 
>>> /build/openmpi/openmpi-4.0.1/rhel6/lib ; if ( $?OMPI_have_llp == 1 ) setenv 
>>> LD_LIBRARY_PATH /build/openmpi/openmpi-4.0.1/rhel6/lib:$LD_LIBRARY_PATH ; 
>>> if ( $?DYLD_L IBRARY_PATH == 1 ) set OMPI_have_dllp ; if ( 
>>> $?DYLD_LIBRARY_PATH == 0 ) setenv DYLD_LIBRARY_PATH /bui 
>>> ld/openmpi/openmpi-4.0.1/rhel6/lib ; if ( $?OMPI_have_dllp == 1 ) setenv 
>>> DYLD_LIBRARY_PATH /build/ope
>>> 
>>> nmpi/openmpi-4.0.1/rhel6/lib:$DYLD_LIBRARY_PATH ;   
>>> /build/openmpi/openmpi-4.0.1/rhel6/bin/orted -mca
>>> 
>>> orte_report_bindings "1" -mca ess "env" -mca ess_base_jobid "540737536" 
>>> -mca ess_base_vpid ">> 
>>> e>" -mca ess_base_num_procs "7" -mca orte_node_regex
>>> 
>>> e>"bos[1:2],art[1:8],art[2:10],hpb[1:7],bos[2:15],
>>> 
>>> bos[1:1],hpb[2:11]@0(7)" 

Re: [OMPI users] I can't build openmpi 4.0.X using PMIx 3.1.5 to use with Slurm

2020-05-12 Thread Ralph Castain via users
Try adding --without-psm2 to the PMIx configure line - sounds like you have 
that library installed on your machine, even though you don't have omnipath.


On May 12, 2020, at 4:42 AM, Leandro via users <users@lists.open-mpi.org> wrote:

Hi,

I compile it statically to make sure the compiler's libraries will not be a 
dependency; I have done it this way for years. The developers said they want 
it this way, so I did.

I saw this warning, and this library is related to Omni-Path, which we don't 
have.

---
Leandro


On Tue, May 12, 2020 at 8:27 AM Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
It looks like you are building both static and dynamic libraries 
(--enable-static and --enable-shared).  This might be confusing the issue -- I 
can see at least one warning:

icc: warning #10237: -lcilkrts linked in dynamically, static library not 
available

It's not easy to tell from the snippets you sent what other downstream side 
effects this might have.

Is there a reason to compile statically?  It generally leads to (much) bigger 
executables, and far less memory efficiency (i.e., the library is not shared in 
memory between all the MPI processes running on each node).  Also, the link 
phase of compilers tends to prefer shared libraries, so unless your apps are 
compiled/linked with whatever the compiler's "link this statically" flags are, 
it's going to likely default to using the shared libraries.

This is a long way of saying: try building everything with just --enable-shared 
(and not --enable-static).  Or possibly just remove both flags; --enable-shared 
is the default.




On May 11, 2020, at 9:23 AM, Leandro via users <users@lists.open-mpi.org> wrote:

Hi,

I'm trying to start using Slurm, and I followed all the instructions to build 
PMIx and Slurm with pmix, but I can't make openmpi work.

According to the PMIx documentation, I should compile openmpi using 
"--with-ompi-pmix-rte", but when I tried, it fails. I need to build this as 
CentOS rpms.

Thanks in advance for your help. I pasted some info below.

libtool: link: 
/tgdesenv/dist/compiladores/intel/compilers_and_libraries_2019.5.281/linux/bin/intel64/icc
 -std=gnu99 -std=gnu99 -DOPAL_CONFIGURE_USER=\"root\" 
-DOPAL_CONFIGURE_HOST=\"gr10b17n05\" "-DOPAL_CONFIGURE_DATE=\"Fri May  8 
13:35:51 -03 2020\"" -DOMPI_BUILD_USER=\"root\" 
-DOMPI_BUILD_HOST=\"gr10b17n05\" "-DOMPI_BUILD_DATE=\"Fri May  8 13:47:32 -03 
2020\"" "-DOMPI_BUILD_CFLAGS=\"-DNDEBUG -O3 -finline-functions 
-fno-strict-aliasing -restrict -Qoption,cpp,--extended_float_types -pthread\"" 
"-DOMPI_BUILD_CPPFLAGS=\"-I../../.. -I../../../orte/include    \"" 
"-DOMPI_BUILD_CXXFLAGS=\"-DNDEBUG -O3 -finline-functions -pthread\"" 
"-DOMPI_BUILD_CXXCPPFLAGS=\"-I../../..  \"" "-DOMPI_BUILD_FFLAGS=\"-O2 -g -pipe 
-Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong 
--param=ssp-buffer-size=4 -grecord-gcc-switches   -m64 -mtune=generic 
-I/usr/lib64/gfortran/modules\"" -DOMPI_BUILD_FCFLAGS=\"-O3\" 
"-DOMPI_BUILD_LDFLAGS=\"-Wc,-static-intel -static-intel    -L/usr/lib64\"" 
"-DOMPI_BUILD_LIBS=\"-lrt -lutil  -lz  -lhwloc  -levent -levent_pthreads\"" 
-DOPAL_CC_ABSOLUTE=\"\" -DOMPI_CXX_ABSOLUTE=\"none\" -DNDEBUG -O3 
-finline-functions -fno-strict-aliasing -restrict 
-Qoption,cpp,--extended_float_types -pthread -static-intel -static-intel -o 
.libs/ompi_info ompi_info.o param.o  -L/usr/lib64 ../../../ompi/.libs/libmpi.so 
-L/usr/lib -llustreapi 
/root/rpmbuild/BUILD/openmpi-4.0.2/opal/.libs/libopen-pal.so 
../../../opal/.libs/libopen-pal.so -lfabric -lucp -lucm -lucs -luct -lrdmacm 
-libverbs /usr/lib64/libpmix.so -lmunge -lrt -lutil -lz /usr/lib64/libhwloc.so 
-lm -ludev -lltdl -levent -levent_pthreads -pthread -Wl,-rpath -Wl,/usr/lib64
icc: warning #10237: -lcilkrts linked in dynamically, static library not 
available
../../../ompi/.libs/libmpi.so: undefined reference to `orte_process_info'
../../../ompi/.libs/libmpi.so: undefined reference to `orte_show_help'

make[2]: *** [ompi_info] Error 1
make[2]: Leaving directory 
`/root/rpmbuild/BUILD/openmpi-4.0.2/ompi/tools/ompi_info'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/root/rpmbuild/BUILD/openmpi-4.0.2/ompi'
make: *** [all-recursive] Error 1
error: Bad exit status from /var/tmp/rpm-tmp.RyklCR (%build)

The orte libraries are missing. When I don't use "--with-ompi-pmix-rte" it 
builds, but neither mpirun nor srun works:

c315@gr10b17n05 /bw1nfs1/Projetos1/c315/Meus_testes > cat machine_file 
gr10b17n05
gr10b17n06
gr10b17n07
gr10b17n08
c315@gr10b17n05 /bw1nfs1/Projetos1/c315/Meus_testes > mpirun -machinefile 
machine_file ./mpihello 
[gr10b17n07:115065] [[21391,0],2] ORTE_ERROR_LOG: Not found in file 
base/ess_base_std_orted.c at line 362
--
It looks like orte_init failed for some reason; your parallel process is

likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to c

Re: [OMPI users] I can't build openmpi 4.0.X using PMIx 3.1.5 to use with Slurm

2020-05-11 Thread Ralph Castain via users
I'm not sure I understand why you are trying to build CentOS rpms for PMIx, 
Slurm, or OMPI - all three are readily available online. Is there some 
particular reason you are trying to do this yourself? I ask because it is 
non-trivial to do and requires significant familiarity with both the 
intricacies of rpm building and the packages involved.


On May 11, 2020, at 6:23 AM, Leandro via users <users@lists.open-mpi.org> wrote:

Hi,

I'm trying to start using Slurm, and I followed all the instructions to build 
PMIx and Slurm with pmix, but I can't make openmpi work.

According to the PMIx documentation, I should compile openmpi using 
"--with-ompi-pmix-rte", but when I tried, it fails. I need to build this as 
CentOS rpms.

Thanks in advance for your help. I pasted some info below.

libtool: link: 
/tgdesenv/dist/compiladores/intel/compilers_and_libraries_2019.5.281/linux/bin/intel64/icc
 -std=gnu99 -std=gnu99 -DOPAL_CONFIGURE_USER=\"root\" 
-DOPAL_CONFIGURE_HOST=\"gr10b17n05\" "-DOPAL_CONFIGURE_DATE=\"Fri May  8 
13:35:51 -03 2020\"" -DOMPI_BUILD_USER=\"root\" 
-DOMPI_BUILD_HOST=\"gr10b17n05\" "-DOMPI_BUILD_DATE=\"Fri May  8 13:47:32 -03 
2020\"" "-DOMPI_BUILD_CFLAGS=\"-DNDEBUG -O3 -finline-functions 
-fno-strict-aliasing -restrict -Qoption,cpp,--extended_float_types -pthread\"" 
"-DOMPI_BUILD_CPPFLAGS=\"-I../../.. -I../../../orte/include    \"" 
"-DOMPI_BUILD_CXXFLAGS=\"-DNDEBUG -O3 -finline-functions -pthread\"" 
"-DOMPI_BUILD_CXXCPPFLAGS=\"-I../../..  \"" "-DOMPI_BUILD_FFLAGS=\"-O2 -g -pipe 
-Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong 
--param=ssp-buffer-size=4 -grecord-gcc-switches   -m64 -mtune=generic 
-I/usr/lib64/gfortran/modules\"" -DOMPI_BUILD_FCFLAGS=\"-O3\" 
"-DOMPI_BUILD_LDFLAGS=\"-Wc,-static-intel -static-intel    -L/usr/lib64\"" 
"-DOMPI_BUILD_LIBS=\"-lrt -lutil  -lz  -lhwloc  -levent -levent_pthreads\"" 
-DOPAL_CC_ABSOLUTE=\"\" -DOMPI_CXX_ABSOLUTE=\"none\" -DNDEBUG -O3 
-finline-functions -fno-strict-aliasing -restrict 
-Qoption,cpp,--extended_float_types -pthread -static-intel -static-intel -o 
.libs/ompi_info ompi_info.o param.o  -L/usr/lib64 ../../../ompi/.libs/libmpi.so 
-L/usr/lib -llustreapi 
/root/rpmbuild/BUILD/openmpi-4.0.2/opal/.libs/libopen-pal.so 
../../../opal/.libs/libopen-pal.so -lfabric -lucp -lucm -lucs -luct -lrdmacm 
-libverbs /usr/lib64/libpmix.so -lmunge -lrt -lutil -lz /usr/lib64/libhwloc.so 
-lm -ludev -lltdl -levent -levent_pthreads -pthread -Wl,-rpath -Wl,/usr/lib64
icc: warning #10237: -lcilkrts linked in dynamically, static library not 
available
../../../ompi/.libs/libmpi.so: undefined reference to `orte_process_info'
../../../ompi/.libs/libmpi.so: undefined reference to `orte_show_help'

make[2]: *** [ompi_info] Error 1
make[2]: Leaving directory 
`/root/rpmbuild/BUILD/openmpi-4.0.2/ompi/tools/ompi_info'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/root/rpmbuild/BUILD/openmpi-4.0.2/ompi'
make: *** [all-recursive] Error 1
error: Bad exit status from /var/tmp/rpm-tmp.RyklCR (%build)

The orte libraries are missing. When I don't use "--with-ompi-pmix-rte" it 
builds, but neither mpirun nor srun works:

c315@gr10b17n05 /bw1nfs1/Projetos1/c315/Meus_testes > cat machine_file 
gr10b17n05
gr10b17n06
gr10b17n07
gr10b17n08
c315@gr10b17n05 /bw1nfs1/Projetos1/c315/Meus_testes > mpirun -machinefile 
machine_file ./mpihello 
[gr10b17n07:115065] [[21391,0],2] ORTE_ERROR_LOG: Not found in file 
base/ess_base_std_orted.c at line 362
--
It looks like orte_init failed for some reason; your parallel process is

likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  opal_pmix_base_select failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--
--
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common ne

Re: [OMPI users] can't open /dev/ipath, network down (err=26)

2020-05-08 Thread Ralph Castain via users
I fear those cards are past end-of-life so far as support is concerned. I'm not 
sure if anyone can really advise you on them. It sounds like the fabric is 
experiencing failures, but that's just a guess.


On May 8, 2020, at 12:56 PM, Prentice Bisbal via users <users@lists.open-mpi.org> wrote:

 

We often get the following errors when more than one job runs on the same 
compute node. We are using Slurm with OpenMPI. The IB cards are QLogic using 
PSM: 
 

10698ipath_userinit: assign_context command failed: Network is down
 node01.10698can't open /dev/ipath, network down (err=26)
 node01.10703ipath_userinit: assign_context command failed: Network is down
 node01.10703can't open /dev/ipath, network down (err=26)
 node01.10701ipath_userinit: assign_context command failed: Network is down
 node01.10701can't open /dev/ipath, network down (err=26)
 node01.10700ipath_userinit: assign_context command failed: Network is down
 node01.10700can't open /dev/ipath, network down (err=26)
 node01.10697ipath_userinit: assign_context command failed: Network is down
 node01.10697can't open /dev/ipath, network down (err=26)
--
 PSM was unable to open an endpoint. Please make sure that the network link is
 active on the node and the hardware is functioning. 
 
 Error: Could not detect network connectivity
--

Any Ideas how to fix this? 
 

 

-- 
Prentice 

 



Re: [OMPI users] [External] Re: Can't start jobs with srun.

2020-05-06 Thread Ralph Castain via users
The following (from what you posted earlier):

$ srun --mpi=list
srun: MPI types are...
srun: none
srun: pmix_v3
srun: pmi2
srun: openmpi
srun: pmix

would indicate that Slurm was built against a PMIx v3.x release. Using OMPI 
v4.0.3 with pmix=internal should be just fine so long as you set --mpi=pmix_v3

I'm somewhat at a loss as to what might be wrong. Try adding 
"OMPI_MCA_pmix_base_verbose=5 PMIX_MCA_pmix_client_get_verbose=5" to your 
environment and see what it says. You also should build OMPI with 
--enable-debug to ensure you get all the available debug output.
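For example (a sketch; the srun arguments are placeholders):

 export OMPI_MCA_pmix_base_verbose=5
 export PMIX_MCA_pmix_client_get_verbose=5
 srun --mpi=pmix_v3 -n 2 ./hello_mpi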


On May 6, 2020, at 1:26 PM, Prentice Bisbal via users <users@lists.open-mpi.org> wrote:

No, and I fear that may be the problem. When we built OpenMPI, we did 
--with-pmix=internal. Not sure how Slurm was built, since my coworker built it. 

Prentice 

On 4/28/20 2:07 AM, Daniel Letai via users wrote:
I know it's not supposed to matter, but have you tried building both ompi and 
slurm against the same pmix? That is - first build pmix, then build slurm 
with pmix, and then ompi with both slurm and pmix=external?


On 23/04/2020 17:00, Prentice Bisbal via users wrote:

$ ompi_info | grep slurm 
  Configure command line: '--prefix=/usr/pppl/intel/2019-pkgs/openmpi-4.0.3' 
'--disable-silent-rules' '--enable-shared' '--with-pmix=internal' 
'--with-slurm' '--with-psm' 
 MCA ess: slurm (MCA v2.1.0, API v3.0.0, Component v4.0.3) 
 MCA plm: slurm (MCA v2.1.0, API v2.0.0, Component v4.0.3) 
 MCA ras: slurm (MCA v2.1.0, API v2.0.0, Component v4.0.3) 
  MCA schizo: slurm (MCA v2.1.0, API v1.0.0, Component v4.0.3) 

Any ideas what could be wrong? Do you need any additional information? 

Prentice 


-- 
Prentice Bisbal
Lead Software Engineer
Research Computing
Princeton Plasma Physics Laboratory
http://www.pppl.gov




Re: [OMPI users] Can't start jobs with srun.

2020-04-26 Thread Ralph Castain via users
It is entirely possible that the PMI2 support in OMPI v4 is broken - I doubt it 
is used or tested very much as pretty much everyone has moved to PMIx. In fact, 
we completely dropped PMI-1 and PMI-2 from OMPI v5 for that reason.

I would suggest building Slurm with PMIx v3.1.5 
(https://github.com/openpmix/openpmix/releases/tag/v3.1.5) as that is what OMPI 
v4 is using, and launching with "srun --mpi=pmix_v3"
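That is, something along these lines (node and task counts are placeholders):

 srun --mpi=pmix_v3 -N 2 -n 4 ./hello_mpi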


On Apr 26, 2020, at 10:07 AM, Patrick Bégou via users <users@lists.open-mpi.org> wrote:

I also have this problem on servers I'm benchmarking at DELL's lab with
OpenMPI-4.0.3. I've tried a new build of OpenMPI with "--with-pmi2". No
change.
Finally, my workaround in the slurm script was to launch my code with
mpirun. As mpirun was only finding one slot per node, I used
"--oversubscribe --bind-to core" and checked that every process was
bound to a separate core. It worked, but do not ask me why :-)

Patrick

Le 24/04/2020 à 20:28, Riebs, Andy via users a écrit :
Prentice, have you tried something trivial, like "srun -N3 hostname", to rule 
out non-OMPI problems?

Andy

-Original Message-
From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Prentice Bisbal via users
Sent: Friday, April 24, 2020 2:19 PM
To: Ralph Castain <r...@open-mpi.org>; Open MPI Users <users@lists.open-mpi.org>
Cc: Prentice Bisbal <pbis...@pppl.gov>
Subject: Re: [OMPI users] [External] Re: Can't start jobs with srun.

Okay. I've got Slurm built with pmix support:

$ srun --mpi=list
srun: MPI types are...
srun: none
srun: pmix_v3
srun: pmi2
srun: openmpi
srun: pmix

But now when I try to launch a job with srun, the job appears to be 
running but doesn't do anything - it just hangs in the running state. 
Any ideas what could be wrong, or how to debug this?

I'm also asking around on the Slurm mailing list.

Prentice

On 4/23/20 3:03 PM, Ralph Castain wrote:
You can trust the --mpi=list. The problem is likely that OMPI wasn't configured 
--with-pmi2


On Apr 23, 2020, at 11:59 AM, Prentice Bisbal via users <users@lists.open-mpi.org> wrote:

--mpi=list shows pmi2 and openmpi as valid values, but if I set --mpi= to 
either of them, my job still fails. Why is that? Can I not trust the output of 
--mpi=list?

Prentice

On 4/23/20 10:43 AM, Ralph Castain via users wrote:
No, but you do have to explicitly build OMPI with non-PMIx support if that is 
what you are going to use. In this case, you need to configure OMPI 
--with-pmi2=

You can leave off the path if Slurm (i.e., just "--with-pmi2") was installed in 
a standard location as we should find it there.
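
For example (just a sketch; the install prefix and the PMI location are placeholders, and the configure flag is the one named above):

 $ ./configure --prefix=/opt/openmpi-4.0.3 --with-slurm --with-pmi2=/usr
 $ make -j && make install
 $ srun --mpi=pmi2 -n 4 ./hello_mpi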


On Apr 23, 2020, at 7:39 AM, Prentice Bisbal via users <users@lists.open-mpi.org> wrote:

It looks like it was built with PMI2, but not PMIx:

$ srun --mpi=list
srun: MPI types are...
srun: none
srun: pmi2
srun: openmpi

I did launch the job with srun --mpi=pmi2 

Does OpenMPI 4 need PMIx specifically?


On 4/23/20 10:23 AM, Ralph Castain via users wrote:
Is Slurm built with PMIx support? Did you tell srun to use it?


On Apr 23, 2020, at 7:00 AM, Prentice Bisbal via users <users@lists.open-mpi.org> wrote:

I'm using OpenMPI 4.0.3 with Slurm 19.05.5  I'm testing the software with a 
very simple hello, world MPI program that I've used reliably for years. When I 
submit the job through slurm and use srun to launch the job, I get these errors:

*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,

***    and potentially your MPI job)
[dawson029.pppl.gov:26070] Local abort 
before MPI_INIT completed completed successfully, but am not able to aggregate 
error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,

***    and potentially your MPI job)
[dawson029.pppl.gov:26076] Local abort 
before MPI_INIT completed completed successfully, but am not able to aggregate 
error messages, and not able to guarantee that all other processes were killed!

If I run the same job, but use mpiexec or mpirun instead of srun, the jobs run 
just fine. I checked ompi_info to make sure OpenMPI was compiled with  Slurm 
support:

$ ompi_info | grep slurm
   Configure command line: '--prefix=/usr/pppl/intel/2019-pkgs/openmpi-4.0.3' 
'--disable-silent-rules' '--enable-shared' '--with-pmix=internal' 
'--with-slurm' '--with-psm'
  MCA ess: slurm (MCA v2.1.0, API v3.0.0, Component v4.0.3)
  MC

Re: [OMPI users] [External] Re: Can't start jobs with srun.

2020-04-23 Thread Ralph Castain via users
You can trust the --mpi=list. The problem is likely that OMPI wasn't configured 
--with-pmi2


> On Apr 23, 2020, at 11:59 AM, Prentice Bisbal via users 
>  wrote:
> 
> --mpi=list shows pmi2 and openmpi as valid values, but if I set --mpi= to 
> either of them, my job still fails. Why is that? Can I not trust the output 
> of --mpi=list?
> 
> Prentice
> 
> On 4/23/20 10:43 AM, Ralph Castain via users wrote:
>> No, but you do have to explicitly build OMPI with non-PMIx support if that 
>> is what you are going to use. In this case, you need to configure OMPI 
>> --with-pmi2=
>> 
>> You can leave off the path if Slurm (i.e., just "--with-pmi2") was installed 
>> in a standard location as we should find it there.
>> 
>> 
>>> On Apr 23, 2020, at 7:39 AM, Prentice Bisbal via users 
>>>  wrote:
>>> 
>>> It looks like it was built with PMI2, but not PMIx:
>>> 
>>> $ srun --mpi=list
>>> srun: MPI types are...
>>> srun: none
>>> srun: pmi2
>>> srun: openmpi
>>> 
>>> I did launch the job with srun --mpi=pmi2 
>>> 
>>> Does OpenMPI 4 need PMIx specifically?
>>> 
>>> 
>>> On 4/23/20 10:23 AM, Ralph Castain via users wrote:
>>>> Is Slurm built with PMIx support? Did you tell srun to use it?
>>>> 
>>>> 
>>>>> On Apr 23, 2020, at 7:00 AM, Prentice Bisbal via users 
>>>>>  wrote:
>>>>> 
>>>>> I'm using OpenMPI 4.0.3 with Slurm 19.05.5  I'm testing the software with 
>>>>> a very simple hello, world MPI program that I've used reliably for years. 
>>>>> When I submit the job through slurm and use srun to launch the job, I get 
>>>>> these errors:
>>>>> 
>>>>> *** An error occurred in MPI_Init
>>>>> *** on a NULL communicator
>>>>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>>> ***and potentially your MPI job)
>>>>> [dawson029.pppl.gov:26070] Local abort before MPI_INIT completed 
>>>>> completed successfully, but am not able to aggregate error messages, and 
>>>>> not able to guarantee that all other processes were killed!
>>>>> *** An error occurred in MPI_Init
>>>>> *** on a NULL communicator
>>>>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>>> ***and potentially your MPI job)
>>>>> [dawson029.pppl.gov:26076] Local abort before MPI_INIT completed 
>>>>> completed successfully, but am not able to aggregate error messages, and 
>>>>> not able to guarantee that all other processes were killed!
>>>>> 
>>>>> If I run the same job, but use mpiexec or mpirun instead of srun, the 
>>>>> jobs run just fine. I checked ompi_info to make sure OpenMPI was compiled 
>>>>> with  Slurm support:
>>>>> 
>>>>> $ ompi_info | grep slurm
>>>>>   Configure command line: 
>>>>> '--prefix=/usr/pppl/intel/2019-pkgs/openmpi-4.0.3' 
>>>>> '--disable-silent-rules' '--enable-shared' '--with-pmix=internal' 
>>>>> '--with-slurm' '--with-psm'
>>>>>  MCA ess: slurm (MCA v2.1.0, API v3.0.0, Component v4.0.3)
>>>>>  MCA plm: slurm (MCA v2.1.0, API v2.0.0, Component v4.0.3)
>>>>>  MCA ras: slurm (MCA v2.1.0, API v2.0.0, Component v4.0.3)
>>>>>   MCA schizo: slurm (MCA v2.1.0, API v1.0.0, Component v4.0.3)
>>>>> 
>>>>> Any ideas what could be wrong? Do you need any additional information?
>>>>> 
>>>>> Prentice
>>>>> 
>> 




Re: [OMPI users] [External] Re: Can't start jobs with srun.

2020-04-23 Thread Ralph Castain via users
No, but you do have to explicitly build OMPI with non-PMIx support if that is 
what you are going to use. In this case, you need to configure OMPI 
--with-pmi2=

You can leave off the path if Slurm (i.e., just "--with-pmi2") was installed in 
a standard location as we should find it there.


> On Apr 23, 2020, at 7:39 AM, Prentice Bisbal via users 
>  wrote:
> 
> It looks like it was built with PMI2, but not PMIx:
> 
> $ srun --mpi=list
> srun: MPI types are...
> srun: none
> srun: pmi2
> srun: openmpi
> 
> I did launch the job with srun --mpi=pmi2 
> 
> Does OpenMPI 4 need PMIx specifically?
> 
> 
> On 4/23/20 10:23 AM, Ralph Castain via users wrote:
>> Is Slurm built with PMIx support? Did you tell srun to use it?
>> 
>> 
>>> On Apr 23, 2020, at 7:00 AM, Prentice Bisbal via users 
>>>  wrote:
>>> 
>>> I'm using OpenMPI 4.0.3 with Slurm 19.05.5  I'm testing the software with a 
>>> very simple hello, world MPI program that I've used reliably for years. 
>>> When I submit the job through slurm and use srun to launch the job, I get 
>>> these errors:
>>> 
>>> *** An error occurred in MPI_Init
>>> *** on a NULL communicator
>>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>> ***and potentially your MPI job)
>>> [dawson029.pppl.gov:26070] Local abort before MPI_INIT completed completed 
>>> successfully, but am not able to aggregate error messages, and not able to 
>>> guarantee that all other processes were killed!
>>> *** An error occurred in MPI_Init
>>> *** on a NULL communicator
>>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>> ***and potentially your MPI job)
>>> [dawson029.pppl.gov:26076] Local abort before MPI_INIT completed completed 
>>> successfully, but am not able to aggregate error messages, and not able to 
>>> guarantee that all other processes were killed!
>>> 
>>> If I run the same job, but use mpiexec or mpirun instead of srun, the jobs 
>>> run just fine. I checked ompi_info to make sure OpenMPI was compiled with  
>>> Slurm support:
>>> 
>>> $ ompi_info | grep slurm
>>>   Configure command line: 
>>> '--prefix=/usr/pppl/intel/2019-pkgs/openmpi-4.0.3' '--disable-silent-rules' 
>>> '--enable-shared' '--with-pmix=internal' '--with-slurm' '--with-psm'
>>>  MCA ess: slurm (MCA v2.1.0, API v3.0.0, Component v4.0.3)
>>>  MCA plm: slurm (MCA v2.1.0, API v2.0.0, Component v4.0.3)
>>>  MCA ras: slurm (MCA v2.1.0, API v2.0.0, Component v4.0.3)
>>>   MCA schizo: slurm (MCA v2.1.0, API v1.0.0, Component v4.0.3)
>>> 
>>> Any ideas what could be wrong? Do you need any additional information?
>>> 
>>> Prentice
>>> 
>> 




Re: [OMPI users] Can't start jobs with srun.

2020-04-23 Thread Ralph Castain via users
Is Slurm built with PMIx support? Did you tell srun to use it?


> On Apr 23, 2020, at 7:00 AM, Prentice Bisbal via users 
>  wrote:
> 
> I'm using OpenMPI 4.0.3 with Slurm 19.05.5  I'm testing the software with a 
> very simple hello, world MPI program that I've used reliably for years. When 
> I submit the job through slurm and use srun to launch the job, I get these 
> errors:
> 
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***and potentially your MPI job)
> [dawson029.pppl.gov:26070] Local abort before MPI_INIT completed completed 
> successfully, but am not able to aggregate error messages, and not able to 
> guarantee that all other processes were killed!
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***and potentially your MPI job)
> [dawson029.pppl.gov:26076] Local abort before MPI_INIT completed completed 
> successfully, but am not able to aggregate error messages, and not able to 
> guarantee that all other processes were killed!
> 
> If I run the same job, but use mpiexec or mpirun instead of srun, the jobs 
> run just fine. I checked ompi_info to make sure OpenMPI was compiled with  
> Slurm support:
> 
> $ ompi_info | grep slurm
>   Configure command line: '--prefix=/usr/pppl/intel/2019-pkgs/openmpi-4.0.3' 
> '--disable-silent-rules' '--enable-shared' '--with-pmix=internal' 
> '--with-slurm' '--with-psm'
>  MCA ess: slurm (MCA v2.1.0, API v3.0.0, Component v4.0.3)
>  MCA plm: slurm (MCA v2.1.0, API v2.0.0, Component v4.0.3)
>  MCA ras: slurm (MCA v2.1.0, API v2.0.0, Component v4.0.3)
>   MCA schizo: slurm (MCA v2.1.0, API v1.0.0, Component v4.0.3)
> 
> Any ideas what could be wrong? Do you need any additional information?
> 
> Prentice
> 




Re: [OMPI users] Meaning of mpiexec error flags

2020-04-14 Thread Ralph Castain via users
Then those flags are correct. I suspect mpirun is executing on n006, yes? The 
"location verified" just means that the daemon of rank N reported back from the 
node we expected it to be on - Slurm and Cray sometimes renumber the ranks. 
Torque doesn't and so you should never see a problem. Since mpirun isn't 
launched by itself, its node is never "verified", though I probably should 
alter that as it is obviously in the "right" place.

I don't know what you mean by your app isn't behaving correctly on the remote 
nodes - best guess is that perhaps some envar they need isn't being forwarded?


On Apr 14, 2020, at 2:04 AM, Mccall, Kurt E. (MSFC-EV41) <kurt.e.mcc...@nasa.gov> wrote:

CentOS, Torque.
From: Ralph Castain <r...@open-mpi.org>
Sent: Monday, April 13, 2020 5:44 PM
To: Mccall, Kurt E. (MSFC-EV41) <kurt.e.mcc...@nasa.gov>
Subject: [EXTERNAL] Re: [OMPI users] Meaning of mpiexec error flags

 What kind of system are you running on? Slurm? Cray? ...?
 

On Apr 13, 2020, at 3:11 PM, Mccall, Kurt E. (MSFC-EV41) <kurt.e.mcc...@nasa.gov> wrote:
Thanks Ralph. So the difference between the working node flag (0x11) and the 
non-working nodes’ flags (0x13) is the flag PRRTE_NODE_FLAG_LOC_VERIFIED. 
What does that imply? The location of the daemon has NOT been verified?
 Kurt
From: users <users-boun...@lists.open-mpi.org> On Behalf Of Ralph Castain via users
Sent: Monday, April 13, 2020 4:47 PM
To: Open MPI Users <users@lists.open-mpi.org>
Cc: Ralph Castain <r...@open-mpi.org>
Subject: [EXTERNAL] Re: [OMPI users] Meaning of mpiexec error flags

 I updated the message to explain the flags (instead of a numerical value) for 
OMPI v5. In brief:
#define PRRTE_NODE_FLAG_DAEMON_LAUNCHED   0x01   // whether or not the daemon on this node has been launched
#define PRRTE_NODE_FLAG_LOC_VERIFIED      0x02   // whether or not the location has been verified - used for
                                                 // environments where the daemon's final destination is uncertain
#define PRRTE_NODE_FLAG_OVERSUBSCRIBED    0x04   // whether or not this node is oversubscribed
#define PRRTE_NODE_FLAG_MAPPED            0x08   // whether we have been added to the current map
#define PRRTE_NODE_FLAG_SLOTS_GIVEN       0x10   // the number of slots was specified - used only in non-managed environments
#define PRRTE_NODE_NON_USABLE             0x20   // the node is hosting a tool and is NOT to be used for jobs
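
As a quick way to decode the 0x11/0x13 values from the quoted question below (plain bash arithmetic, nothing OMPI-specific):

 $ echo $(( 0x11 & 0x02 ))   # 0 -> LOC_VERIFIED not set (n006)
 $ echo $(( 0x13 & 0x02 ))   # 2 -> LOC_VERIFIED set (the other nodes)
 # i.e. 0x11 = DAEMON_LAUNCHED | SLOTS_GIVEN; 0x13 additionally has LOC_VERIFIED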
  


On Apr 13, 2020, at 2:15 PM, Mccall, Kurt E. (MSFC-EV41) via users <users@lists.open-mpi.org> wrote:
 My application is behaving correctly on node n006, and incorrectly on the 
lower numbered nodes.   The flags in the error message below may give a clue as 
to why.   What is the meaning of the flag values 0x11 and 0x13?
 ==   ALLOCATED NODES   ==
    n006: flags=0x11 slots=3 max_slots=0 slots_inuse=2 state=UP
    n005: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP
    n004: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP
    n003: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP
    n002: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP
    n001: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP
 I’m using OpenMPI 4.0.3.
 Thanks,
Kurt



Re: [OMPI users] Meaning of mpiexec error flags

2020-04-13 Thread Ralph Castain via users
I updated the message to explain the flags (instead of a numerical value) for 
OMPI v5. In brief:

#define PRRTE_NODE_FLAG_DAEMON_LAUNCHED   0x01   // whether or not the daemon on this node has been launched
#define PRRTE_NODE_FLAG_LOC_VERIFIED      0x02   // whether or not the location has been verified - used for
                                                 // environments where the daemon's final destination is uncertain
#define PRRTE_NODE_FLAG_OVERSUBSCRIBED    0x04   // whether or not this node is oversubscribed
#define PRRTE_NODE_FLAG_MAPPED            0x08   // whether we have been added to the current map
#define PRRTE_NODE_FLAG_SLOTS_GIVEN       0x10   // the number of slots was specified - used only in non-managed environments
#define PRRTE_NODE_NON_USABLE             0x20   // the node is hosting a tool and is NOT to be used for jobs



On Apr 13, 2020, at 2:15 PM, Mccall, Kurt E. (MSFC-EV41) via users <users@lists.open-mpi.org> wrote:

My application is behaving correctly on node n006, and incorrectly on the lower 
numbered nodes.   The flags in the error message below may give a clue as to 
why.   What is the meaning of the flag values 0x11 and 0x13?
 ==   ALLOCATED NODES   ==
    n006: flags=0x11 slots=3 max_slots=0 slots_inuse=2 state=UP
    n005: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP
    n004: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP
    n003: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP
    n002: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP
    n001: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP
 I’m using OpenMPI 4.0.3.
 Thanks,
Kurt



Re: [OMPI users] Clean termination after receiving multiple SIGINT

2020-04-06 Thread Ralph Castain via users
I don't know that it is officially documented anywhere - it does get printed 
out when the first CTRL-C arrives. On the plus side, it has been 5 seconds (as 
opposed to some other time) since the beginning of OMPI, so it is pretty safe 
to rely on it.

I wonder if you could get around this problem another way - what if you pass 
mpirun the "--set-sid" option? This would put mpirun into its own process group 
and should (I believe) ensure that it doesn't see the user's CTRL-C so you 
could always just forward the signal.

Would that work for you?
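
A minimal bash sketch of that --set-sid approach (the application name and the cleanup body are placeholders):

 #!/bin/bash
 # our own cleanup, then forward the signal to mpirun, which no longer sees the user's CTRL-C directly
 cleanup() { echo "wrapper cleanup"; kill -INT "$MPIRUN_PID" 2>/dev/null; }
 trap cleanup INT TERM
 mpirun --set-sid -np 4 ./my_app &
 MPIRUN_PID=$!
 wait "$MPIRUN_PID"
 trap - INT TERM
 wait "$MPIRUN_PID"   # the first wait returns early when a trapped signal arrives; this one reaps mpirun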


On Apr 6, 2020, at 7:57 AM, Kreutzer, Moritz via users <users@lists.open-mpi.org> wrote:

Thanks for the explanation, Ralph!
 I guess the reason we need to pass the signal down is to achieve correct 
behavior when a signal does not come via CTRL+C, but in case someone kills our 
top-level script (which eventually calls mpirun) using “kill $PID” or similar, 
in which case we would have to forward the signal to mpirun. I think it’s a 
convenience feature not relying on the user having to kill the entire process 
group, but also achieving the desired behavior when they just kill the 
top-level script. Maybe we manage to find a way to distinguish signals which 
are sent only to the wrapper script (which we would want to forward) from 
signals which are sent to the entire process group (which we would not want to 
forward).
  Maybe waiting for 5 seconds would also be a viable workaround. Is this time 
span documented somewhere?
  Thanks,
Moritz
 --
Moritz Kreutzer
 Siemens Digital Industries Software
Simulation and Test Solutions, Product Development, High Performance Computing
Nordostpark 3
90411 Nuremberg, Germany 
Tel.: +49 (911) 38379 8085
moritz.kreut...@siemens.com
www.sw.siemens.com
From: users <users-boun...@lists.open-mpi.org> On Behalf Of Ralph Castain via users
Sent: Monday, April 6, 2020 16:32
To: Open MPI Users <users@lists.open-mpi.org>
Cc: Ralph Castain <r...@open-mpi.org>
Subject: Re: [OMPI users] Clean termination after receiving multiple SIGINT
 Currently, mpirun takes that second SIGINT to mean "you seem to be stuck 
trying to cleanly abort - just die", which means mpirun exits immediately 
without doing any cleanup. The individual procs all commit suicide when they 
see their daemons go away, which is why you don't get zombies left behind...but 
it does mean that the vader files are left.
 The second SIGINT has to come within a 5 second window of the first one to 
trigger that immediate exit, so one solution would be for you to delay your 
passing of the SIGINT to mpirun for more than 5 seconds. Alternatively, you 
could just not pass the signal at all since mpirun already received it - is 
there some reason why you need to pass the signal down? Are you trying to do 
your cleanup _before_ mpirun does its?
  Guess I'm wondering: why not just trap the signal, do your cleanup, and then 
wait for mpirun to terminate?
 

On Apr 6, 2020, at 6:39 AM, Kreutzer, Moritz via users <users@lists.open-mpi.org> wrote:
 Hi,
 We are invoking mpirun from within a script which installs some signal 
handlers. Now, if we abort an Open MPI run with CTRL+C, the system sends SIGINT 
to the entire process group. Hence, the mpirun process receives a SIGINT from 
the system with si_code=SI_KERNEL. Additionally, our own signal handler 
intercepts SIGINT, does some clean up, and sends the SIGINT further to the 
mpirun process with si_code=SI_USER. Consequently, mpirun receives 2x SIGINT. 
This leads to unclean termination with Open MPI 4.0.3. While it does not leave 
behind any zombie processes, killing it in the described way leads to leftover 
vader shared memory segment files in /dev/shm (a known issue with Open MPI 3, 
but supposedly resolved in Open MPI 4). Also, strace shows that the mpirun 
process does not receive any SIGCHILD.
 If we remove our own signal handler (which is not our preferred option), 
mpirun receives only a single SIGINT and n times SIGCHILD (n is the number of 
processes). Also, this leads to correct clean up of vader shared memory segment 
files.
 Is it expected that the cleanup fails when mpirun receives multiple signals at 
the same time? If yes, is the only way to guarantee proper clean up to always 
make sure that only a single signal gets propagated to mpirun?
   Thanks,
Moritz
 --
Moritz Kreutzer
 Siemens Digital Industries Software
Simulation and Test Solutions, Product Development, High Performance Computing
Nordostpark 3
90411 Nuremberg, Germany 
Tel.: +49 (911) 38379 8085
moritz.kreut...@siemens.com <mailto:moritz.kreut...@siemens.com> 
www.sw.siemens.com <http://www.sw.siemens.com/> 
  -
Siemens Industry Software GmbH; Anschrift: Franz-Geuer-Str. 10, 50823 Köln; 
Gesellschaft mit beschränkter Haftung; Geschäftsfüh

Re: [OMPI users] Clean termination after receiving multiple SIGINT

2020-04-06 Thread Ralph Castain via users
Currently, mpirun takes that second SIGINT to mean "you seem to be stuck trying 
to cleanly abort - just die", which means mpirun exits immediately without 
doing any cleanup. The individual procs all commit suicide when they see their 
daemons go away, which is why you don't get zombies left behind...but it does 
mean that the vader files are left.

The second SIGINT has to come within a 5 second window of the first one to 
trigger that immediate exit, so one solution would be for you to delay your 
passing of the SIGINT to mpirun for more than 5 seconds. Alternatively, you 
could just not pass the signal at all since mpirun already received it - is 
there some reason why you need to pass the signal down? Are you trying to do 
your cleanup _before_ mpirun does its?


Guess I'm wondering: why not just trap the signal, do your cleanup, and then 
wait for mpirun to terminate?
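
A minimal bash sketch of the delayed-forwarding variant (the application name and the cleanup body are placeholders; the 6-second sleep just keeps the forwarded SIGINT outside mpirun's 5-second double-SIGINT window):

 #!/bin/bash
 cleanup() { echo "wrapper cleanup"; sleep 6; kill -INT "$MPIRUN_PID" 2>/dev/null; }
 trap cleanup INT
 mpirun -np 4 ./my_app &
 MPIRUN_PID=$!
 wait "$MPIRUN_PID"
 trap - INT
 wait "$MPIRUN_PID"   # reap mpirun after the trap handler has run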


On Apr 6, 2020, at 6:39 AM, Kreutzer, Moritz via users <users@lists.open-mpi.org> wrote:

Hi,
 We are invoking mpirun from within a script which installs some signal 
handlers. Now, if we abort an Open MPI run with CTRL+C, the system sends SIGINT 
to the entire process group. Hence, the mpirun process receives a SIGINT from 
the system with si_code=SI_KERNEL. Additionally, our own signal handler 
intercepts SIGINT, does some clean up, and sends the SIGINT further to the 
mpirun process with si_code=SI_USER. Consequently, mpirun receives 2x SIGINT. 
This leads to unclean termination with Open MPI 4.0.3. While it does not leave 
behind any zombie processes, killing it in the described way leads to leftover 
vader shared memory segment files in /dev/shm (a known issue with Open MPI 3, 
but supposedly resolved in Open MPI 4). Also, strace shows that the mpirun 
process does not receive any SIGCHILD.
 If we remove our own signal handler (which is not our preferred option), 
mpirun receives only a single SIGINT and n times SIGCHILD (n is the number of 
processes). Also, this leads to correct clean up of vader shared memory segment 
files.
 Is it expected that the cleanup fails when mpirun receives multiple signals at 
the same time? If yes, is the only way to guarantee proper clean up to always 
make sure that only a single signal gets propagated to mpirun?
   Thanks,
Moritz
 --
Moritz Kreutzer
 Siemens Digital Industries Software
Simulation and Test Solutions, Product Development, High Performance Computing
Nordostpark 3
90411 Nuremberg, Germany 
Tel.: +49 (911) 38379 8085
moritz.kreut...@siemens.com  
www.sw.siemens.com  
  -
Siemens Industry Software GmbH; Anschrift: Franz-Geuer-Str. 10, 50823 Köln; 
Gesellschaft mit beschränkter Haftung; Geschäftsführer: Dr. Erich Bürgel, 
Alexander Walter; Sitz der Gesellschaft: Köln; Registergericht: Amtsgericht 
Köln, HRB 84564; Vorsitzender des Aufsichtsrats: Jürgen Köhler 




Re: [OMPI users] mpirun CLI parsing

2020-03-30 Thread Ralph Castain via users
I'm afraid the short answer is "no" - there is no way to do that today.


> On Mar 30, 2020, at 1:45 PM, Jean-Baptiste Skutnik via users 
>  wrote:
> 
> Hello,
> 
> I am writing a wrapper around `mpirun` which requires pre-processing of the 
> user's program. To achieve this, I need to isolate the program from the 
> `mpirun` arguments on the command-line. The manual describes the program as:
> ```
>  The program executable. This is identified as the first 
> non-recognized argument to `mpirun`.
> ```
> However, it would be very unreliable to re-write my own parser and check the 
> command line, so I was wondering if there is a clean built-in way to output 
> what the argument parser understood as "the program" ?
> 
> Thanks,
> 
> Jean-Baptiste Skutnik




Re: [OMPI users] MPI_Comm_spawn: no allocated resources for the application ...

2020-03-16 Thread Ralph Castain via users
Sorry for the incredibly late reply. Hopefully, you have already managed to 
find the answer.

I'm not sure what your comm_spawn command looks like, but it appears you 
specified the host in it using the "dash_host" info-key, yes? The problem is 
that this is interpreted the same way as the "-host n001.cluster.com" option on 
an mpiexec cmd line - which means that it 
only allocates _one_ slot to the request. If you are asking to spawn two procs, 
then you don't have adequate resources. One way to check is to only spawn one 
proc with your comm_spawn request and see if that works.

If you want to specify the host, then you need to append the number of slots to 
allocate on that host - e.g., "n001.cluster.com:2". 
Of course, you cannot allocate more than the system provided minus the number 
currently in use. There are additional modifiers you can pass to handle 
variable numbers of slots.
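
To illustrate the slot-count syntax on the equivalent mpiexec command line (just a sketch; the host names and the ./MyWorker binary are the poster's):

 $ mpiexec -host n001.cluster.com:2,n002.cluster.com:2 -np 4 ./MyWorker

The same "host:slots" form goes into the value of the comm_spawn info key.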

HTH
Ralph


On Oct 25, 2019, at 5:30 AM, Mccall, Kurt E. (MSFC-EV41) via users <users@lists.open-mpi.org> wrote:

I am trying to launch a number of manager processes, one per node, and then have
each of those managers spawn, on its own node, a number of workers. For 
this example,
I have 2 managers and 2 workers per manager.  I'm following the instructions at 
this link
 
https://stackoverflow.com/questions/47743425/controlling-node-mapping-of-mpi-comm-spawn
 to force one manager process per node.
  Here is my PBS/Torque qsub command:
 $ qsub -V -j oe -e ./stdio -o ./stdio -f -X -N MyManagerJob -l nodes=2:ppn=3  
MyManager.bash
 I expect "-l nodes=2:ppn=3" to reserve 2 nodes with 3 slots on each (one slot 
for the manager and the other two for the separately spawned workers).  The 
first  argument
is a lower-case L, not a one.
   Here is my mpiexec command within the MyManager.bash script.
 mpiexec --enable-recovery --display-map --display-allocation --mca 
mpi_param_check 1 --v --x DISPLAY --np 2  --map-by ppr:1:node  MyManager.exe
 I expect "--map-by ppr:1:node" to cause OpenMpi to launch exactly one manager 
on each node. 
When the first worker is spawned via MPI_Comm_spawn(), OpenMpi reports:
 ==   ALLOCATED NODES   ==
    n002: flags=0x11 slots=3 max_slots=0 slots_inuse=3 state=UP
n001: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP
=
--
There are no allocated resources for the application:
  ./MyWorker
that match the requested mapping:
  -host: n001.cluster.com  
 Verify that you have mapped the allocated resources properly for the
indicated specification.
--
[n001:14883] *** An error occurred in MPI_Comm_spawn
[n001:14883] *** reported by process [1897594881,1]
[n001:14883] *** on communicator MPI_COMM_SELF
[n001:14883] *** MPI_ERR_SPAWN: could not spawn processes
In the banner above, it clearly states that node n001 has 3 slots reserved
and only one slot in use at the time of the spawn. Not sure why it reports
that there are no resources for it.
 I've tried compiling OpenMpi 4.0 both with and without Torque support, and
I've tried using an explicit host file (or not), but the error is unchanged. 
Any ideas?
 My cluster is running CentOS 7.4 and I am using the Portland Group C++ 
compiler.



Re: [OMPI users] Interpreting the output of --display-map and --display-allocation

2020-03-16 Thread Ralph Castain via users
FWIW: I have replaced those flags in the display option output with their 
string equivalent to make interpretation easier. This is available in OMPI 
master and will be included in the v5 release.



> On Nov 21, 2019, at 2:08 AM, Peter Kjellström via users 
>  wrote:
> 
> On Mon, 18 Nov 2019 17:48:30 +
> "Mccall, Kurt E. \(MSFC-EV41\) via users" 
> wrote:
> 
>> I'm trying to debug a problem with my job, launched with the mpiexec
>> options -display-map and -display-allocation, but I don't know how to
>> interpret the output.   For example,  mpiexec displays the following
>> when a job is spawned by MPI_Comm_spawn():
>> 
>> ==   ALLOCATED NODES   ==
>>n002: flags=0x11 slots=3 max_slots=0 slots_inuse=2 state=UP
>>n001: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP
>> 
>> Maybe the differing "flags" values have bearing on the problem, but I
>> don't know what they mean.   Are the outputs of these two options
>> documented anywhere?
> 
> I don't know of any such specific documentation but the flag values are
> defined in:
> 
> orte/util/attr.h:54 (openmpi-3.1.4)
> 
> The difference between your nodes (bit value 0x2) means:
> 
> #define ORTE_NODE_FLAG_LOC_VERIFIED   0x02   
> 
> // whether or not the location has been verified - used for
> // environments where the daemon's final destination is uncertain
> 
> I do not know what that means exactly but it is not related to pinning
> on or off.
> 
> Seems to indicate a broken launch and/or install and/or environment.
> 
> /Peter K




Re: [OMPI users] Propagating SIGINT instead of SIGTERM to children processes

2020-03-16 Thread Ralph Castain via users
Hi Nathan

Sorry for the long, long delay in responding - no reasonable excuse (just busy, 
switching over support areas, etc.). Hopefully, you already found the solution.

You can specify the signals to forward to children using an MCA parameter:

OMPI_MCA_ess_base_forward_signals=SIGINT

should do what you are seeking. You can get a list of these using the 
"ompi_info" program that comes with OpenMPI. In this case, you would find the 
above param with the following help output:

Comma-delimited list of additional signals (names or integers) to forward to 
application processes [\"none\" => forward nothing]. Signals provided by 
default include SIGTSTP, SIGUSR1, SIGUSR2, SIGABRT, SIGALRM, and SIGCONT
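
For example (the script name is taken from the original question; the rest is just a sketch):

 $ export OMPI_MCA_ess_base_forward_signals=SIGINT
 $ mpiexec -np 4 python myscript.py
 $ ompi_info --param all all --level 9 | grep forward_signals   # inspect the parameter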


HTH
Ralph

> On Sep 28, 2019, at 4:42 AM, Nathan GREINER via users 
>  wrote:
> 
> Dear open-mpi users,
> 
> I am using open-mpi in conjunction with the mpi4py package to run parallel 
> simulations using python on my local machine.
> 
> I use the following idiom:
> 
>mpiexec -np 4 python myscript.py
> 
> When I hit ^C during the execution of the above command, the mpi program is 
> interrupted, and the python programs are also interrupted.
> 
> However, I get no traceback from the python programs, and more 
> problematically, the cleanup functions of these programs are not executed as 
> they should when these programs get interrupted.
> 
> The open-mpi documentation states that: "When orterun (<=> mpiexec <=> 
> mpirun) receives a SIGTERM and SIGINT, it will attempt to kill the entire job 
> by sending all processes in the job a SIGTERM, waiting a small number of 
> seconds, then sending all processes in the job a SIGKILL."
> 
> Thus, the python programs receive a SIGTERM signal instead of the SIGINT 
> signal that they would receive upon hitting ^C during an execution launched 
> with the idiom:
> 
>python myscript.py
> 
> I know that there is a way to make the python programs handle the SIGTERM 
> signal as if it was a SIGINT signal (namely, raising a KeyboardInterrupt), 
> but I would prefer to be able to configure mpiexec to propagate the SIGINT 
> signal it receives instead of sending a SIGTERM signal to its children 
> processes.
> 
> Would you know how this could be achieved?
> 
> Thank you very much for your time and help,
> 
> Nathan GREINER
> 
> PS: I am new to the open-mpi users mailing list: is this the right place and 
> way to ask such a question?
> 
> 




Re: [OMPI users] [EXTERNAL] Shmem errors on Mac OS Catalina

2020-02-06 Thread Ralph Castain via users
It is also wise to create a "tmp" directory under your home directory, and 
reset TMPDIR to point there. Avoiding use of the system tmpdir is highly 
advisable under Mac OS, especially Catalina.
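
Something like the following, for instance (the directory name is arbitrary; the mpiexec line is from the report below):

 $ mkdir -p "$HOME/tmp"
 $ export TMPDIR="$HOME/tmp"
 $ mpiexec --mca shmem posix --oversubscribe -np 8 main.out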


On Feb 6, 2020, at 4:09 PM, Gutierrez, Samuel K. via users <users@lists.open-mpi.org> wrote:

Good afternoon,

I fear that files created with shm_open(), a call used in the posix shmem 
component, are not being cleaned up properly. To test this theory, can you 
please reboot your computer and try again? Rebooting should remove any 
temporary files created with shm_open().

Sam

On Feb 6, 2020, at 4:34 PM, Jin Tao via users <users@lists.open-mpi.org> wrote:

Hi,

I am running Open MPI on Mac OS Catalina and am running into an issue.

- I installed Open MPI and everything seemed to be working fine until a few 
hours ago.

- I make and run with the terminal command:

 make &&  mpiexec --mca shmem posix --oversubscribe -np 8 main.out

- When I do this, I get the following error:

opal_shmem_base_select failed
--> Returned value -1 instead of OPAL_SUCCESS

- If I remove some of the compiler flags, and run as follows, I get another 
error:

schmem: posix: file name search - max attempts exceeded.cannot continue with 
posix.

- In fact, now the program fails to compile at all, and just hangs when I run 
the make and run command outlined earlier.

The program was running correctly earlier today. Could someone please advise? I 
am new to MPI and would ideally like to develop a lot of skill in using this 
library for my work.

Thanks!



