Re: [OMPI users] Gadget2 error 818 when using more than 1 process?

2022-02-07 Thread Diego Zuccato via users

Sorry for the late answer.

I thought the same, but after more testing I no longer do, since re-running 
the same code on the same data, on the same node, with the same parameters 
sometimes works and sometimes doesn't.

The user says it works (reliably) unmodified on other clusters.
We'll try contacting Gadget2 authors, too.

Il 27/01/2022 14:52, Jeff Squyres (jsquyres) ha scritto:

I'm afraid that without any further details, it's hard to help. I don't know 
why Gadget2 would complain about its parameters file.  From what you've stated, 
it could be a problem with the application itself.

Have you talked to the Gadget2 authors?

--
Jeff Squyres
jsquy...@cisco.com


From: users  on behalf of Diego Zuccato via users 

Sent: Wednesday, January 26, 2022 2:06 AM
To: users@lists.open-mpi.org
Cc: Diego Zuccato
Subject: Re: [OMPI users] Gadget2 error 818 when using more than 1 process?

Il 26/01/2022 02:10, Jeff Squyres (jsquyres) via users ha scritto:


I'm afraid I don't know anything about Gadget, so I can't comment there.  How 
exactly does the application fail?

Neither do I :(
It fails saying a 'timestep' is 0, which is usually caused by an error
in the parameters file. But the parameters file is OK, and it actually
works if the user runs it as a single process -- or even with
multithreaded runs, sometimes and on some nodes. That's quite random :(
The runs are usually single-node (simple examples for students).


Can you try upgrading to Open MPI v4.1.2?

That would be a real mess. I'm stuck with packages provided by Debian
stable. I lack both the manpower and the knowledge to compile everything
from scratch, given the intricate relations between slurm, openmpi,
infiniband, etc. :(


What networking are you using?

InfiniBand (Mellanox cards, with Debian-supplied drivers and support
programs) and Ethernet. InfiniBand is also used, via IPoIB, to reach the
storage servers (Gluster). Some nodes lack IB, so access to the storage
is achieved via a couple of iptables rules.



From: users  on behalf of Diego Zuccato via users 

Sent: Tuesday, January 25, 2022 5:43 AM
To: Open MPI Users
Cc: Diego Zuccato
Subject: [OMPI users] Gadget2 error 818 when using more than 1 process?

Hello all.

A user of our cluster is experiencing a weird problem that I can't pinpoint.

He has a job script that used to work well on every node. It's based on
Gadget2.

Lately, the same executable with the same parameters file *sometimes*
works and sometimes fails, on the same node and submitted with the same
command. On some nodes it always fails. But if the run is reduced to a
sequential one (asking for just one process), it completes correctly (so the
parameters file, a common source of Gadget2 error 818, seems innocent).

The cluster uses SLURM and limits resources using cgroups, if that matters.

It seems most of the issues started after upgrading from Open MPI 3.1.3 to
4.1.0 in September.

Maybe related: the nodes started spitting out these warnings (which, IIUC,
should be harmless... but I'd like to debug & resolve them anyway):
-8<--
Open MPI's OFI driver detected multiple equidistant NICs from the
current process, but had insufficient information to ensure MPI
processes fairly pick a NIC for use.
This may negatively impact performance. A more modern PMIx server is
necessary to resolve this issue.
-8<--

Code is run (from the jobfile) with:
srun --mpi=pmix_v4 ./Gadget2 paramfile
(we also tried with a simple mpirun w/ no extra parameters leveraging
SLURM's integration/autodetection -- same result)
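
For reference, a minimal sketch of the kind of jobfile involved (the job name
and task count below are placeholders for illustration, not the user's real
settings):

#!/bin/bash
#SBATCH --job-name=gadget2-run    # placeholder name
#SBATCH --nodes=1                 # runs are usually single-node
#SBATCH --ntasks=8                # number of MPI processes (placeholder)
srun --mpi=pmix_v4 ./Gadget2 paramfile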

Any hints?

TIA

--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786


--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786


--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786


Re: [OMPI users] [EXTERNAL] openib BTL disabled when using MPI_Init_thread

2022-02-07 Thread Pritchard Jr., Howard via users
HI Jose,

I bet this device has not been tested with UCX.

You may want to join the ucx users mail list at

https://elist.ornl.gov/mailman/listinfo/ucx-group

and ask whether this Marvell device has been tested, and whether there are 
workarounds for disabling the features that this device doesn't support.

Again though, you really may want to first see if the TCP btl will be good 
enough for your cluster. 
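
Something along these lines (just a sketch; component names can vary a bit
across releases, and "your_app" is a placeholder) would restrict Open MPI to
the TCP BTL for inter-node traffic so you can gauge whether it is sufficient:

mpirun --mca pml ob1 --mca btl self,vader,tcp -np 16 ./your_app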

Howard

On 2/4/22, 8:03 AM, "Jose E. Roman"  wrote:

Howard,

I don't have much time now to try with --enable-debug.

The RoCE device we have is a FastLinQ QL41000 Series 10/25/40/50GbE Controller.
The output of ibv_devinfo is:
hca_id: qedr0
transport:  InfiniBand (0)
fw_ver: 8.20.0.0
node_guid:  2267:7cff:fe11:4a50
sys_image_guid: 2267:7cff:fe11:4a50
vendor_id:  0x1077
vendor_part_id: 32880
hw_ver: 0x0
phys_port_cnt:  1
port:   1
state:  PORT_ACTIVE (4)
max_mtu:4096 (5)
active_mtu: 1024 (3)
sm_lid: 0
port_lid:   0
port_lmc:   0x00
link_layer: Ethernet

hca_id: qedr1
transport:  InfiniBand (0)
fw_ver: 8.20.0.0
node_guid:  2267:7cff:fe11:4a51
sys_image_guid: 2267:7cff:fe11:4a51
vendor_id:  0x1077
vendor_part_id: 32880
hw_ver: 0x0
phys_port_cnt:  1
port:   1
state:  PORT_DOWN (1)
max_mtu:4096 (5)
active_mtu: 1024 (3)
sm_lid: 0
port_lid:   0
port_lmc:   0x00
link_layer: Ethernet

Regarding UCX, we have tried the latest version. Compilation goes 
through, but the ucx_info command gives an error:

# Memory domain: qedr0
# Component: ib
# register: unlimited, cost: 180 nsec
#   remote key: 8 bytes
#   local memory handle is required for zcopy
#
#  Transport: rc_verbs
# Device: qedr0:1
#   Type: network
#  System device: qedr0 (0)
[1643982133.674556] [kahan01:8217 :0]rc_iface.c:505  UCX ERROR 
ibv_create_srq() failed: Function not implemented
#   < failed to open interface >
#
#  Transport: ud_verbs
# Device: qedr0:1
#   Type: network
#  System device: qedr0 (0)
[qelr_create_qp:545]create qp: failed on ibv_cmd_create_qp with 22
[1643982133.681169] [kahan01:8217 :0]ib_iface.c:994  UCX ERROR 
iface=0x56074944bf10: failed to create UD QP TX wr:256 sge:6 inl:64 resp:0 RX 
wr:4096 sge:1 resp:0: Invalid argument
#   < failed to open interface >
#
# Memory domain: qedr1
# Component: ib
# register: unlimited, cost: 180 nsec
#   remote key: 8 bytes
#   local memory handle is required for zcopy
#   < no supported devices found >


Any idea what the error in ibv_create_srq() means?
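
In case it helps the discussion, this is the kind of workaround I was planning
to ask the UCX list about -- restricting UCX to transports that do not need the
failing verbs features (UCX_TLS is UCX's transport-selection variable; the
particular transport list here is only a guess on my part):

export UCX_TLS=tcp,sm,self   # skip rc_verbs/ud_verbs, fall back to TCP + shared memory
ucx_info -d                  # re-check which transports UCX now reports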

Thanks for your help.
Jose



> El 3 feb 2022, a las 17:52, Pritchard Jr., Howard  
escribió:
> 
> Hi Jose,
> 
> A number of things.  
> 
> First, for recent versions of Open MPI, including the 4.1.x release stream, 
MPI_THREAD_MULTIPLE is supported by default.  However, some transport options 
available when using MPI_Init may not be available when requesting 
MPI_THREAD_MULTIPLE.
> You may want to let Open MPI trundle along with tcp used for inter-node 
messaging and see if your application performs well enough. For a small system 
tcp may well suffice. 
> 
> Second, if you want to pursue this further, you want to rebuild Open MPI 
with --enable-debug.  The debug output will be considerably more verbose and 
provides more info.  I think you will get  a message saying rdmacm CPC is 
excluded owing to the requested thread support level.  There may be info about 
why udcm is not selected as well.
> 
> Third, what sort of RoCE devices are available on your system?  The 
output from ibv_devinfo may be useful. 
> 
> As for UCX, if it's the version that came with your Ubuntu 18.04 release 
it may be pretty old.  It's likely that UCX has not been tested on the 
RoCE devices on your system.
> 
> You can run 
> 
> ucx_info -v
> 
> to 

Re: [OMPI users] Using OSU benchmarks for checking Infiniband network

2022-02-07 Thread Gus Correa via users
This may have changed since, but these used to be relevant points.
Overall, the Open MPI FAQ has lots of good suggestions:
https://www.open-mpi.org/faq/
some specific for performance tuning:
https://www.open-mpi.org/faq/?category=tuning
https://www.open-mpi.org/faq/?category=openfabrics

1) Make sure you are not using Ethernet TCP/IP, which is widely
available on compute nodes:

mpirun --mca btl self,sm,openib ...

https://www.open-mpi.org/faq/?category=tuning#selecting-components

However, this may have changed lately:
https://www.open-mpi.org/faq/?category=tcp#tcp-auto-disable

2) Maximum locked memory used by IB and its system limit. Start here:
https://www.open-mpi.org/faq/?category=openfabrics#limiting-registered-memory-usage

3) The eager vs. rendezvous message size threshold.
I wonder if it may sit right where you see the latency spike.
https://www.open-mpi.org/faq/?category=all#ib-locked-pages-user

4) Processor and memory locality/affinity and binding (please check
the current options and syntax)
https://www.open-mpi.org/faq/?category=tuning#using-paffinity-v1.4
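
Putting those points together, a sketch of the kind of command line I mean
(parameter names are from the older openib-based releases and may differ in
your Open MPI version; the eager-limit value is only a placeholder to tune):

mpirun --mca btl self,sm,openib \
       --mca btl_openib_eager_limit 65536 \
       --bind-to core --map-by node \
       -np 200 ./osu_allreduce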


On Mon, Feb 7, 2022 at 11:01 AM Benson Muite via users <
users@lists.open-mpi.org> wrote:

> Following https://www.open-mpi.org/doc/v3.1/man1/mpirun.1.php
>
> mpirun --verbose --display-map
>
> Have you tried newer OpenMPI versions?
>
> Do you get similar behavior for the osu_reduce and osu_gather benchmarks?
>
> Typically internal buffer sizes as well as your hardware will affect
> performance. Can you give specifications similar to what is available at:
> http://mvapich.cse.ohio-state.edu/performance/collectives/
> where the operating system, switch, node type and memory are indicated.
>
> If you need good performance, may want to also specify the algorithm
> used. You can find some of the parameters you can tune using:
>
> ompi_info --all
>
> A particular helpful parameter is:
>
> MCA coll tuned: parameter "coll_tuned_allreduce_algorithm" (current
> value: "ignore", data source: default, level: 5 tuner/detail, type: int)
>Which allreduce algorithm is used. Can be
> locked down to any of: 0 ignore, 1 basic linear, 2 nonoverlapping (tuned
> reduce + tuned bcast), 3 recursive doubling, 4 ring, 5 segmented ring
>Valid values: 0:"ignore", 1:"basic_linear",
> 2:"nonoverlapping", 3:"recursive_doubling", 4:"ring",
> 5:"segmented_ring", 6:"rabenseifner"
>MCA coll tuned: parameter
> "coll_tuned_allreduce_algorithm_segmentsize" (current value: "0", data
> source: default, level: 5 tuner/detail, type: int)
>
> For OpenMPI 4.0, there is a tuning program [2] that might also be helpful.
>
> [1]
>
> https://stackoverflow.com/questions/36635061/how-to-check-which-mca-parameters-are-used-in-openmpi
> [2] https://github.com/open-mpi/ompi-collectives-tuning
>
> On 2/7/22 4:49 PM, Bertini, Denis Dr. wrote:
> > Hi
> >
> > When i repeat i always got the huge discrepancy at the
> >
> > message size of 16384.
> >
> > May be there is a way to run mpi in verbose mode in order
> >
> > to further investigate this behaviour?
> >
> > Best
> >
> > Denis
> >
> > 
> > *From:* users  on behalf of Benson
> > Muite via users 
> > *Sent:* Monday, February 7, 2022 2:27:34 PM
> > *To:* users@lists.open-mpi.org
> > *Cc:* Benson Muite
> > *Subject:* Re: [OMPI users] Using OSU benchmarks for checking Infiniband
> > network
> > Hi,
> > Do you get similar results when you repeat the test? Another job could
> > have interfered with your run.
> > Benson
> > On 2/7/22 3:56 PM, Bertini, Denis Dr. via users wrote:
> >> Hi
> >>
> >> I am using OSU microbenchmarks compiled with openMPI 3.1.6 in order to
> >> check/benchmark
> >>
> >> the infiniband network for our cluster.
> >>
> >> For that i use the collective all_reduce benchmark and run over 200
> >> nodes, using 1 process per node.
> >>
> >> And this is the results i obtained 
> >>
> >>
> >>
> >> 
> >>
> >> # OSU MPI Allreduce Latency Test v5.7.1
> >> # Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations
> >> 4                     114.65             83.22            147.98        1000
> >> 8                     133.85            106.47            164.93        1000
> >> 16                    116.41             87.57            150.58        1000
> >> 32                    112.17             93.25            130.23        1000
> >> 64                    106.85             81.93            134.74        1000
> >> 128                   117.53             87.50            152.27        1000
> >> 256                   143.08            115.63            173.97        1000
> >> 512                   130.34            100.20            167.56        1000
> >> 1024                  155.67            111.29            188.20        1000
> >> 2048                  151.82            116.03            198.19        1000
> >> 4096                  159.11

Re: [OMPI users] Using OSU benchmarks for checking Infiniband network

2022-02-07 Thread Bertini, Denis Dr. via users
Hi

I changed the algorithm used to the ring algorithm (4), for example, and the
scan changed to:


# OSU MPI Allreduce Latency Test v5.7.1
# Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations
4                      59.39             51.04             65.36           1
8                     109.13             90.14            126.32           1
16                    253.26             60.89            290.31           1
32                     75.04             54.53             83.28           1
64                     96.40             59.73            111.45           1
128                    67.86             59.73             76.44           1
256                    76.32             67.33             85.18           1
512                   129.93             85.76            170.31           1
1024                  168.51            129.15            194.68           1
2048                  136.17            110.09            156.94           1
4096                  173.59            130.76            199.21           1
8192                  236.05            170.77            269.98           1
16384                4212.65           3627.71           4992.04           1
32768                1243.05           1205.11           1276.11           1
65536                1464.50           1364.76           1531.48           1
131072               1558.71           1454.52           1632.91           1
262144               1681.58           1609.15           1745.44           1
524288               2305.73           2178.17           2402.69           1
1048576              3389.83           3220.44           3517.61           1

Would this mean that the first results were linked to the underlying algorithm 
used by default in Open MPI (0=ignore)?

Do you know what this algorithm (0=ignore) is?

I still see the wall for message size 16384 though ...

Best

Denis







From: Benson Muite 
Sent: Monday, February 7, 2022 4:59:45 PM
To: Bertini, Denis Dr.; Open MPI Users
Subject: Re: [OMPI users] Using OSU benchmarks for checking Infiniband network

Following https://www.open-mpi.org/doc/v3.1/man1/mpirun.1.php

mpirun --verbose --display-map

Have you tried newer OpenMPI versions?

Do you get similar behavior for the osu_reduce and osu_gather benchmarks?

Typically internal buffer sizes as well as your hardware will affect
performance. Can you give specifications similar to what is available at:
http://mvapich.cse.ohio-state.edu/performance/collectives/
where the operating system, switch, node type and memory are indicated.

If you need good performance, may want to also specify the algorithm
used. You can find some of the parameters you can tune using:

ompi_info --all

A particular helpful parameter is:

MCA coll tuned: parameter "coll_tuned_allreduce_algorithm" (current
value: "ignore", data source: default, level: 5 tuner/detail, type: int)
   Which allreduce algorithm is used. Can be
locked down to any of: 0 ignore, 1 basic linear, 2 nonoverlapping (tuned
reduce + tuned bcast), 3 recursive doubling, 4 ring, 5 segmented ring
   Valid values: 0:"ignore", 1:"basic_linear",
2:"nonoverlapping", 3:"recursive_doubling", 4:"ring",
5:"segmented_ring", 6:"rabenseifner"
   MCA coll tuned: parameter
"coll_tuned_allreduce_algorithm_segmentsize" (current value: "0", data
source: default, level: 5 tuner/detail, type: int)

For OpenMPI 4.0, there is a tuning program [2] that might also be helpful.

[1]
https://stackoverflow.com/questions/36635061/how-to-check-which-mca-parameters-are-used-in-openmpi
[2] https://github.com/open-mpi/ompi-collectives-tuning

On 2/7/22 4:49 PM, Bertini, Denis Dr. wrote:
> Hi
>
> When i repeat i always got the huge discrepancy at the
>
> message size of 16384.
>
> May be there is a way to run mpi in verbose mode in order
>
> to further investigate this behaviour?
>
> Best
>
> Denis
>
> 
> *From:* users  on behalf of Benson
> Muite via users 
> *Sent:* Monday, February 7, 2022 2:27:34 PM
> *To:* users@lists.open-mpi.org
> *Cc:* Benson Muite
> *Subject:* Re: [OMPI users] Using OSU benchmarks for checking Infiniband
> network
> Hi,
> Do you get similar results when you repeat the test? Another job could
> have interfered with your run.
> Benson
> On 2/7/22 3:56 PM, Bertini, Denis Dr. via users wrote:
>> Hi
>>
>> I am using OSU microbenchmarks compiled with openMPI 3.1.6 in order to
>> check/benchmark
>>
>> the infiniband network for our cluster.
>>
>> For that i use the collective all_reduce benchmark and run over 200
>> nodes, using 1 process per node.
>>
>> And this is the results i obtained 
>>
>>
>>
>> 
>>
>> # OSU MPI Allreduce Latency Test v5.7.1
>> # Size   Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations
>> 4 

Re: [OMPI users] Using OSU benchmarks for checking Infiniband network

2022-02-07 Thread Bertini, Denis Dr. via users
Hi,
I ran the allgather benchmark and got these values,
which also show a step-wise performance drop as a function
of message size.
Would this be linked to the underlying algorithm used for the collective operation?


# OSU MPI Allgather Latency Test v5.7.1
# Size       Avg Latency(us)
1                      70.36
2                      47.01
4                      72.42
8                      49.62
16                     57.93
32                     50.11
64                     57.29
128                    74.05
256                   454.41
512                   544.04
1024                  580.96
2048                  711.40
4096                  905.14
8192                 2002.32
16384                2652.59
32768                4034.35
65536                6816.29
131072              14280.11
262144              28451.46
524288              54719.41
1048576            106607.19



I use srun and not mpirun; how do I activate the verbosity flag in that case?
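
This is what I was thinking of trying, in case it goes in the right direction
(assuming Open MPI's OMPI_MCA_ environment-variable convention also applies to
ranks launched through srun; the algorithm value is only an example):

export OMPI_MCA_coll_base_verbose=100             # verbose output from the coll framework
export OMPI_MCA_coll_tuned_use_dynamic_rules=1
export OMPI_MCA_coll_tuned_allgather_algorithm=4  # e.g. force a ring-style algorithm
srun -N 200 --ntasks-per-node=1 ./osu_allgather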


Best

Denis



From: Benson Muite 
Sent: Monday, February 7, 2022 4:59:45 PM
To: Bertini, Denis Dr.; Open MPI Users
Subject: Re: [OMPI users] Using OSU benchmarks for checking Infiniband network

Following https://www.open-mpi.org/doc/v3.1/man1/mpirun.1.php

mpirun --verbose --display-map

Have you tried newer OpenMPI versions?

Do you get similar behavior for the osu_reduce and osu_gather benchmarks?

Typically internal buffer sizes as well as your hardware will affect
performance. Can you give specifications similar to what is available at:
http://mvapich.cse.ohio-state.edu/performance/collectives/
where the operating system, switch, node type and memory are indicated.

If you need good performance, may want to also specify the algorithm
used. You can find some of the parameters you can tune using:

ompi_info --all

A particular helpful parameter is:

MCA coll tuned: parameter "coll_tuned_allreduce_algorithm" (current
value: "ignore", data source: default, level: 5 tuner/detail, type: int)
   Which allreduce algorithm is used. Can be
locked down to any of: 0 ignore, 1 basic linear, 2 nonoverlapping (tuned
reduce + tuned bcast), 3 recursive doubling, 4 ring, 5 segmented ring
   Valid values: 0:"ignore", 1:"basic_linear",
2:"nonoverlapping", 3:"recursive_doubling", 4:"ring",
5:"segmented_ring", 6:"rabenseifner"
   MCA coll tuned: parameter
"coll_tuned_allreduce_algorithm_segmentsize" (current value: "0", data
source: default, level: 5 tuner/detail, type: int)

For OpenMPI 4.0, there is a tuning program [2] that might also be helpful.

[1]
https://stackoverflow.com/questions/36635061/how-to-check-which-mca-parameters-are-used-in-openmpi
[2] https://github.com/open-mpi/ompi-collectives-tuning

On 2/7/22 4:49 PM, Bertini, Denis Dr. wrote:
> Hi
>
> When i repeat i always got the huge discrepancy at the
>
> message size of 16384.
>
> May be there is a way to run mpi in verbose mode in order
>
> to further investigate this behaviour?
>
> Best
>
> Denis
>
> 
> *From:* users  on behalf of Benson
> Muite via users 
> *Sent:* Monday, February 7, 2022 2:27:34 PM
> *To:* users@lists.open-mpi.org
> *Cc:* Benson Muite
> *Subject:* Re: [OMPI users] Using OSU benchmarks for checking Infiniband
> network
> Hi,
> Do you get similar results when you repeat the test? Another job could
> have interfered with your run.
> Benson
> On 2/7/22 3:56 PM, Bertini, Denis Dr. via users wrote:
>> Hi
>>
>> I am using OSU microbenchmarks compiled with openMPI 3.1.6 in order to
>> check/benchmark
>>
>> the infiniband network for our cluster.
>>
>> For that i use the collective all_reduce benchmark and run over 200
>> nodes, using 1 process per node.
>>
>> And this is the results i obtained 
>>
>>
>>
>> 
>>
>> # OSU MPI Allreduce Latency Test v5.7.1
>> # Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations
>> 4                     114.65             83.22            147.98        1000
>> 8                     133.85            106.47            164.93        1000
>> 16                    116.41             87.57            150.58        1000
>> 32                    112.17             93.25            130.23        1000
>> 64                    106.85             81.93            134.74        1000
>> 128                   117.53             87.50            152.27        1000
>> 256                   143.08            115.63            173.97        1000
>> 512                   130.34            100.20            167.56        1000
>> 1024                  155.67            111.29            188.20        1000
>> 2048                  151.82            116.03            198.19        1000
>> 4096                  159.11            122.09            199.24        1000
>> 8192                  176.74            143.54            221.98

Re: [OMPI users] Using OSU benchmarks for checking Infiniband network

2022-02-07 Thread Benson Muite via users

Following https://www.open-mpi.org/doc/v3.1/man1/mpirun.1.php

mpirun --verbose --display-map

Have you tried newer OpenMPI versions?

Do you get similar behavior for the osu_reduce and osu_gather benchmarks?

Typically internal buffer sizes as well as your hardware will affect 
performance. Can you give specifications similar to what is available at:

http://mvapich.cse.ohio-state.edu/performance/collectives/
where the operating system, switch, node type and memory are indicated.

If you need good performance, you may want to also specify the algorithm 
used. You can find some of the parameters you can tune using:


ompi_info --all

A particularly helpful parameter is:

MCA coll tuned: parameter "coll_tuned_allreduce_algorithm" (current 
value: "ignore", data source: default, level: 5 tuner/detail, type: int)
  Which allreduce algorithm is used. Can be 
locked down to any of: 0 ignore, 1 basic linear, 2 nonoverlapping (tuned 
reduce + tuned bcast), 3 recursive doubling, 4 ring, 5 segmented ring
  Valid values: 0:"ignore", 1:"basic_linear", 
2:"nonoverlapping", 3:"recursive_doubling", 4:"ring", 
5:"segmented_ring", 6:"rabenseifner"
  MCA coll tuned: parameter 
"coll_tuned_allreduce_algorithm_segmentsize" (current value: "0", data 
source: default, level: 5 tuner/detail, type: int)
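
For example, a sketch of forcing a particular allreduce algorithm from the
command line (untested on your system; the dynamic-rules parameter is needed
so that the forced value is not ignored):

mpirun --mca coll_tuned_use_dynamic_rules 1 \
       --mca coll_tuned_allreduce_algorithm 4 \
       -np 200 ./osu_allreduce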


For OpenMPI 4.0, there is a tuning program [2] that might also be helpful.

[1] 
https://stackoverflow.com/questions/36635061/how-to-check-which-mca-parameters-are-used-in-openmpi

[2] https://github.com/open-mpi/ompi-collectives-tuning

On 2/7/22 4:49 PM, Bertini, Denis Dr. wrote:

Hi

When i repeat i always got the huge discrepancy at the

message size of 16384.

May be there is a way to run mpi in verbose mode in order

to further investigate this behaviour?

Best

Denis


*From:* users  on behalf of Benson 
Muite via users 

*Sent:* Monday, February 7, 2022 2:27:34 PM
*To:* users@lists.open-mpi.org
*Cc:* Benson Muite
*Subject:* Re: [OMPI users] Using OSU benchmarks for checking Infiniband 
network

Hi,
Do you get similar results when you repeat the test? Another job could
have interfered with your run.
Benson
On 2/7/22 3:56 PM, Bertini, Denis Dr. via users wrote:

Hi

I am using OSU microbenchmarks compiled with openMPI 3.1.6 in order to 
check/benchmark


the infiniband network for our cluster.

For that i use the collective all_reduce benchmark and run over 200 
nodes, using 1 process per node.


And this is the results i obtained 





# OSU MPI Allreduce Latency Test v5.7.1
# Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations
4                     114.65             83.22            147.98        1000
8                     133.85            106.47            164.93        1000
16                    116.41             87.57            150.58        1000
32                    112.17             93.25            130.23        1000
64                    106.85             81.93            134.74        1000
128                   117.53             87.50            152.27        1000
256                   143.08            115.63            173.97        1000
512                   130.34            100.20            167.56        1000
1024                  155.67            111.29            188.20        1000
2048                  151.82            116.03            198.19        1000
4096                  159.11            122.09            199.24        1000
8192                  176.74            143.54            221.98        1000
16384               48862.85          39270.21          54970.96        1000
32768                2737.37           2614.60           2802.68        1000
65536                2723.15           2585.62           2813.65        1000



Could someone explain me what is happening for message = 16384 ?
One can notice a huge latency (~ 300 time larger)  compare to message 
size = 8192.
I do not really understand what could  create such an increase in the 
latency.
The reason i use the OSU microbenchmarks is that we 
sporadically experience a drop
in the bandwith for typical collective operations such as MPI_Reduce in 
our cluster

which is difficult to understand.
I would be grateful if somebody can share its expertise or such problem 
with me.


Best,
Denis



-
Denis Bertini
Abteilung: CIT
Ort: SB3 2.265a

Tel: +49 6159 71 2240
Fax: +49 6159 71 2986
E-Mail: d.bert...@gsi.de

GSI Helmholtzzentrum für Schwerionenforschung GmbH
Planckstraße 1, 64291 Darmstadt, Germany, www.gsi.de

Commercial Register / Handelsregister: Amtsgericht Darmstadt, HRB 1528
Managing Directors / Geschäftsführung:
Professor Dr. Paolo Giubellino, Dr. Ulrich Breuer, Jörg Blaurock
Chairman of the GSI Supervisory Board / Vorsitzender des 

Re: [OMPI users] Error using rankfile to bind multiple cores on the same node for threaded OpenMPI application (users Digest, Vol 4715, Issue 1)

2022-02-07 Thread David Perozzi via users

Hi Bernd,

Thanks for your valuable input! Your suggested approach indeed seems 
like the correct one and is actually what I've always wanted to do. In 
the past, I've also asked our cluster support if there was this 
possibility, but they always suggested the following approach:



export OMP_NUM_THREADS=T
bsub -n N -R "span[ptile=T]" "unset LSB_AFFINITY_HOSTFILE ; mpirun -n 
M --map-by node:PE=T ./my_hybrid_program"


where N=M*T (https://scicomp.ethz.ch/wiki/Hybrid_jobs). However, this 
can sometimes be indirectly "penalized" at the dispatch because of the 
ptile constraint (which is not strictly needed, as block would be an 
acceptable, looser constraint). That's why I wanted to define the 
rankfile by myself (to use a span[block=] requirement).
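
For reference, the kind of rankfile I generate looks roughly like this (the
hostnames and core ranges below are made up for illustration):

rank 0=node-01 slot=0-3
rank 1=node-01 slot=4-7
rank 2=node-02 slot=0-3
rank 3=node-02 slot=4-7

which I then pass to mpirun with --rankfile.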


I now tried your suggested commands and could get what I want with a 
slight variation:


bsub -n 6 -R "span[block=2] affinity[core(4,same=numa, 
exclusive=(core,injob)):cpubind=numa:membind=localprefer]" "export 
OMP_NUM_THREADS=4 ; export OMP_PLACES=cores; export 
OMP_PROC_BIND=true; mpirun -n 6 -nooversubscribe -map-by slot:PE=4 
./test_dispatch"
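
To double-check what I actually get, I also run a variant with binding reports
turned on (a sketch, using the same placeholders as the command above):

export OMP_DISPLAY_AFFINITY=true
mpirun -n 6 --report-bindings -map-by slot:PE=4 ./test_dispatch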


Something that seems different between the cluster you are using and the 
one I am using is that I have to set the correct value of 
OMP_NUM_THREADS myself. Also, on our cluster, bsub's -n parameter is 
interpreted as "core", not as "slot". So I'm not sure of two things:


1. Is the correct number of cores allocated to my job, or can they
   somehow be oversubscribed by other jobs?
2. I think that a problem may be that the memory reservation
   (rusage[mem=]) refers to the "cores" as defined by the -n
   parameter. So, for a job that needs a lot of threads and memory, I
   would have to increase the memory "per core" (i.e., per slot,
   actually), and I'm not sure whether bsub then accounts for it correctly.


Anyway, I tried to run my code for about 5 hours, once with an affinity 
requirement and 16 slots of 16 cores each, and once with a ptile 
requirement resulting in the same configuration (MPI processes and 
OpenMP threads), as suggested by our support. To my surprise, the latter 
was much more efficient than the former (I expected the former to give 
the same performance or better). I explicitly chose nodes with the same 
architecture.
With affinity, the CPU utilization was 22% and the simulation did not 
get past 10% progress; with ptile it was 58% and reached about 35% 
progress. Overall, the values for CPU utilization are so low because 
in the first hour the input files must be read and the simulation set 
up (serially, and involving the reading of many small files). As a 
reference, another simulation has been running for 60 hours and has 93% 
CPU utilization.


I know I should clarify this with my cluster support, but I wanted to share 
my experience. You may also have an idea why I get such bad 
performance. Is it possible that bsub is configured differently 
from yours? (By the way, I'm using IBM Spectrum LSF Standard 10.1.0.7.)



Best regards,
David





On 03.02.22 13:23, Bernd Dammann via users wrote:

Hi David,

On 03/02/2022 00:03 , David Perozzi wrote:

Helo,

I'm trying to run a code implemented with OpenMPI and OpenMP (for 
threading) on a large cluster that uses LSF for the job scheduling 
and dispatch. The problem with LSF is that it is not very 
straightforward to allocate and bind the right amount of threads to 
an MPI rank inside a single node. Therefore, I have to create a 
rankfile myself, as soon as the (a priori unknown) ressources are 
allocated.


So, after my job get dispatched, I run:

mpirun -n "$nslots" -display-allocation -nooversubscribe --map-by 
core:PE=1 --bind-to core mpi_allocation/show_numactl.sh 
 >mpi_allocation/allocation_files/allocation.txt


Just out of curiosity: why do you not use the built-in LSF features to 
do this mapping?  Something like


#BSUB -n 4
#BSUB -R "span[block=1] affinity[core(4)]"

mpirun ./MyHybridApplication

This will give you 4 cores for each of your 4 MPI ranks, and it sets 
OMP_NUM_THREADS=4 automatically.  LSF's affinity is even more fine 
grained, so you can specify that the 4 cores should be on one socket 
(e.g. if your application is memory bound, and you want to make use of 
more memory bandwidth).  Check the LSF documentation for more details.


Examples:

1) with span[block=...] (allow LSF to place resources on one host)

#BSUB -n 4
#BSUB -R "span[block=1] affinity[core(4)]"

export OMP_DISPLAY_AFFINITY=true
export OMP_AFFINITY_FORMAT="host: %H PID: %P TID: %n affinity: %A"
mpirun --tag-output ./hello

gives this output (sorted):

[1,0]:host: node-23-8 PID: 2798 TID: 0 affinity: 0
[1,0]:host: node-23-8 PID: 2798 TID: 1 affinity: 1
[1,0]:host: node-23-8 PID: 2798 TID: 2 affinity: 2
[1,0]:host: node-23-8 PID: 2798 TID: 3 affinity: 3
[1,0]:Hello world from thread 0!
[1,0]:Hello world from thread 1!
[1,0]:Hello world from thread 2!
[1,0]:Hello world from thread 3!
[1,1]:host: node-23-8 PID: 2799 TID: 0 affinity: 4
[1,1]:host: 

Re: [OMPI users] Using OSU benchmarks for checking Infiniband network

2022-02-07 Thread Bertini, Denis Dr. via users
Hi

When I repeat the test I always get the huge discrepancy at the
message size of 16384.

Maybe there is a way to run MPI in verbose mode in order
to further investigate this behaviour?

Best

Denis


From: users  on behalf of Benson Muite via 
users 
Sent: Monday, February 7, 2022 2:27:34 PM
To: users@lists.open-mpi.org
Cc: Benson Muite
Subject: Re: [OMPI users] Using OSU benchmarks for checking Infiniband network

Hi,
Do you get similar results when you repeat the test? Another job could
have interfered with your run.
Benson
On 2/7/22 3:56 PM, Bertini, Denis Dr. via users wrote:
> Hi
>
> I am using OSU microbenchmarks compiled with openMPI 3.1.6 in order to
> check/benchmark
>
> the infiniband network for our cluster.
>
> For that i use the collective all_reduce benchmark and run over 200
> nodes, using 1 process per node.
>
> And this is the results i obtained 
>
>
>
> 
>
> # OSU MPI Allreduce Latency Test v5.7.1
> # Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations
> 4                     114.65             83.22            147.98        1000
> 8                     133.85            106.47            164.93        1000
> 16                    116.41             87.57            150.58        1000
> 32                    112.17             93.25            130.23        1000
> 64                    106.85             81.93            134.74        1000
> 128                   117.53             87.50            152.27        1000
> 256                   143.08            115.63            173.97        1000
> 512                   130.34            100.20            167.56        1000
> 1024                  155.67            111.29            188.20        1000
> 2048                  151.82            116.03            198.19        1000
> 4096                  159.11            122.09            199.24        1000
> 8192                  176.74            143.54            221.98        1000
> 16384               48862.85          39270.21          54970.96        1000
> 32768                2737.37           2614.60           2802.68        1000
> 65536                2723.15           2585.62           2813.65        1000
>
> 
>
> Could someone explain me what is happening for message = 16384 ?
> One can notice a huge latency (~ 300 time larger)  compare to message
> size = 8192.
> I do not really understand what could  create such an increase in the
> latency.
> The reason i use the OSU microbenchmarks is that we
> sporadically experience a drop
> in the bandwith for typical collective operations such as MPI_Reduce in
> our cluster
> which is difficult to understand.
> I would be grateful if somebody can share its expertise or such problem
> with me.
>
> Best,
> Denis
>
>
>
> -
> Denis Bertini
> Abteilung: CIT
> Ort: SB3 2.265a
>
> Tel: +49 6159 71 2240
> Fax: +49 6159 71 2986
> E-Mail: d.bert...@gsi.de
>
> GSI Helmholtzzentrum für Schwerionenforschung GmbH
> Planckstraße 1, 64291 Darmstadt, Germany, www.gsi.de
>
> Commercial Register / Handelsregister: Amtsgericht Darmstadt, HRB 1528
> Managing Directors / Geschäftsführung:
> Professor Dr. Paolo Giubellino, Dr. Ulrich Breuer, Jörg Blaurock
> Chairman of the GSI Supervisory Board / Vorsitzender des GSI-Aufsichtsrats:
> Ministerialdirigent Dr. Volkmar Dietz
>



Re: [OMPI users] Using OSU benchmarks for checking Infiniband network

2022-02-07 Thread Benson Muite via users

Hi,
Do you get similar results when you repeat the test? Another job could 
have interfered with your run.

Benson
On 2/7/22 3:56 PM, Bertini, Denis Dr. via users wrote:

Hi

I am using OSU microbenchmarks compiled with openMPI 3.1.6 in order to 
check/benchmark


the infiniband network for our cluster.

For that i use the collective all_reduce benchmark and run over 200 
nodes, using 1 process per node.


And this is the results i obtained 





# OSU MPI Allreduce Latency Test v5.7.1
# Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations
4                     114.65             83.22            147.98        1000
8                     133.85            106.47            164.93        1000
16                    116.41             87.57            150.58        1000
32                    112.17             93.25            130.23        1000
64                    106.85             81.93            134.74        1000
128                   117.53             87.50            152.27        1000
256                   143.08            115.63            173.97        1000
512                   130.34            100.20            167.56        1000
1024                  155.67            111.29            188.20        1000
2048                  151.82            116.03            198.19        1000
4096                  159.11            122.09            199.24        1000
8192                  176.74            143.54            221.98        1000
16384               48862.85          39270.21          54970.96        1000
32768                2737.37           2614.60           2802.68        1000
65536                2723.15           2585.62           2813.65        1000



Could someone explain me what is happening for message = 16384 ?
One can notice a huge latency (~ 300 time larger)  compare to message 
size = 8192.
I do not really understand what could  create such an increase in the 
latency.
The reason i use the OSU microbenchmarks is that we 
sporadically experience a drop
in the bandwith for typical collective operations such as MPI_Reduce in 
our cluster

which is difficult to understand.
I would be grateful if somebody can share its expertise or such problem 
with me.


Best,
Denis



-
Denis Bertini
Abteilung: CIT
Ort: SB3 2.265a

Tel: +49 6159 71 2240
Fax: +49 6159 71 2986
E-Mail: d.bert...@gsi.de

GSI Helmholtzzentrum für Schwerionenforschung GmbH
Planckstraße 1, 64291 Darmstadt, Germany, www.gsi.de

Commercial Register / Handelsregister: Amtsgericht Darmstadt, HRB 1528
Managing Directors / Geschäftsführung:
Professor Dr. Paolo Giubellino, Dr. Ulrich Breuer, Jörg Blaurock
Chairman of the GSI Supervisory Board / Vorsitzender des GSI-Aufsichtsrats:
Ministerialdirigent Dr. Volkmar Dietz





[OMPI users] Using OSU benchmarks for checking Infiniband network

2022-02-07 Thread Bertini, Denis Dr. via users
Hi

I am using the OSU microbenchmarks compiled with Open MPI 3.1.6 in order to
check/benchmark the InfiniBand network for our cluster.

For that I use the collective all_reduce benchmark and run over 200 nodes,
using 1 process per node.

These are the results I obtained:





# OSU MPI Allreduce Latency Test v5.7.1
# Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations
4                     114.65             83.22            147.98        1000
8                     133.85            106.47            164.93        1000
16                    116.41             87.57            150.58        1000
32                    112.17             93.25            130.23        1000
64                    106.85             81.93            134.74        1000
128                   117.53             87.50            152.27        1000
256                   143.08            115.63            173.97        1000
512                   130.34            100.20            167.56        1000
1024                  155.67            111.29            188.20        1000
2048                  151.82            116.03            198.19        1000
4096                  159.11            122.09            199.24        1000
8192                  176.74            143.54            221.98        1000
16384               48862.85          39270.21          54970.96        1000
32768                2737.37           2614.60           2802.68        1000
65536                2723.15           2585.62           2813.65        1000



Could someone explain to me what is happening for message size = 16384?
One can notice a huge latency (~300 times larger) compared to message size = 8192.
I do not really understand what could create such an increase in latency.
The reason I use the OSU microbenchmarks is that we sporadically experience a drop
in the bandwidth for typical collective operations such as MPI_Reduce in our cluster,
which is difficult to understand.
I would be grateful if somebody could share their expertise on such a problem with me.

Best,
Denis



-
Denis Bertini
Abteilung: CIT
Ort: SB3 2.265a

Tel: +49 6159 71 2240
Fax: +49 6159 71 2986
E-Mail: d.bert...@gsi.de

GSI Helmholtzzentrum für Schwerionenforschung GmbH
Planckstraße 1, 64291 Darmstadt, Germany, www.gsi.de

Commercial Register / Handelsregister: Amtsgericht Darmstadt, HRB 1528
Managing Directors / Geschäftsführung:
Professor Dr. Paolo Giubellino, Dr. Ulrich Breuer, Jörg Blaurock
Chairman of the GSI Supervisory Board / Vorsitzender des GSI-Aufsichtsrats:
Ministerialdirigent Dr. Volkmar Dietz


[hwloc-users] glibc struggling with get_nprocs and get_nprocs_conf

2022-02-07 Thread Samuel Thibault
Hello,

For information, glibc is struggling with the problem of the precise
meaning of get_nprocs, get_nprocs_conf, _SC_NPROCESSORS_CONF, and
_SC_NPROCESSORS_ONLN:

https://sourceware.org/pipermail/libc-alpha/2022-February/136177.html
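
For anyone who wants to see the distinction on their own machine, a quick check
from the shell (glibc/Linux; the getconf variable names may differ elsewhere):

getconf _NPROCESSORS_CONF   # processors configured
getconf _NPROCESSORS_ONLN   # processors currently online
nproc                       # coreutils; also honours the process affinity mask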

Samuel
___
hwloc-users mailing list
hwloc-users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/hwloc-users