Re: [OMPI users] Gadget2 error 818 when using more than 1 process?
Sorry for the late answer. I thought the same, but after more testing now I don't, since re-running the same code on the same data on the same node with the same parameters sometimes works and sometimes doesn't. The user says it works (reliably) unmodified on other clusters. We'll try contacting the Gadget2 authors, too.

On 27/01/2022 14:52, Jeff Squyres (jsquyres) wrote:
> I'm afraid that without any further details, it's hard to help. I don't know why Gadget2 would complain about its parameters file. From what you've stated, it could be a problem with the application itself. Have you talked to the Gadget2 authors?
> --
> Jeff Squyres
> jsquy...@cisco.com

From: users on behalf of Diego Zuccato via users
Sent: Wednesday, January 26, 2022 2:06 AM
To: users@lists.open-mpi.org
Cc: Diego Zuccato
Subject: Re: [OMPI users] Gadget2 error 818 when using more than 1 process?

On 26/01/2022 02:10, Jeff Squyres (jsquyres) via users wrote:
> I'm afraid I don't know anything about Gadget, so I can't comment there. How exactly does the application fail?

Neither do I :( It fails saying a 'timestep' is 0, which is usually caused by an error in the parameters file. But the parameters file is OK, and it actually works if the user runs it in a single process. Or even with multithreaded runs, sometimes and on some nodes. It's quite random :( But the runs are usually single-node (simple examples for students).

> Can you try upgrading to Open MPI v4.1.2?

That would be a real mess. I'm stuck with the packages provided by Debian stable. I lack both the manpower and the knowledge to compile everything from scratch, given the intricate relations between Slurm, Open MPI, InfiniBand, etc. :(

> What networking are you using?

InfiniBand (Mellanox cards, with Debian-supplied drivers and support programs) and Ethernet. InfiniBand is also used via IPoIB to reach the storage servers (Gluster). Some nodes lack IB, so access to the storage is achieved by a couple of iptables rules.

From: users on behalf of Diego Zuccato via users
Sent: Tuesday, January 25, 2022 5:43 AM
To: Open MPI Users
Cc: Diego Zuccato
Subject: [OMPI users] Gadget2 error 818 when using more than 1 process?

Hello all.
A user of our cluster is experiencing a weird problem that I can't pinpoint. He has a job script that worked well on every node. It's based on Gadget2. Lately, *sometimes* the same executable with the same parameters file works and sometimes it fails, on the same node and submitted with the same command. On some nodes it always fails. But if the run is reduced to sequential (asking for just one process), it completes correctly (so the parameters file, a common source of Gadget2 error 818, seems innocent). The cluster uses Slurm and limits resources using cgroups, if that matters.

It seems most of the issues started after upgrading from Open MPI 3.1.3 to 4.1.0 in September. Maybe related, the nodes started spitting out these warnings (which IIUC should be harmless... but I'd like to debug and resolve them anyway):
-8<--
Open MPI's OFI driver detected multiple equidistant NICs from the current process, but had insufficient information to ensure MPI processes fairly pick a NIC for use. This may negatively impact performance. A more modern PMIx server is necessary to resolve this issue.
-8<--

The code is run (from the jobfile) with:
srun --mpi=pmix_v4 ./Gadget2 paramfile
(we also tried a simple mpirun with no extra parameters, leveraging Slurm's integration/autodetection -- same result)

Any hints? TIA

--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786
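For reference, a minimal Slurm job script built around the launch line reported above; the job name, node/task counts and output file are placeholders, only the srun invocation comes from the report:

#!/bin/bash
#SBATCH --job-name=gadget2-test     # placeholder job name
#SBATCH --nodes=1                   # the failing runs are usually single-node
#SBATCH --ntasks=4                  # number of MPI processes; vary this to see when error 818 appears
#SBATCH --output=gadget2-%j.log     # one log per job id

# Launch exactly as reported on the list; pmix_v4 must match the PMIx plugin installed on the cluster
srun --mpi=pmix_v4 ./Gadget2 paramfile

Varying --ntasks between 1 and a few processes per node would make it easier to confirm that the failure really appears only when more than one process is used.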
Re: [OMPI users] [EXTERNAL] openib BTL disabled when using MPI_Init_thread
Hi Jose,

I bet this device has not been tested with UCX. You may want to join the UCX users mailing list at https://elist.ornl.gov/mailman/listinfo/ucx-group and ask whether this Marvell device has been tested, and whether there are workarounds for disabling features that this device doesn't support. Again though, you really may want to first see if the TCP BTL will be good enough for your cluster.

Howard

On 2/4/22, 8:03 AM, "Jose E. Roman" wrote:

Howard,

I don't have much time now to try with --enable-debug.

The RoCE device we have is a FastLinQ QL41000 Series 10/25/40/50GbE Controller. The output of ibv_devinfo is:

hca_id: qedr0
    transport:       InfiniBand (0)
    fw_ver:          8.20.0.0
    node_guid:       2267:7cff:fe11:4a50
    sys_image_guid:  2267:7cff:fe11:4a50
    vendor_id:       0x1077
    vendor_part_id:  32880
    hw_ver:          0x0
    phys_port_cnt:   1
    port: 1
        state:       PORT_ACTIVE (4)
        max_mtu:     4096 (5)
        active_mtu:  1024 (3)
        sm_lid:      0
        port_lid:    0
        port_lmc:    0x00
        link_layer:  Ethernet

hca_id: qedr1
    transport:       InfiniBand (0)
    fw_ver:          8.20.0.0
    node_guid:       2267:7cff:fe11:4a51
    sys_image_guid:  2267:7cff:fe11:4a51
    vendor_id:       0x1077
    vendor_part_id:  32880
    hw_ver:          0x0
    phys_port_cnt:   1
    port: 1
        state:       PORT_DOWN (1)
        max_mtu:     4096 (5)
        active_mtu:  1024 (3)
        sm_lid:      0
        port_lid:    0
        port_lmc:    0x00
        link_layer:  Ethernet

Regarding UCX, we have tried with the latest version. Compilation goes through, but the ucx_info command gives an error:

# Memory domain: qedr0
#     Component: ib
#     register: unlimited, cost: 180 nsec
#     remote key: 8 bytes
#     local memory handle is required for zcopy
#
#   Transport: rc_verbs
#     Device: qedr0:1
#     Type: network
#     System device: qedr0 (0)
[1643982133.674556] [kahan01:8217 :0] rc_iface.c:505 UCX ERROR ibv_create_srq() failed: Function not implemented
#   < failed to open interface >
#
#   Transport: ud_verbs
#     Device: qedr0:1
#     Type: network
#     System device: qedr0 (0)
[qelr_create_qp:545]create qp: failed on ibv_cmd_create_qp with 22
[1643982133.681169] [kahan01:8217 :0] ib_iface.c:994 UCX ERROR iface=0x56074944bf10: failed to create UD QP TX wr:256 sge:6 inl:64 resp:0 RX wr:4096 sge:1 resp:0: Invalid argument
#   < failed to open interface >
#
# Memory domain: qedr1
#     Component: ib
#     register: unlimited, cost: 180 nsec
#     remote key: 8 bytes
#     local memory handle is required for zcopy
#   < no supported devices found >

Any idea what the error in ibv_create_srq() means? Thanks for your help.

Jose

> On 3 Feb 2022, at 17:52, Pritchard Jr., Howard wrote:
>
> Hi Jose,
>
> A number of things.
>
> First, for recent versions of Open MPI, including the 4.1.x release stream, MPI_THREAD_MULTIPLE is supported by default. However, some transport options available when using MPI_Init may not be available when requesting MPI_THREAD_MULTIPLE. You may want to let Open MPI trundle along with TCP used for inter-node messaging and see if your application performs well enough. For a small system TCP may well suffice.
>
> Second, if you want to pursue this further, you want to rebuild Open MPI with --enable-debug. The debug output will be considerably more verbose and provides more info. I think you will get a message saying the rdmacm CPC is excluded owing to the requested thread support level. There may be info about why udcm is not selected as well.
>
> Third, what sort of RoCE devices are available on your system? The output from ibv_devinfo may be useful.
>
> As for UCX, if it's the version that came with your Ubuntu 18.04 release it may be pretty old. It's likely that UCX has not been tested on the RoCE devices on your system.
>
> You can run
>
> ucx_info -v
>
> to
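Two of the suggestions in this thread can be turned into concrete commands. A sketch, assuming a build from source and an application binary called ./my_mpi_app (both placeholders):

# Rebuild Open MPI with debugging enabled to get the more verbose transport-selection output:
./configure --prefix=$HOME/openmpi-4.1-debug --enable-debug
make -j 8 && make install

# Check whether the TCP BTL is good enough, and ask the BTL framework why openib was excluded:
mpirun --mca pml ob1 --mca btl self,vader,tcp ./my_mpi_app
mpirun --mca btl_base_verbose 100 ./my_mpi_app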
Re: [OMPI users] Using OSU benchmarks for checking Infiniband network
This may have changed since, but these used to be relevant points. Overall, the Open MPI FAQ has lots of good suggestions: https://www.open-mpi.org/faq/
Some specific to performance tuning:
https://www.open-mpi.org/faq/?category=tuning
https://www.open-mpi.org/faq/?category=openfabrics

1) Make sure you are not using Ethernet TCP/IP, which is widely available on compute nodes:
mpirun --mca btl self,sm,openib ...
https://www.open-mpi.org/faq/?category=tuning#selecting-components
However, this may have changed lately:
https://www.open-mpi.org/faq/?category=tcp#tcp-auto-disable

2) Maximum locked memory used by IB and its system limit. Start here:
https://www.open-mpi.org/faq/?category=openfabrics#limiting-registered-memory-usage

3) The eager vs. rendezvous message size threshold. I wonder if it may sit right where you see the latency spike.
https://www.open-mpi.org/faq/?category=all#ib-locked-pages-user

4) Processor and memory locality/affinity and binding (please check the current options and syntax):
https://www.open-mpi.org/faq/?category=tuning#using-paffinity-v1.4

On Mon, Feb 7, 2022 at 11:01 AM Benson Muite via users <users@lists.open-mpi.org> wrote:
> Following https://www.open-mpi.org/doc/v3.1/man1/mpirun.1.php
> mpirun --verbose --display-map
> Have you tried newer OpenMPI versions?
> Do you get similar behavior for the osu_reduce and osu_gather benchmarks?
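The first, second and fourth points can be checked quickly from the command line. A sketch, assuming the osu_allreduce binary from the original post (the BTL names match the 3.x series used in this thread; newer 4.x releases call the shared-memory BTL "vader"):

# 1) Keep traffic off TCP by naming the BTLs explicitly:
mpirun --mca btl self,sm,openib ./osu_allreduce

# 2) Check the locked-memory limit the job actually sees (should be "unlimited" on IB nodes):
ulimit -l

# 3) Inspect the openib eager/rendezvous thresholds:
ompi_info --param btl openib --level 9 | grep -i eager

# 4) Bind processes and print the resulting bindings to rule out locality effects:
mpirun --bind-to core --map-by core --report-bindings ./osu_allreduce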
Re: [OMPI users] Using OSU benchmarks for checking Infiniband network
Hi,

I changed the algorithm used to the ring algorithm (4), for example, and the scan changed to:

# OSU MPI Allreduce Latency Test v5.7.1
# Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)   Iterations
4                  59.39             51.04             65.36                1
8                 109.13             90.14            126.32                1
16                253.26             60.89            290.31                1
32                 75.04             54.53             83.28                1
64                 96.40             59.73            111.45                1
128                67.86             59.73             76.44                1
256                76.32             67.33             85.18                1
512               129.93             85.76            170.31                1
1024              168.51            129.15            194.68                1
2048              136.17            110.09            156.94                1
4096              173.59            130.76            199.21                1
8192              236.05            170.77            269.98                1
16384            4212.65           3627.71           4992.04                1
32768            1243.05           1205.11           1276.11                1
65536            1464.50           1364.76           1531.48                1
131072           1558.71           1454.52           1632.91                1
262144           1681.58           1609.15           1745.44                1
524288           2305.73           2178.17           2402.69                1
1048576          3389.83           3220.44           3517.61                1

Would this mean that the first results were linked to the underlying algorithm used by default in Open MPI (0 = ignore)? Do you know what this algorithm (0 = ignore) is? I still see the wall at message size 16384, though...

Best,
Denis

From: Benson Muite
Sent: Monday, February 7, 2022 4:59:45 PM
To: Bertini, Denis Dr.; Open MPI Users
Subject: Re: [OMPI users] Using OSU benchmarks for checking Infiniband network

Following https://www.open-mpi.org/doc/v3.1/man1/mpirun.1.php
mpirun --verbose --display-map
Have you tried newer OpenMPI versions?
Do you get similar behavior for the osu_reduce and osu_gather benchmarks?
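For reference, this is how a specific allreduce algorithm can be forced from the command line (the benchmark binary name is the one used in this thread; coll_tuned_use_dynamic_rules has to be enabled for the per-collective override to take effect). With the default value 0 ("ignore"), the tuned component picks an algorithm from its built-in decision rules based on message and communicator size rather than using one fixed algorithm:

# 4 = ring, per the ompi_info listing quoted in this thread
mpirun --mca coll_tuned_use_dynamic_rules 1 \
       --mca coll_tuned_allreduce_algorithm 4 \
       ./osu_allreduce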
Re: [OMPI users] Using OSU benchmarks for checking Infiniband network
Hi,

I ran the allgather benchmarks and got these values, which also show a step-wise performance drop as a function of message size. Would this be linked to the underlying algorithm used for the collective operation?

OSU MPI Allgather Latency Test v5.7.1
# Size       Avg Latency(us)
1                  70.36
2                  47.01
4                  72.42
8                  49.62
16                 57.93
32                 50.11
64                 57.29
128                74.05
256               454.41
512               544.04
1024              580.96
2048              711.40
4096              905.14
8192             2002.32
16384            2652.59
32768            4034.35
65536            6816.29
131072          14280.11
262144          28451.46
524288          54719.41
1048576        106607.19

I use srun and not mpirun; how do I activate the flag for verbosity in that case?

Best,
Denis

From: Benson Muite
Sent: Monday, February 7, 2022 4:59:45 PM
To: Bertini, Denis Dr.; Open MPI Users
Subject: Re: [OMPI users] Using OSU benchmarks for checking Infiniband network

Following https://www.open-mpi.org/doc/v3.1/man1/mpirun.1.php
mpirun --verbose --display-map
Have you tried newer OpenMPI versions?
Do you get similar behavior for the osu_reduce and osu_gather benchmarks?
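When launching with srun there is no mpirun command line to attach MCA options to, but Open MPI also reads them from OMPI_MCA_* environment variables, so verbosity and algorithm choices can be exported before srun. A sketch (the benchmark binary name is a placeholder; check ompi_info --all for the valid algorithm numbers):

export OMPI_MCA_coll_base_verbose=10               # verbose output from the collective framework
export OMPI_MCA_coll_tuned_use_dynamic_rules=1
export OMPI_MCA_coll_tuned_allgather_algorithm=4   # pick an allgather algorithm by number
srun ./osu_allgather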
Re: [OMPI users] Using OSU benchmarks for checking Infiniband network
Following https://www.open-mpi.org/doc/v3.1/man1/mpirun.1.php

mpirun --verbose --display-map

Have you tried newer OpenMPI versions?

Do you get similar behavior for the osu_reduce and osu_gather benchmarks?

Typically internal buffer sizes as well as your hardware will affect performance. Can you give specifications similar to what is available at http://mvapich.cse.ohio-state.edu/performance/collectives/ where the operating system, switch, node type and memory are indicated?

If you need good performance, you may want to also specify the algorithm used. You can find some of the parameters you can tune using:

ompi_info --all

A particularly helpful parameter is:

MCA coll tuned: parameter "coll_tuned_allreduce_algorithm" (current value: "ignore", data source: default, level: 5 tuner/detail, type: int)
    Which allreduce algorithm is used. Can be locked down to any of: 0 ignore, 1 basic linear, 2 nonoverlapping (tuned reduce + tuned bcast), 3 recursive doubling, 4 ring, 5 segmented ring
    Valid values: 0:"ignore", 1:"basic_linear", 2:"nonoverlapping", 3:"recursive_doubling", 4:"ring", 5:"segmented_ring", 6:"rabenseifner"
MCA coll tuned: parameter "coll_tuned_allreduce_algorithm_segmentsize" (current value: "0", data source: default, level: 5 tuner/detail, type: int)

For OpenMPI 4.0, there is a tuning program [2] that might also be helpful.

[1] https://stackoverflow.com/questions/36635061/how-to-check-which-mca-parameters-are-used-in-openmpi
[2] https://github.com/open-mpi/ompi-collectives-tuning

On 2/7/22 4:49 PM, Bertini, Denis Dr. wrote:
> Hi
> When I repeat, I always get the huge discrepancy at the message size of 16384.
> Maybe there is a way to run MPI in verbose mode in order to further investigate this behaviour?
> Best
> Denis
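The parameter listing above comes from ompi_info; the tuned-collective knobs can also be listed directly, without wading through the full --all output. A sketch:

# Show every parameter of the "tuned" collective component, including the
# allreduce/allgather algorithm selections, down to the most detailed level:
ompi_info --param coll tuned --level 9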
Re: [OMPI users] Error using rankfile to bind multiple cores on the same node for threaded OpenMPI application (users Digest, Vol 4715, Issue 1)
Hi Bernd,

Thanks for your valuable input! Your suggested approach indeed seems like the correct one and is actually what I've always wanted to do. In the past, I've also asked our cluster support if there was this possibility, but they always suggested the following approach:

export OMP_NUM_THREADS=T
bsub -n N -R "span[ptile=T]" "unset LSB_AFFINITY_HOSTFILE ; mpirun -n M --map-by node:PE=T ./my_hybrid_program"

where N=M*T (https://scicomp.ethz.ch/wiki/Hybrid_jobs). However, this can sometimes be indirectly "penalized" at dispatch because of the ptile constraint (which is not strictly needed, as block would be an acceptable, looser constraint). That's why I wanted to define the rankfile by myself (to use a span[block=] requirement).

I now tried your suggested commands and could get what I want with a slight variation:

bsub -n 6 -R "span[block=2] affinity[core(4,same=numa, exclusive=(core,injob)):cpubind=numa:membind=localprefer]" "export OMP_NUM_THREADS=4 ; export OMP_PLACES=cores; export OMP_PROC_BIND=true; mpirun -n 6 -nooversubscribe -map-by slot:PE=4 ./test_dispatch"

Something that seems different between the cluster you are using and the one I am using is that I have to set the correct number of OMP_NUM_THREADS by myself. Also, on our cluster, bsub's -n parameter is taken as "cores", not as "slots". So I'm not sure of two things:

1. Is the correct number of cores allocated to my job, or can they somehow be oversubscribed by other jobs?
2. I think that a problem may be that the memory reservation (rusage[mem=]) refers to the "cores" as defined by the -n parameter. So, for a job that needs a lot of threads and memory, I would have to increase the memory "per core" (i.e., per slot, actually), and I'm not sure if bsub then considers it correctly.

Anyway, I tried to run my code for about 5 hours, once with an affinity requirement and 16 slots of 16 cores each, and once with a ptile requirement resulting in the same configuration (MPI processes and OpenMP threads) as suggested by our support. To my surprise, the latter was much more efficient than the former (I expected the former to give the same performance or better). I explicitly chose nodes with the same architecture. With affinity, the CPU utilization was 22% and the simulation did not get past 10% progress; with ptile it was 58% and reached about 35% progress. Overall, the values for CPU utilization are so low because in the first hour the input files must be read and the simulation set up (serially, and involving the reading of many small files). As a reference, another simulation has been running for 60 hours and has 93% CPU utilization.

I know I should clarify this with my cluster support, but I wanted to share my experience. You may also have an idea why I get such bad performance. Is it possible that bsub is configured in a different way than yours? (By the way, I'm using IBM Spectrum LSF Standard 10.1.0.7)

Best regards,
David

On 03.02.22 13:23, Bernd Dammann via users wrote:

Hi David,

On 03/02/2022 00:03, David Perozzi wrote:

Hello,

I'm trying to run a code implemented with OpenMPI and OpenMP (for threading) on a large cluster that uses LSF for job scheduling and dispatch. The problem with LSF is that it is not very straightforward to allocate and bind the right number of threads to an MPI rank inside a single node. Therefore, I have to create a rankfile myself, as soon as the (a priori unknown) resources are allocated.
So, after my job gets dispatched, I run:

mpirun -n "$nslots" -display-allocation -nooversubscribe --map-by core:PE=1 --bind-to core mpi_allocation/show_numactl.sh >mpi_allocation/allocation_files/allocation.txt

Just out of curiosity: why do you not use the built-in LSF features to do this mapping? Something like

#BSUB -n 4
#BSUB -R "span[block=1] affinity[core(4)]"
mpirun ./MyHybridApplication

This will give you 4 cores for each of your 4 MPI ranks, and it sets OMP_NUM_THREADS=4 automatically. LSF's affinity is even more fine-grained, so you can specify that the 4 cores should be on one socket (e.g. if your application is memory bound, and you want to make use of more memory bandwidth). Check the LSF documentation for more details.

Examples:

1) with span[block=...] (allow LSF to place resources on one host)

#BSUB -n 4
#BSUB -R "span[block=1] affinity[core(4)]"
export OMP_DISPLAY_AFFINITY=true
export OMP_AFFINITY_FORMAT="host: %H PID: %P TID: %n affinity: %A"
mpirun --tag-output ./hello

gives this output (sorted):

[1,0]:host: node-23-8 PID: 2798 TID: 0 affinity: 0
[1,0]:host: node-23-8 PID: 2798 TID: 1 affinity: 1
[1,0]:host: node-23-8 PID: 2798 TID: 2 affinity: 2
[1,0]:host: node-23-8 PID: 2798 TID: 3 affinity: 3
[1,0]:Hello world from thread 0!
[1,0]:Hello world from thread 1!
[1,0]:Hello world from thread 2!
[1,0]:Hello world from thread 3!
[1,1]:host: node-23-8 PID: 2799 TID: 0 affinity: 4
[1,1]:host:
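The variation reported as working above can also be written as a job script instead of a one-line bsub command, which keeps the resource string readable; the job name and application name are placeholders, the directives and mpirun options are the ones from the message:

#!/bin/bash
#BSUB -J hybrid-test
#BSUB -n 6
#BSUB -R "span[block=2] affinity[core(4,same=numa, exclusive=(core,injob)):cpubind=numa:membind=localprefer]"

export OMP_NUM_THREADS=4      # still has to be set by hand on this cluster (see above)
export OMP_PLACES=cores
export OMP_PROC_BIND=true

mpirun -n 6 -nooversubscribe --map-by slot:PE=4 ./my_hybrid_program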
Re: [OMPI users] Using OSU benchmarks for checking Infiniband network
Hi,

When I repeat, I always get the huge discrepancy at the message size of 16384. Maybe there is a way to run MPI in verbose mode in order to further investigate this behaviour?

Best,
Denis

From: users on behalf of Benson Muite via users
Sent: Monday, February 7, 2022 2:27:34 PM
To: users@lists.open-mpi.org
Cc: Benson Muite
Subject: Re: [OMPI users] Using OSU benchmarks for checking Infiniband network

Hi,
Do you get similar results when you repeat the test? Another job could have interfered with your run.
Benson

On 2/7/22 3:56 PM, Bertini, Denis Dr. via users wrote:
> Hi
> I am using OSU microbenchmarks compiled with openMPI 3.1.6 in order to check/benchmark the infiniband network for our cluster.
> For that i use the collective all_reduce benchmark and run over 200 nodes, using 1 process per node.
Re: [OMPI users] Using OSU benchmarks for checking Infiniband network
Hi,

Do you get similar results when you repeat the test? Another job could have interfered with your run.

Benson

On 2/7/22 3:56 PM, Bertini, Denis Dr. via users wrote:
Hi
I am using OSU microbenchmarks compiled with openMPI 3.1.6 in order to check/benchmark the infiniband network for our cluster.
For that i use the collective all_reduce benchmark and run over 200 nodes, using 1 process per node.
[OMPI users] Using OSU benchmarks for checking Infiniband network
Hi,

I am using the OSU microbenchmarks compiled with Open MPI 3.1.6 in order to check/benchmark the InfiniBand network for our cluster. For that I use the collective all_reduce benchmark and run over 200 nodes, using 1 process per node. These are the results I obtained:

# OSU MPI Allreduce Latency Test v5.7.1
# Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)   Iterations
4                 114.65             83.22            147.98             1000
8                 133.85            106.47            164.93             1000
16                116.41             87.57            150.58             1000
32                112.17             93.25            130.23             1000
64                106.85             81.93            134.74             1000
128               117.53             87.50            152.27             1000
256               143.08            115.63            173.97             1000
512               130.34            100.20            167.56             1000
1024              155.67            111.29            188.20             1000
2048              151.82            116.03            198.19             1000
4096              159.11            122.09            199.24             1000
8192              176.74            143.54            221.98             1000
16384           48862.85          39270.21          54970.96             1000
32768            2737.37           2614.60           2802.68             1000
65536            2723.15           2585.62           2813.65             1000

Could someone explain to me what is happening at message size 16384? One can notice a huge latency (~300 times larger) compared to message size 8192. I do not really understand what could create such an increase in the latency.

The reason I use the OSU microbenchmarks is that we sporadically experience a drop in the bandwidth for typical collective operations such as MPI_Reduce in our cluster, which is difficult to understand. I would be grateful if somebody could share their expertise on such a problem with me.

Best,
Denis

-
Denis Bertini
Abteilung: CIT
Ort: SB3 2.265a
Tel: +49 6159 71 2240
Fax: +49 6159 71 2986
E-Mail: d.bert...@gsi.de

GSI Helmholtzzentrum für Schwerionenforschung GmbH
Planckstraße 1, 64291 Darmstadt, Germany, www.gsi.de
Commercial Register / Handelsregister: Amtsgericht Darmstadt, HRB 1528
Managing Directors / Geschäftsführung: Professor Dr. Paolo Giubellino, Dr. Ulrich Breuer, Jörg Blaurock
Chairman of the GSI Supervisory Board / Vorsitzender des GSI-Aufsichtsrats: Ministerialdirigent Dr. Volkmar Dietz
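The post does not show the launch line; for what it's worth, a one-process-per-node allreduce run like the one described can be expressed to either launcher (the binary path is a placeholder, the node count is the one stated above):

# with mpirun: ppr:1:node places exactly one rank on each node
mpirun -np 200 --map-by ppr:1:node ./osu_allreduce

# or under Slurm
srun -N 200 --ntasks-per-node=1 ./osu_allreduce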
[hwloc-users] glibc struggling with get_nprocs and get_nprocs_conf
Hello,

For information, glibc is struggling with the problem of the precise meaning of get_nprocs, get_nprocs_conf, _SC_NPROCESSORS_CONF and _SC_NPROCESSORS_ONLN:
https://sourceware.org/pipermail/libc-alpha/2022-February/136177.html

Samuel
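These values can be compared directly on a node, which is a quick way to see where the glibc and hwloc views differ. A sketch, assuming the hwloc command-line tools are installed:

# glibc/sysconf view of the processor counts
getconf _NPROCESSORS_CONF
getconf _NPROCESSORS_ONLN
nproc

# hwloc view: number of processing units (PUs) in the detected topology
hwloc-calc --number-of pu machine:0
lstopo --only pu | wc -l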