Re: [OMPI users] An equivalent to btl_openib_include_if when MXM over Infiniband ?
Hi Devendar,

Thanks again for your answer.

I searched a little and found that UD stands for "Unreliable Datagram" while RC stands for the "Reliable Connected" transport mechanism. I also found another one, DC for "Dynamically Connected", which is not supported on our HCA. Do you know what the basic difference between them is? I didn't find any information about this. Which one is used by btl=openib (ibverbs), is it RC? Also, are they all standard, or are some of them supported only by Mellanox?

I will try to convince the admin of the system I'm using to increase the maximal shared segment size (SHMMAX). I guess what we have (e.g. 32 MB) is the default. But I didn't find any document suggesting that we should increase SHMMAX to help MXM. This is a bit odd; if it is important, it should at least be mentioned in the Mellanox documentation.

I will certainly check the messaging rate benchmark osu_mbw_mr to see if its results are improved by MXM.

After looking at the MPI performance results published at your URL (e.g. latencies around 1 us in native mode), I'm more and more convinced that our results are suboptimal. And after seeing the impact of SR-IOV published at your URL, I suspect more and more that our mediocre latency is caused by this mechanism. But our cluster is different: SR-IOV is not used in the context of virtual machines running under a host VMM; SR-IOV is used with Linux LXC containers.

Martin Audet

> Hi Martin
>
> MXM default transport is UD (MXM_TLS=ud,shm,self), which is scalable when
> running with large applications. RC (MXM_TLS=rc,shm,self) is recommended
> for microbenchmarks and very small scale applications.
>
> Yes, the max seg size setting is too small.
>
> Did you check any message rate benchmarks (like osu_mbw_mr) with MXM?
>
> Virtualization env will have some overhead. See some perf comparison here
> with MVAPICH: http://mvapich.cse.ohio-state.edu/performance/v-pt_to_pt/
___ users mailing list users@lists.open-mpi.org https://rfd.newmexicoconsortium.org/mailman/listinfo/users
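The per-run transport switching discussed above can be done on the mpiexec command line; a minimal sketch, assuming MXM honors the MXM_TLS variable exported with -x as in the quoted reply (benchmark names are illustrative only):

```shell
# Default: scalable UD transport (plus shared memory and self),
# recommended for large-scale applications
mpiexec -x MXM_TLS=ud,shm,self -np 2 ./osu_latency

# RC transport, recommended for microbenchmarks and small scale
mpiexec -x MXM_TLS=rc,shm,self -np 2 ./osu_latency
```

These lines are invocation sketches only; they require a working MPI/MXM installation on the cluster.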
Re: [OMPI users] An equivalent to btl_openib_include_if when MXM over Infiniband ?
Hi Devendar,

Thank you for your answer.

Setting MXM_TLS=rc,shm,self does improve the speed of MXM (both latency and bandwidth):

without MXM_TLS:

  comm     lat_min pingpong (us)   bw_max pingpong (MB/s)   bw_max sendrecv (MB/s)
  openib   1.79                    5827.93                  11552.4
  mxm      2.23                    5191.77                   8201.76
  yalla    2.18                    5200.55                   8109.48

with MXM_TLS=rc,shm,self:

  comm     lat_min pingpong (us)   bw_max pingpong (MB/s)   bw_max sendrecv (MB/s)
  openib   1.79                    6021.83                  11529
  mxm      1.78                    5936.92                  11168.5
  yalla    1.78                    5944.86                  11375

Note 1: MXM_RDMA_PORTS=mlx4_0:1 and the MCA parameter btl_openib_include_if=mlx4_0 were set in both cases.
Note 2: The bandwidths reported are not very accurate. Bandwidth results can easily vary by 7% from one run to another.

We see that the performance of MXM is now very similar to the performance of openib for these IMB tests. However, an error is now reported a few times when MXM_TLS is set:

  sys.c:468  MXM  ERROR A new segment was to be created and size < SHMMIN or size > SHMMAX, or the new segment was to be created. A segment with given key existed, but size is greater than the size of that segment. Please check limits by 'ipcs -l'.

"ipcs -l" reports among other things that:

  max seg size (kbytes) = 32768

By the way, is this too small?

Now if we run /opt/mellanox/mxm/mxm_perftest we get:

                                       without MXM_TLS   with MXM_TLS
  avg send_lat (us)                    1.626             1.321
  avg send_bw -s 400 (MB/s)            5219.51           5514.04
  avg bidir send_bw -s 400 -b (MB/s)   5283.13           5514.45

Note: the -b flag for bidirectional bandwidth doesn't seem to affect the result.

Again, this is an improvement in both latency and bandwidth. However, a warning is reported on the server side when MXM_TLS is set and the send_lat test is run:

  icb_ep.c:287  MXM  WARN  The min value for CIB_RX_QUEUE_LEN is 2048.

Note: setting the undocumented env variable MXM_CIB_RX_QUEUE_LEN=2048 removes the warning but doesn't affect the send latency.
* * *

So now the results are better: MXM performs as well as the regular openib in terms of latency and bandwidth (I didn't check the overlap capacity though). But I'm not really impressed. I was expecting MXM (especially when used by yalla) to be a little better than openib. Also, the latency of openib, mxm and yalla at 1.8 us seems too high. With a configuration like ours, we should get something closer to 1 us. Does anyone have an idea? Don't forget that this cluster uses LXC containers with SR-IOV enabled for the Infiniband adapter.

Martin Audet

> Hi Martin,
>
> Can you check if it is any better with "-x MXM_TLS=rc,shm,self" ?
>
> -Devendar
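The shared-segment limit behind the MXM error above can be inspected (and, as root, raised) with standard Linux tools; a minimal sketch, assuming a Linux node with /proc mounted (the 64 GB value is only an illustration, not a recommendation from this thread):

```shell
# Current System V IPC limits, as the MXM error message suggests checking
command -v ipcs >/dev/null && ipcs -l

# SHMMAX in bytes, straight from the kernel
cat /proc/sys/kernel/shmmax

# As root, raise it for the running kernel (value is an example):
#   sysctl -w kernel.shmmax=68719476736
# and persist it across reboots by adding to /etc/sysctl.conf:
#   kernel.shmmax = 68719476736
```

The read-only commands are safe to run as any user; only the sysctl write requires root.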
Re: [OMPI users] An equivalent to btl_openib_include_if when MXM over Infiniband ?
Hi Josh,

Thanks for your reply. I did try setting MXM_RDMA_PORTS=mlx4_0:1 for all my MPI processes, and it did improve performance, but the performance I obtain still isn't completely satisfying. When I use the IMB 4.1 pingpong and sendrecv benchmarks between two nodes with Open MPI 1.10.3, I get:

without MXM_RDMA_PORTS:

  comm     lat_min pingpong (us)   bw_max pingpong (MB/s)   bw_max sendrecv (MB/s)
  openib   1.79                    5947.07                  11534
  mxm      2.51                    5166.96                   8079.18
  yalla    2.47                    5167.29                   8278.15

with MXM_RDMA_PORTS=mlx4_0:1:

  comm     lat_min pingpong (us)   bw_max pingpong (MB/s)   bw_max sendrecv (MB/s)
  openib   1.79                    5827.93                  11552.4
  mxm      2.23                    5191.77                   8201.76
  yalla    2.18                    5200.55                   8109.48

where:

  openib means: pml=ob1 btl=openib,vader,self btl_openib_include_if=mlx4_0
  mxm    means: pml=cm,ob1 mtl=mxm btl=vader,self
  yalla  means: pml=yalla,ob1 btl=vader,self

lspci reports for our FDR Infiniband HCA:

  Infiniband controller: Mellanox Technologies MT27500 Family [ConnectX-3]

and 16 lines like:

  Infiniband controller: Mellanox Technologies MT27500/MT27520 Family [ConnectX-3/ConnectX-3 Pro Virtual Function]

The nodes use two octacore Xeon E5-2650v2 Ivybridge-EP 2.67 GHz sockets. ofed_info reports that the mxm version is 3.4.3cce223-0.32200.

As you can see, the results are not very good. I would expect mxm and yalla to perform better than openib in both latency and bandwidth (note: sendrecv bandwidth is full duplex). I would expect the yalla latency to be around 1.1 us as shown here (page 33): https://www.open-mpi.org/papers/sc-2014/Open-MPI-SC14-BOF.pdf

I also ran mxm_perftest (located in /opt/mellanox/bin) and it reports the following latency between two nodes:

  without MXM_RDMA_PORTS            1.92 us
  with    MXM_RDMA_PORTS=mlx4_0:1   1.65 us

Again, I think we can expect better latency with our configuration; 1.65 us is not a very good result. Note however that the 0.27 us (1.92 - 1.65 = 0.27) reduction in raw mxm latency corresponds to the reduction in the Open MPI latencies observed above with mxm (2.51 - 2.23 = 0.28) and yalla (2.47 - 2.18 = 0.29).
Another detail: everything is run inside LXC containers, and SR-IOV is probably used.

Does anyone have any idea what's wrong with our cluster?

Martin Audet

> Hi, Martin
>
> The environment variable:
>
>   MXM_RDMA_PORTS=device:port
>
> is what you're looking for. You can specify a device/port pair on your OMPI
> command line like:
>
>   mpirun -np 2 ... -x MXM_RDMA_PORTS=mlx4_0:1 ...
>
> Best,
>
> Josh
[OMPI users] An equivalent to btl_openib_include_if when MXM over Infiniband ?
Hi OMPI_Users && OMPI_Developers,

Is there an equivalent to the MCA parameter btl_openib_include_if when using MXM over Infiniband (e.g. either (pml=cm mtl=mxm) or (pml=yalla))?

I ask this question because I'm working on a cluster where LXC containers are used on compute nodes (with SR-IOV, I think) and multiple mlx4 interfaces are reported by lstopo (e.g. mlx4_0, mlx4_1, ..., mlx4_16) even though a single physical Mellanox ConnectX-3 HCA is present per node.

I found that when I use the plain openib btl (e.g. (pml=ob1 btl=openib)), it is much faster if I specify the MCA parameter btl_openib_include_if=mlx4_0 to force Open MPI to use a single interface. By doing that, the latency is lower and the bandwidth higher. I guess this is because otherwise Open MPI gets confused trying to use all the "virtual" interfaces at once.

However, we all know that MXM is better than plain openib since it allows the HCA to perform message matching, transfer messages in the background and provide communication progress. So in this case, is there a way to use only mlx4_0? I mean when using the mxm mtl (pml=cm mtl=mxm) or, preferably, using it more directly via the yalla pml (pml=yalla).

Note: I'm using Open MPI 1.10.3, which I compiled myself, for now, but I could use Open MPI 2.0 instead if necessary.

Thanks,

Martin Audet
[OMPI users] Ability to overlap communication and computation on Infiniband
Hi OMPI_Users and OMPI_Developers,

I would like someone to verify whether my understanding is correct concerning Open MPI's ability to overlap communication and computation on Infiniband when using the non-blocking MPI_Isend() and MPI_Irecv() functions (i.e. the computation is done between the non-blocking MPI_Isend() on the sender or MPI_Irecv() on the receiver and the corresponding MPI_Wait()).

After reading the following FAQ entries:

  https://www.open-mpi.org/faq/?category=openfabrics#large-message-tuning-1.2
  https://www.open-mpi.org/faq/?category=openfabrics#large-message-tuning-1.3

and the paper:

  https://www.open-mpi.org/papers/euro-pvmmpi-2006-hpc-protocols/

about the algorithm used on OpenFabrics to send large messages, my understanding is that:

1- When the "RDMA Direct" message protocol is used, the communication is done by an RDMA read on the receiver side. So if the receiver calls MPI_Irecv() after it has received a matching message envelope (tag, communicator) from the sender, the receiver can start the RDMA read, let the Infiniband HCA operate, and return from MPI_Irecv() to let the receiving process compute. Then the next time the MPI library is called on the receiver side (or maybe in the corresponding MPI_Wait() call), the receiver sends a short ACK message to the sender to tell it that the receive is complete and it is now free to do whatever it wants with the send buffer. When things happen this way (e.g. the sender's envelope is received before MPI_Irecv() is called on the receiver side), there is great overlap potential on both the receiver and sender sides (because the sender's MPI_Isend() only has to send the envelope eagerly and its MPI_Wait() waits for the ACK). However, when the receiver calls MPI_Irecv() before the sender's envelope is received, the RDMA read transfer cannot start until the envelope is received and the MPI library realizes it can start the RDMA read.
If the receiver only realizes this in the corresponding MPI_Wait(), there will be no overlap on the receiver side. The overlap potential is still good on the sender side, for the same reason as in the previous case.

2- When the "RDMA Pipeline" protocol is used, both the sender and receiver sides have to actively cooperate to transfer data using multiple Infiniband send/receives and RDMA writes. On the receiver side, as the article says, the "protocol effectively overlaps the cost of registration/deregistration with RDMA writes". This allows communication to overlap with registration overhead on the receiver side, but not with computation. On the sender side, I don't see how overlap with computation could be possible either. In practice, when this protocol is used between a pair of MPI_Isend() and MPI_Irecv() calls, I fear that all the communication will happen when the sender and receiver reach their corresponding MPI_Wait() calls (which means no overlap).

So if someone could tell me whether this is correct or not, I would appreciate it greatly.

I guess that the two protocols above correspond to the basic BTL/openib framework/component. When a more modern MTL/mxm or PML/yalla framework/component is used, I hope things are different and result in more communication/computation overlap potential.

Thanks in advance,

Martin Audet
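For reference, the pattern under discussion looks like the following minimal sketch (plain MPI C, assuming two ranks; whether the compute loop actually overlaps the transfer depends entirely on the protocol and progression issues described above, so the MPI_Test() calls are there to give the library explicit chances to progress):

```c
#include <stdlib.h>
#include <mpi.h>

#define N (1 << 22)   /* 4 Mi doubles: large enough to take the rendezvous path */

int main(int argc, char **argv)
{
    int rank, size;
    double *buf;
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) { MPI_Finalize(); return EXIT_SUCCESS; }

    buf = malloc(N * sizeof(double));

    if (rank == 0) {
        MPI_Isend(buf, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
    } else if (rank == 1) {
        MPI_Irecv(buf, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &req);
    }

    if (rank <= 1) {
        /* Computation we hope overlaps the transfer.  Calling MPI_Test()
         * inside the loop gives the library a chance to progress the
         * protocol (send the ACK, start the next pipeline stage, ...). */
        int done = 0;
        for (int i = 0; i < 1000 && !done; ++i) {
            /* ... compute on data unrelated to buf ... */
            MPI_Test(&req, &done, MPI_STATUS_IGNORE);
        }
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    free(buf);
    return EXIT_SUCCESS;
}
```

This sketch requires an MPI installation (mpicc/mpiexec) to build and run; it is not tied to any particular BTL or MTL.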
[OMPI users] Experience with MXM, yalla, FCA and HCOLL with Mellanox HCA ?
Hi Open MPI Users and Developers,

I would like to know your experience with the optional middleware and the corresponding Open MPI frameworks/components for recent Mellanox Infiniband HCAs, especially concerning MXM and FCA (the latest versions bring HCOLL, I think) and the related Open MPI frameworks/components such as MTL/mxm, PML/yalla, COLL/fca and COLL/hcoll.

Does MXM, when used with MTL/mxm or PML/yalla, really improve communication speed over the plain BTL/openib? Especially since MXM allows matching message tags, I suppose that in addition to improving the usual latency/bandwidth metrics a little, it would increase the communication/computation overlap potential when used with non-blocking MPI calls, since the adapter is more autonomous. I remember that with old Myrinet networks, the matching MX middleware was way better for our application than the earlier non-matching GM middleware. I guess it is the same thing now with Infiniband / OpenFabrics networks: matching middleware should be better.

Also, concerning FCA and HCOLL, do they really improve the speed of the collective operations? From the Mellanox documentation I saw that they are supposed to use hardware broadcast and take the topology into account to favor the faster connections between processes located on the same node. I also saw in these documents that recent versions of FCA are able to perform the reduction operations on the HCA itself, even the floating point ones. This should greatly improve the speed of MPI_Allreduce() in our codes!

So for those lucky enough to have access to a recent, well-configured Mellanox Infiniband cluster with recent middleware and an Open MPI library well configured to take advantage of it: does it deliver on its promises?

The only documentation/reports I could find on the Internet on these subjects are from Mellanox, in addition to this for PML/yalla and MTL/mxm (slide 32):

  https://www.open-mpi.org/papers/sc-2014/Open-MPI-SC14-BOF.pdf

Thanks in advance,

Martin Audet
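For context, the collective components mentioned above are switched on via MCA parameters on the command line; a hedged sketch, with parameter and variable names taken from Mellanox/Open MPI documentation of that era (please verify them against ompi_info on your own installation before relying on them):

```shell
# Enable HCOLL collectives (assumed parameter/variable names;
# check with: ompi_info --param coll hcoll)
mpirun -np 64 --mca coll_hcoll_enable 1 -x HCOLL_MAIN_IB=mlx4_0:1 ./a.out

# Or the older FCA collective component
mpirun -np 64 --mca coll_fca_enable 1 ./a.out
```

These are invocation sketches only; they require a Mellanox OFED installation providing hcoll/fca.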
Re: [OMPI users] Avoiding the memory registration costs by having memory always registered, is it possible with Linux ?
Thanks Jeff and Alex for your answers and comments.

mlockall(), especially with the MCL_FUTURE argument, is indeed interesting. Thanks Jeff for your clarification of what memory registration really means (e.g. locking the memory and telling the network stack the virtual-to-physical mapping).

Also, concerning the ummunotify kernel module, I would like to point out that while the link to the GitHub bug report suggests it is problematic, the top-level Open MPI README file still recommends it. Should the README file be updated?

Regards,

Martin Audet
Re: [OMPI users] Avoiding the memory registration costs by having memory always registered, is it possible with Linux ?
Thanks Jeff for your answer.

It is sad that the approach I mentioned, having all memory registered for user processes on cluster nodes, didn't become more popular. I still believe that such an approach would shorten the executed code path in MPI libraries, reduce message latency, increase the communication/computation overlap potential and allow communication to progress more naturally.

But since we have to live with memory registration issues, what changes should be made to a standard Linux distro so that Open MPI can best use a recent Mellanox Infiniband network? I guess that installing the ummunotify kernel module is a good idea? Maybe removing the limits on the "max locked memory" (ulimit -l) is also good? Besides that, I guess that installing the latest OFED (to have the latest middleware) instead of using the default one coming with the Linux distro is a good idea? Also, is the XPMEM kernel module worth installing for more efficient intra-node transfers of large messages, now that kernels include the CMA API?

Thanks,

Martin Audet
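The locked-memory limit mentioned above can be checked from any shell; a minimal sketch, assuming a pam_limits-based Linux setup (the limits.conf lines are the conventional way to lift the limit cluster-wide, but verify against your distro's documentation):

```shell
# Current per-process limit on locked memory, in kbytes ("unlimited" is
# what MPI over Infiniband generally wants)
ulimit -l

# To remove the limit for all users, the usual approach is to add these
# lines to /etc/security/limits.conf (requires root, takes effect at
# next login):
#   * soft memlock unlimited
#   * hard memlock unlimited
```

Note that batch schedulers often start jobs outside the PAM login path, so the limit may also need to be raised in the scheduler's own configuration.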
[OMPI users] Avoiding the memory registration costs by having memory always registered, is it possible with Linux ?
Hi,

After reading a little of the FAQ on the methods used by Open MPI to deal with memory registration (or pinning) on Infiniband adapters, it seems that we could avoid all the overhead and complexity of memory registration/deregistration, registration cache access and update, and memory management (ummunotify), in addition to allowing better overlap of communication with computation (we could let the communication hardware do its job independently, without resorting to registration/transfer/deregistration pipelines), by simply having all user process memory registered all the time.

Of course, a configuration like that is not appropriate in a general setting (e.g. a desktop environment), as it would make swapping almost impossible. But in the context of an HPC node, where the processes are not supposed to swap and the OS does not overcommit memory, not being able to swap doesn't appear to be a problem. Moreover, since the maximal total memory used per process is often predefined at application start as a resource specified to the queuing system, the OS could easily keep a defined amount of extra memory for its own needs instead of swapping out user process memory. I guess that specialized (non-Linux) compute node OSes do this. But is it possible, and does it make sense, with Linux?

Thanks,

Martin Audet
Re: [OMPI users] MPI_Comm_accept() / MPI_Comm_connect() fail between two different machines
Yes, this patch applied over Open MPI 1.8.6 solves my problem.

Attached are the new output files for the server and the client when started with "--mca oob_base_verbose 100". Will this patch be included in 1.8.7?

Thanks again,

Martin Audet

From: users [users-boun...@open-mpi.org] On Behalf Of Ralph Castain [r...@open-mpi.org]
Sent: Tuesday, July 14, 2015 11:10 AM
To: Open MPI Users
Subject: Re: [OMPI users] MPI_Comm_accept() / MPI_Comm_connect() fail between two different machines

This seems to fix the problem when using your example on my cluster - please let me know if it solves things for you

Attachments: server_out2.txt.bz2, client_out2.txt.bz2
Re: [OMPI users] MPI_Comm_accept() / MPI_Comm_connect() fail between two different machines
I will happily test any patch you send me to fix this problem.

Thanks,

Martin

-----Original Message-----
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: July 13, 2015 22:55
To: Open MPI Users
Subject: Re: [OMPI users] MPI_Comm_accept() / MPI_Comm_connect() fail between two different machines

I see the problem - it's a race condition, actually. I'll try to provide a patch for you to test, if you don't mind.

> On Jul 13, 2015, at 3:03 PM, Audet, Martin wrote:
>
> Thanks Ralph for this quick response.
>
> In the two attachments you will find the output I got when running the
> following commands:
>
> [audet@fn1 mpi]$ mpiexec --mca oob_base_verbose 100 -n 1 ./simpleserver 2>&1 | tee server_out.txt
>
> [audet@linux15 mpi]$ mpiexec --mca oob_base_verbose 100 -n 1 ./simpleclient '227264.0;tcp://172.17.15.20:56377+227265.0;tcp://172.17.15.20:34776:300' 2>&1 | tee client_out.txt
>
> Martin
>
> From: users [users-boun...@open-mpi.org] On Behalf Of Ralph Castain [r...@open-mpi.org]
> Sent: Monday, July 13, 2015 5:29 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] MPI_Comm_accept() / MPI_Comm_connect() fail between two different machines
>
> Try running it with "-mca oob_base_verbose 100" on both client and server -
> it will tell us why the connection was refused.
>
>> On Jul 13, 2015, at 2:14 PM, Audet, Martin wrote:
>>
>> Hi OMPI_Developers,
>>
>> It seems that I am unable to establish an MPI communication between two
>> independently started MPI programs using the simplest client/server call
>> sequence I can imagine (see the two attached files) when the client and
>> server processes are started on different machines. Note that I have no
>> problems when the client and server programs run on the same machine.
>>
>> For example, if I do the following on the server machine (running on fn1):
>>
>> [audet@fn1 mpi]$ mpicc -Wall simpleserver.c -o simpleserver
>> [audet@fn1 mpi]$ mpiexec -n 1 ./simpleserver
>> Server port = '3054370816.0;tcp://172.17.15.20:54458+3054370817.0;tcp://172.17.15.20:58943:300'
>>
>> The server prints its port (created with MPI_Open_port()) and waits for a
>> connection by calling MPI_Comm_accept().
>>
>> Now on the client machine (running on linux15), if I compile the client and
>> run it with the above port address on the command line, I get:
>>
>> [audet@linux15 mpi]$ mpicc -Wall simpleclient.c -o simpleclient
>> [audet@linux15 mpi]$ mpiexec -n 1 ./simpleclient '3054370816.0;tcp://172.17.15.20:54458+3054370817.0;tcp://172.17.15.20:58943:300'
>> trying to connect...
>>
>> A process or daemon was unable to complete a TCP connection to
>> another process:
>>   Local host:  linux15
>>   Remote host: linux15
>> This is usually caused by a firewall on the remote host. Please check
>> that any firewall (e.g., iptables) has been disabled and try again.
>>
>> [linux15:24193] [[13075,0],0]-[[46606,0],0] mca_oob_tcp_peer_send_handler: invalid connection state (6) on socket 16
>>
>> And then I have to stop the client program by pressing ^C (the server
>> doesn't seem affected).
>>
>> What's wrong?
>>
>> And I am almost sure there is no firewall running on linux15.
>>
>> It is not the first MPI client/server application I have developed (with
>> both Open MPI and MPICH).
>> These simple MPI client/server programs work well with MPICH (version 3.1.3).
>>
>> This problem happens with both Open MPI 1.8.3 and 1.8.6.
>>
>> linux15 and fn1 both run Fedora Core 12 Linux (64 bits) and are connected
>> by Gigabit Ethernet (the normal network).
>>
>> And again, if the client and server run on the same machine (either fn1 or
>> linux15), no such problem happens.
>>
>> Thanks in advance,
>>
>> Martin Audet

___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: http://www.open-mpi.org/community/lists/users/2015/07/27274.php
Re: [OMPI users] MPI_Comm_accept() / MPI_Comm_connect() fail between two different machines
Thanks Ralph for this quick response.

In the two attachments you will find the output I got when running the following commands:

[audet@fn1 mpi]$ mpiexec --mca oob_base_verbose 100 -n 1 ./simpleserver 2>&1 | tee server_out.txt

[audet@linux15 mpi]$ mpiexec --mca oob_base_verbose 100 -n 1 ./simpleclient '227264.0;tcp://172.17.15.20:56377+227265.0;tcp://172.17.15.20:34776:300' 2>&1 | tee client_out.txt

Martin

From: users [users-boun...@open-mpi.org] On Behalf Of Ralph Castain [r...@open-mpi.org]
Sent: Monday, July 13, 2015 5:29 PM
To: Open MPI Users
Subject: Re: [OMPI users] MPI_Comm_accept() / MPI_Comm_connect() fail between two different machines

Try running it with "--mca oob_base_verbose 100" on both client and server - it will tell us why the connection was refused.

> On Jul 13, 2015, at 2:14 PM, Audet, Martin wrote:
>
> Hi OMPI_Developers,
>
> It seems that I am unable to establish an MPI communication between two
> independently started MPI programs using the simplest client/server call
> sequence I can imagine (see the two attached files) when the client and
> server processes are started on different machines. Note that I have no
> problems when the client and server programs run on the same machine.
>
> For example, if I do the following on the server machine (running on fn1):
>
> [audet@fn1 mpi]$ mpicc -Wall simpleserver.c -o simpleserver
> [audet@fn1 mpi]$ mpiexec -n 1 ./simpleserver
> Server port = '3054370816.0;tcp://172.17.15.20:54458+3054370817.0;tcp://172.17.15.20:58943:300'
>
> The server prints its port (created with MPI_Open_port()) and waits for a
> connection by calling MPI_Comm_accept().
>
> Now on the client machine (running on linux15), if I compile the client and
> run it with the above port address on the command line, I get:
>
> [audet@linux15 mpi]$ mpicc -Wall simpleclient.c -o simpleclient
> [audet@linux15 mpi]$ mpiexec -n 1 ./simpleclient '3054370816.0;tcp://172.17.15.20:54458+3054370817.0;tcp://172.17.15.20:58943:300'
> trying to connect...
>
> A process or daemon was unable to complete a TCP connection to
> another process:
>   Local host:  linux15
>   Remote host: linux15
> This is usually caused by a firewall on the remote host. Please check
> that any firewall (e.g., iptables) has been disabled and try again.
>
> [linux15:24193] [[13075,0],0]-[[46606,0],0] mca_oob_tcp_peer_send_handler: invalid connection state (6) on socket 16
>
> And then I have to stop the client program by pressing ^C (the server
> doesn't seem affected).
>
> What's wrong?
>
> And I am almost sure there is no firewall running on linux15.
>
> It is not the first MPI client/server application I have developed (with
> both Open MPI and MPICH).
> These simple MPI client/server programs work well with MPICH (version 3.1.3).
>
> This problem happens with both Open MPI 1.8.3 and 1.8.6.
>
> linux15 and fn1 both run Fedora Core 12 Linux (64 bits) and are connected
> by Gigabit Ethernet (the normal network).
>
> And again, if the client and server run on the same machine (either fn1 or
> linux15), no such problem happens.
>
> Thanks in advance,
>
> Martin Audet

Attached server output (server_out.txt):

[fn1:07315] mca: base: components_register: registering oob components
[fn1:07315] mca: base: components_register: found loaded component tcp
[fn1:07315] mca: base: components_register: component tcp register function successful
[fn1:07315] mca: base: components_open: opening oob components
[fn1:07315] mca: base: components_open: found loaded component tcp
[fn1:07315] mca: base: components_open: component tcp open function successful
[fn1:07315] mca:oob:select: checking available component tcp
[fn1:07315] mca:oob:select: Querying component [tcp]
[fn1:07315] oob:tcp: component_available called
[fn1:07315] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
[fn1:07315] [[37299,0],0] oob:tcp:init rejecting loopback interface lo
[fn1:07315] WORKING INTERFACE 2 KERNEL INDEX 2 FAMILY: V4
[fn1:07315] [[37299,0],0] oob:tcp:init adding 172.17.15.20 to our list of V4 connections
[fn1:07315] [[37299,0],0] TCP STARTUP [
[OMPI users] MPI_Comm_accept() / MPI_Comm_connect() fail between two different machines
Hi OMPI_Developers,

It seems that I am unable to establish an MPI communication between two independently started MPI programs using the simplest client/server call sequence I can imagine (see the two attached files) when the client and server processes are started on different machines. Note that I have no problems when the client and server programs run on the same machine.

For example, if I do the following on the server machine (running on fn1):

[audet@fn1 mpi]$ mpicc -Wall simpleserver.c -o simpleserver
[audet@fn1 mpi]$ mpiexec -n 1 ./simpleserver
Server port = '3054370816.0;tcp://172.17.15.20:54458+3054370817.0;tcp://172.17.15.20:58943:300'

The server prints its port (created with MPI_Open_port()) and waits for a connection by calling MPI_Comm_accept().

Now on the client machine (running on linux15), if I compile the client and run it with the above port address on the command line, I get:

[audet@linux15 mpi]$ mpicc -Wall simpleclient.c -o simpleclient
[audet@linux15 mpi]$ mpiexec -n 1 ./simpleclient '3054370816.0;tcp://172.17.15.20:54458+3054370817.0;tcp://172.17.15.20:58943:300'
trying to connect...

A process or daemon was unable to complete a TCP connection to another process:
  Local host:  linux15
  Remote host: linux15
This is usually caused by a firewall on the remote host. Please check that any firewall (e.g., iptables) has been disabled and try again.

[linux15:24193] [[13075,0],0]-[[46606,0],0] mca_oob_tcp_peer_send_handler: invalid connection state (6) on socket 16

And then I have to stop the client program by pressing ^C (the server doesn't seem affected).

What's wrong?

And I am almost sure there is no firewall running on linux15.

It is not the first MPI client/server application I have developed (with both Open MPI and MPICH). These simple MPI client/server programs work well with MPICH (version 3.1.3).
This problem happens with both Open MPI 1.8.3 and 1.8.6.

linux15 and fn1 both run Fedora Core 12 Linux (64 bits) and are connected by Gigabit Ethernet (the normal network).

And again, if the client and server run on the same machine (either fn1 or linux15), no such problem happens.

Thanks in advance,

Martin Audet

simpleserver.c (the #include lines below are reconstructed; the archive stripped the angle-bracketed header names):

  #include <stdio.h>
  #include <stdlib.h>
  #include <mpi.h>

  int main(int argc, char **argv)
  {
     int comm_rank;
     char port_name[MPI_MAX_PORT_NAME];
     MPI_Comm intercomm;
     int ok_flag;

     MPI_Init(&argc, &argv);
     MPI_Comm_rank(MPI_COMM_WORLD, &comm_rank);

     ok_flag = (comm_rank != 0) || (argc == 1);
     MPI_Bcast(&ok_flag, 1, MPI_INT, 0, MPI_COMM_WORLD);

     if (!ok_flag) {
        if (comm_rank == 0) {
           fprintf(stderr, "Usage: %s\n", argv[0]);
        }
        MPI_Abort(MPI_COMM_WORLD, 1);
     }

     MPI_Open_port(MPI_INFO_NULL, port_name);
     if (comm_rank == 0) {
        printf("Server port = '%s'\n", port_name);
     }
     MPI_Comm_accept(port_name, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &intercomm);
     MPI_Close_port(port_name);
     if (comm_rank == 0) {
        printf("MPI_Comm_accept() successful...\n");
     }
     MPI_Comm_disconnect(&intercomm);
     MPI_Finalize();
     return EXIT_SUCCESS;
  }

simpleclient.c (includes likewise reconstructed; unistd.h is needed for sleep()):

  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>
  #include <mpi.h>

  int main(int argc, char **argv)
  {
     int comm_rank;
     int ok_flag;
     MPI_Comm intercomm;

     MPI_Init(&argc, &argv);
     MPI_Comm_rank(MPI_COMM_WORLD, &comm_rank);

     ok_flag = (comm_rank != 0) || ((argc == 2) && argv[1] && (*argv[1] != '\0'));
     MPI_Bcast(&ok_flag, 1, MPI_INT, 0, MPI_COMM_WORLD);

     if (!ok_flag) {
        if (comm_rank == 0) {
           fprintf(stderr, "Usage: %s mpi_port\n", argv[0]);
        }
        MPI_Abort(MPI_COMM_WORLD, 1);
     }

     if (comm_rank == 0) {
        printf("trying to connect...\n");
     }
     while (MPI_Comm_connect((comm_rank == 0) ? argv[1] : 0, MPI_INFO_NULL, 0,
                             MPI_COMM_WORLD, &intercomm) != MPI_SUCCESS) {
        if (comm_rank == 0) {
           printf("MPI_Comm_connect() failed, sleeping and retrying...\n");
        }
        sleep(1);
     }
     if (comm_rank == 0) {
        printf("MPI_Comm_connect() successful...\n");
     }
     MPI_Comm_disconnect(&intercomm);
     MPI_Finalize();
     return EXIT_SUCCESS;
  }
Re: [OMPI users] Unable to connect to a server using MX MTL with TCP
Thanks to both Scott and Jeff!

Next time I have a problem, I will check the README file first (Doh!). Also, we might mitigate the problem by connecting the workstation to the Myrinet switch.

Martin

-----Original Message-----
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Jeff Squyres
Sent: June 9, 2010 15:34
To: Open MPI Users
Subject: Re: [OMPI users] Unable to connect to a server using MX MTL with TCP

On Jun 5, 2010, at 7:52 AM, Scott Atchley wrote:

> I do not think this is a supported scenario. George or Jeff can correct me,
> but when you use the MX MTL you are using the pml cm and not the pml ob1.
> The BTLs are part of ob1. When using the MX MTL, it cannot use the TCP BTL.
>
> Your only solution would be to use the MX BTL.

Sorry for the delayed reply.

Scott is correct; the MX MTL uses the "cm" PML. The "cm" PML can only use *one* MTL at a time (little known fact of Open MPI lore: "cm" stands for several things, one of which is "Connor MacLeod" -- there can only be one).

Here's a chunk of text from the README:

- There are three MPI network models available: "ob1", "csum", and "cm".
  "ob1" and "csum" use BTL ("Byte Transfer Layer") components for each
  supported network. "cm" uses MTL ("Matching Transport Layer") components
  for each supported network.

- "ob1" supports a variety of networks that can be used in combination
  with each other (per OS constraints; e.g., there are reports that the
  GM and OpenFabrics kernel drivers do not operate well together):
  - OpenFabrics: InfiniBand and iWARP
  - Loopback (send-to-self)
  - Myrinet: GM and MX (including Open-MX)
  - Portals
  - Quadrics Elan
  - Shared memory
  - TCP
  - SCTP
  - uDAPL

- "csum" is exactly the same as "ob1", except that it performs additional
  data integrity checks to ensure that the received data is intact (vs.
  trusting the underlying network to deliver the data correctly). csum
  supports all the same networks as ob1, but there is a performance
  penalty for the additional integrity checks.
- "cm" supports a smaller number of networks (and they cannot be used together), but may provide better overall MPI performance:
  - Myrinet MX (including Open-MX, but not GM)
  - InfiniPath PSM
  - Portals

Open MPI will, by default, choose to use "cm" when the InfiniPath PSM MTL can be used. Otherwise, "ob1" will be used and the corresponding BTLs will be selected. "csum" will never be selected by default. Users can force the use of ob1 or cm if desired by setting the "pml" MCA parameter at run-time:

shell$ mpirun --mca pml ob1 ...
or
shell$ mpirun --mca pml csum ...
or
shell$ mpirun --mca pml cm ...

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
[OMPI users] RE : Unable to connect to a server using MX MTL with TCP
Sorry, I forgot the attachments... Martin

From: users-boun...@open-mpi.org [users-boun...@open-mpi.org] on behalf of Audet, Martin [martin.au...@imi.cnrc-nrc.gc.ca]
Sent: June 4, 2010 19:18
To: us...@open-mpi.org
Subject: [OMPI users] Unable to connect to a server using MX MTL with TCP
[OMPI users] Unable to connect to a server using MX MTL with TCP
Hi OpenMPI_Users and OpenMPI_Developers,

I'm unable to connect a client application using MPI_Comm_connect() to a server job (the server job calls MPI_Open_port() before calling MPI_Comm_accept()) when the server job uses the MX MTL (although it works without problems when the server uses the MX BTL).

The server job runs on a cluster connected to a Myrinet 10G network (MX 1.2.11) in addition to an ordinary Ethernet network. The client runs on a different machine, not connected to the Myrinet network but accessible via the Ethernet network.

Attached to this message are the simple server and client programs (87 lines total) called simpleserver.c and simpleclient.c. Note we are using OpenMPI 1.4.2 on x86_64 Linux (server: Fedora 7, client: Fedora 12).

Compiling these programs with mpicc on the server front node (fn1) and client workstation (linux15) works well:

[audet@fn1 bench]$ mpicc simpleserver.c -o simpleserver
[audet@linux15 mpi]$ mpicc simpleclient.c -o simpleclient

Then we start the server on the cluster (the job is started on cluster node cn18) and ask it to use the MTL:

[audet@fn1 bench]$ mpiexec -x MX_RCACHE=2 -machinefile machinefile_cn18 --mca mtl mx --mca pml cm -n 1 ./simpleserver

It prints the server port (note we use MX_RCACHE=2 to avoid a warning, but it doesn't affect the current issue):

Server port = '3548905472.0;tcp://172.17.15.20:39517+3548905473.0;tcp://172.17.10.18:47427:300'

Then we start the client on the workstation with this port number:

[audet@linux15 mpi]$ mpiexec -n 1 ./simpleclient '3548905472.0;tcp://172.17.15.20:39517+3548905473.0;tcp://172.17.10.18:47427:300'

The server process core dumps as follows:

MPI_Comm_accept() sucessful...
[cn18:24582] *** Process received signal ***
[cn18:24582] Signal: Segmentation fault (11)
[cn18:24582] Signal code: Address not mapped (1)
[cn18:24582] Failing at address: 0x38
[cn18:24582] [ 0] /lib64/libpthread.so.0 [0x305de0dd20]
[cn18:24582] [ 1] /usr/local/openmpi-1.4.2/lib/openmpi/mca_mtl_mx.so [0x2d6a7e6d]
[cn18:24582] [ 2] /usr/local/openmpi-1.4.2/lib/openmpi/mca_pml_cm.so [0x2d4a319d]
[cn18:24582] [ 3] /usr/local/openmpi/lib/libmpi.so.0(ompi_dpm_base_disconnect_init+0xbf) [0x2ab1403f]
[cn18:24582] [ 4] /usr/local/openmpi-1.4.2/lib/openmpi/mca_dpm_orte.so [0x2ed0eb19]
[cn18:24582] [ 5] /usr/local/openmpi/lib/libmpi.so.0(PMPI_Comm_disconnect+0xa0) [0x2aaf4f20]
[cn18:24582] [ 6] ./simpleserver(main+0x14c) [0x400d04]
[cn18:24582] [ 7] /lib64/libc.so.6(__libc_start_main+0xf4) [0x305ce1daa4]
[cn18:24582] [ 8] ./simpleserver [0x400b09]
[cn18:24582] *** End of error message ***
--
mpiexec noticed that process rank 0 with PID 24582 on node cn18 exited on signal 11 (Segmentation fault).
--
[audet@fn1 bench]$

And the client stops with the following error message:

--
At least one pair of MPI processes are unable to reach each other for MPI communications. This means that no Open MPI device has indicated that it can be used to communicate between these processes. This is an error; Open MPI requires that all MPI processes be able to reach each other. This error can sometimes be the result of forgetting to specify the "self" BTL.

Process 1 ([[31386,1],0]) is on host: linux15
Process 2 ([[54152,1],0]) is on host: cn18
BTLs attempted: self sm tcp

Your MPI job is now going to abort; sorry.
--
MPI_Comm_connect() sucessful...
Error in comm_disconnect_waitall
[audet@linux15 mpi]$

I really don't understand this message because the client can connect to the server using TCP over Ethernet. Moreover if I add MCA options when I start the server to include the TCP BTL, the same problem happens (the argument list then becomes: '--mca mtl mx --mca pml cm --mca btl tcp,shared,self').
However if I remove all MCA options when I start the server (i.e. when the MX BTL is used), no such problem appears. Everything also goes fine if I start the server with an explicit request to use the MX and TCP BTLs (e.g. with options '--mca btl mx,tcp,sm,self').

For running our server application we really prefer to use the MX MTL over the MX BTL since it is much faster with the MTL (although the usual ping pong test is only slightly faster with the MTL).

Also enclosed is the output of ompi_info --all run on the cluster node (cn18) and the workstation (linux15).

Please help me. I think my problem is only a question of wrong MCA parameters (which are obscure for me).

Thanks,

Martin Audet, Research Officer
Industrial Material Institute
National Research Council of Canada
75 de Mortagne, Boucherville, QC, J4B 6Y4, Canada
Re: [OMPI users] Memchecker report on v1.3b2 (includes potential bug reports)
4) Well, this sounds reasonable, but according to the MPI-1 standard (see page 40 for non-blocking send/recv, and a more detailed explanation on page 30): "A nonblocking send call indicates that the system may start copying data out of the send buffer. The sender should */not access*/ any part of the send buffer after a nonblocking send operation is called, until the send completes." So before calling MPI_Wait to complete an isend operation, any access to the send buffer is illegal. It might be a little strict, but we have to do what the standard says.

>>
>> This has been changed in the new version of the MPI standard (2.1).
>> There is no restriction anymore regarding the read operations on the
>> buffers used for non-blocking sends.

>Do you mean the next coming version of the MPI standard? Because checking
>again standard 2.1, I didn't see any changes in those paragraphs. See
>MPI Standard 2.1 (PDF version), page 52, and page 41.

The (non-modifying) access to a send buffer was agreed for MPI Standard 2.2, not version 2.1; see the MPI 2.2 Wiki:

https://svn.mpi-forum.org/trac/mpi-forum-web/wiki/MpiTwoTwoWikiPage
https://svn.mpi-forum.org/trac/mpi-forum-web/ticket/45

Martin
Re: [OMPI users] Memory question and possible bug in 64bit addressing under Leopard!
This has nothing to do with the segmentation fault you got, but in addition to Brian's comment, I would point out that with ISO C++ (the C++98 standard and the upcoming C++0x) a constant expression known at compile time is needed for the dimensions of local arrays. In other words, a construct like:

int n = 1000;
float X[n];

isn't standard compliant because n isn't a constant expression. It compiles only because it is a g++ extension (try this with Visual C++ for example). A construct like:

const int n = 1000;
float X[n];

however is standard compliant since n is a constant expression known at compile time. Variable length arrays would allow setting the dimensions of local arrays using any integral expression (whether or not it is constant or known at compile time). This feature was added to the ISO C language in the C99 standard, but not to C++.

Martin

-----Original Message-----
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Brian Barrett
Sent: April 25, 2008 16:11
To: Open MPI Users
Subject: Re: [OMPI users] Memory question and possible bug in 64bit addressing under Leopard!

On Apr 25, 2008, at 2:06 PM, Gregory John Orris wrote:

> produces a core dump on a machine with 12Gb of RAM.
>
> and the error message
>
> mpiexec noticed that job rank 0 with PID 75545 on node mymachine.com
> exited on signal 4 (Illegal instruction).
>
> However, substituting in
>
> float *X = new float[n];
> for
> float X[n];
>
> Succeeds!

You're running off the end of the stack, because of the large amount of data you're trying to put there. OS X by default has a tiny stack size, so codes that run on Linux (which defaults to a much larger stack size) sometimes show this problem. Your best bets are either to increase the max stack size or (more portably) just allocate everything on the heap with malloc/new.
Hope this helps,

Brian

--
Brian Barrett
Open MPI developer
http://www.open-mpi.org/
[OMPI users] Problem with MPI_Scatter() on inter-communicator...
Hi,

I don't know if it is my sample code or if it is a problem with MPI_Scatter() on an inter-communicator (maybe similar to the problem we found with MPI_Allgather() on an inter-communicator a few weeks ago), but a simple program I wrote freezes during the second iteration of a loop doing an MPI_Scatter() over an inter-communicator.

For example if I compile it as follows:

mpicc -Wall scatter_bug.c -o scatter_bug

I get no error or warning. Then if I start it with np=2 as follows:

mpiexec -n 2 ./scatter_bug

it prints:

beginning Scatter i_root_group=0
ending Scatter i_root_group=0
beginning Scatter i_root_group=1

and then hangs... Note also that if I change the for loop to execute only the MPI_Scatter() of the second iteration (e.g. replacing "i_root_group=0;" by "i_root_group=1;"), it prints:

beginning Scatter i_root_group=1

and then hangs... The problem therefore seems to be related to the second iteration itself.

Please note that this program runs fine with mpich2 1.0.7rc2 (ch3:sock device) for many different numbers of processes (np), whether the executable is run with or without valgrind.

The OpenMPI version I use is 1.2.6rc3 and was configured as follows:

./configure --prefix=/usr/local/openmpi-1.2.6rc3 --disable-mpi-f77 --disable-mpi-f90 --disable-mpi-cxx --disable-cxx-exceptions --with-io-romio-flags=--with-file-system=ufs+nfs

Note also that all processes (when using OpenMPI or mpich2) were started on the same machine.

Also if you look at the source code, you will notice that some arguments to MPI_Scatter() are NULL or 0. This may look strange and problematic when using a normal intra-communicator. However according to the book "MPI - The complete reference" vol 2 about MPI-2, for MPI_Scatter() with an inter-communicator: "The sendbuf, sendcount and sendtype arguments are significant only at the root process. The recvbuf, recvcount, and recvtype arguments are significant only at the processes of the leaf group."
If anyone else could have a look at this program and try it, it would be helpful.

Thanks,

Martin

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
   int ret_code = 0;
   int comm_size, comm_rank;
   MPI_Init(&argc, &argv);
   MPI_Comm_size(MPI_COMM_WORLD, &comm_size);
   MPI_Comm_rank(MPI_COMM_WORLD, &comm_rank);
   if (comm_size > 1) {
      MPI_Comm subcomm, intercomm;
      const int group_id = comm_rank % 2;
      int i_root_group;
      /* split process in two groups: even and odd comm_ranks. */
      MPI_Comm_split(MPI_COMM_WORLD, group_id, 0, &subcomm);
      /* The remote leader comm_rank for even and odd groups are respectively: 1 and 0 */
      MPI_Intercomm_create(subcomm, 0, MPI_COMM_WORLD, 1-group_id, 0, &intercomm);
      /* for i_root_group==0 process with comm_rank==0 scatter data to all process with odd comm_rank */
      /* for i_root_group==1 process with comm_rank==1 scatter data to all process with even comm_rank */
      for (i_root_group=0; i_root_group < 2; i_root_group++) {
         if (comm_rank == 0) {
            printf("beginning Scatter i_root_group=%d\n", i_root_group);
         }
         if (group_id == i_root_group) {
            const int is_root = (comm_rank == i_root_group);
            int *send_buf = NULL;
            if (is_root) {
               const int nbr_other = (comm_size+i_root_group)/2;
               int ii;
               send_buf = malloc(nbr_other*sizeof(*send_buf));
               for (ii=0; ii < nbr_other; ii++) {
                  send_buf[ii] = ii;
               }
            }
            MPI_Scatter(send_buf, 1, MPI_INT, NULL, 0, MPI_INT,
                        (is_root ? MPI_ROOT : MPI_PROC_NULL), intercomm);
            if (is_root) {
               free(send_buf);
            }
         }
         else {
            int an_int;
            MPI_Scatter(NULL, 0, MPI_INT, &an_int, 1, MPI_INT, 0, intercomm);
         }
         if (comm_rank == 0) {
            printf("ending Scatter i_root_group=%d\n", i_root_group);
         }
      }
      MPI_Comm_free(&intercomm);
      MPI_Comm_free(&subcomm);
   }
   else {
      fprintf(stderr, "%s: error this program must be started np > 1\n", argv[0]);
      ret_code = 1;
   }
   MPI_Finalize();
   return ret_code;
}
[OMPI users] RE : RE : MPI_Comm_connect() fails
Edgar,

I merged the changes you did from -r17848:17849 in the trunk to OpenMPI version 1.2.6rc2 with George's patch and my small examples now work.

Martin

From: users-boun...@open-mpi.org [users-boun...@open-mpi.org] on behalf of Edgar Gabriel [gabr...@cs.uh.edu]
Sent: March 17, 2008 15:59
To: Open MPI Users
Subject: Re: [OMPI users] RE : MPI_Comm_connect() fails

already working on it, together with a move_request

Thanks
Edgar

Jeff Squyres wrote:
> Edgar --
>
> Can you make a patch for the 1.2 series?
>
> On Mar 17, 2008, at 3:45 PM, Edgar Gabriel wrote:
>
>> Martin,
>>
>> I found the problem in the inter-allgather, and fixed it in patch 17849.
>> The same test using however MPI_Intercomm_create (just to simplify my
>> life compared to Connect/Accept) using 2 vs 4 processes in the two
>> groups passes for me -- and did fail with the previous version.
>>
>> Thanks
>> Edgar
>>
>> Audet, Martin wrote:
>>> Hi Jeff,
>>>
>>> As I said in my last message (see below) the patch (or at least
>>> the patch I got) doesn't fix the problem for me. Whether I apply it
>>> over OpenMPI 1.2.5 or 1.2.6rc2, I still get the same problem:
>>>
>>> The client aborts with a truncation error message while the server
>>> freezes when, for example, the server is started on 3 processes and
>>> the client on 2 processes.
>>>
>>> Feel free to try yourself the two small client and server programs
>>> I posted in my first message.
>>>
>>> Thanks,
>>>
>>> Martin
>>>
>>>
>>> Subject: [OMPI users] RE : users Digest, Vol 841, Issue 3
>>> From: Audet, Martin (Martin.Audet_at_[hidden])
>>> Date: 2008-03-13 17:04:25
>>>
>>> Hi Georges,
>>>
>>> Thanks for your patch, but I'm not sure I got it correctly. The
>>> patch I got modifies a few arguments passed to isend()/irecv()/recv()
>>> in coll_basic_allgather.c. Here is the patch I applied:
>>>
>>> Index: ompi/mca/coll/basic/coll_basic_allgather.c
>>> ===
>>> --- ompi/mca/coll/basic/coll_basic_allgather.c (revision 17814)
>>> +++ ompi/mca/coll/basic/coll_basic_allgather.c (working copy)
>>> @@ -149,7 +149,7 @@
>>>      }
>>>
>>>      /* Do a send-recv between the two root procs. to avoid deadlock */
>>> -    err = MCA_PML_CALL(isend(sbuf, scount, sdtype, 0,
>>> +    err = MCA_PML_CALL(isend(sbuf, scount, sdtype, root,
>>>                               MCA_COLL_BASE_TAG_ALLGATHER,
>>>                               MCA_PML_BASE_SEND_STANDARD,
>>>                               comm, &reqs[rsize]));
>>> @@ -157,7 +157,7 @@
>>>          return err;
>>>      }
>>>
>>> -    err = MCA_PML_CALL(irecv(rbuf, rcount, rdtype, 0,
>>> +    err = MCA_PML_CALL(irecv(rbuf, rcount, rdtype, root,
>>>                               MCA_COLL_BASE_TAG_ALLGATHER, comm,
>>>                               &reqs[0]));
>>>      if (OMPI_SUCCESS != err) {
>>> @@ -186,14 +186,14 @@
>>>          return err;
>>>      }
>>>
>>> -    err = MCA_PML_CALL(isend(rbuf, rsize * rcount, rdtype, 0,
>>> +    err = MCA_PML_CALL(isend(rbuf, rsize * scount, sdtype, root,
>>>                               MCA_COLL_BASE_TAG_ALLGATHER,
>>>                               MCA_PML_BASE_SEND_STANDARD, comm,
>>>                               &req));
>>>      if (OMPI_SUCCESS != err) {
>>>          goto exit;
>>>      }
>>>
>>> -    err = MCA_PML_CALL(recv(tmpbuf, size * scount, sdtype, 0,
>>> +    err = MCA_PML_CALL(recv(tmpbuf, size * rcount, rdtype, root,
>>>                              MCA_COLL_BASE_TAG_ALLGATHER, comm,
>>>                              MPI_STATUS_IGNORE));
>>>      if (OMPI_SUCCESS != err) {
>>>
>>> However with this patch, I still have the problem. Suppose I start
>>> the server with three processes and the client with two, the client
>>> prints:
>>>
>>> [audet_at_linux15 dyn_connect]$ mpiexec --universe univ1 -n 2 ./aclient '0.2.0:2000'
>>> intercomm_flag = 1
>>> intercomm_remote_size = 3
>>> rem_rank_tbl[3] = { 0 1 2}
>>> [linux15:26114] *** An error occurred in MPI_Allgather
>>> [linux15:26114] *** on communicator
>>> [linux15:2
Re: [OMPI users] RE : MPI_Comm_connect() fails
Hi Jeff,

As I said in my last message (see below) the patch (or at least the patch I got) doesn't fix the problem for me. Whether I apply it over OpenMPI 1.2.5 or 1.2.6rc2, I still get the same problem:

The client aborts with a truncation error message while the server freezes when, for example, the server is started on 3 processes and the client on 2 processes.

Feel free to try yourself the two small client and server programs I posted in my first message.

Thanks,

Martin

Subject: [OMPI users] RE : users Digest, Vol 841, Issue 3
From: Audet, Martin (Martin.Audet_at_[hidden])
List-Post: users@lists.open-mpi.org
Date: 2008-03-13 17:04:25

Hi Georges,

Thanks for your patch, but I'm not sure I got it correctly. The patch I got modifies a few arguments passed to isend()/irecv()/recv() in coll_basic_allgather.c. Here is the patch I applied:

Index: ompi/mca/coll/basic/coll_basic_allgather.c
===
--- ompi/mca/coll/basic/coll_basic_allgather.c (revision 17814)
+++ ompi/mca/coll/basic/coll_basic_allgather.c (working copy)
@@ -149,7 +149,7 @@
     }

     /* Do a send-recv between the two root procs. to avoid deadlock */
-    err = MCA_PML_CALL(isend(sbuf, scount, sdtype, 0,
+    err = MCA_PML_CALL(isend(sbuf, scount, sdtype, root,
                              MCA_COLL_BASE_TAG_ALLGATHER,
                              MCA_PML_BASE_SEND_STANDARD,
                              comm, &reqs[rsize]));
@@ -157,7 +157,7 @@
         return err;
     }

-    err = MCA_PML_CALL(irecv(rbuf, rcount, rdtype, 0,
+    err = MCA_PML_CALL(irecv(rbuf, rcount, rdtype, root,
                              MCA_COLL_BASE_TAG_ALLGATHER, comm,
                              &reqs[0]));
     if (OMPI_SUCCESS != err) {
@@ -186,14 +186,14 @@
         return err;
     }

-    err = MCA_PML_CALL(isend(rbuf, rsize * rcount, rdtype, 0,
+    err = MCA_PML_CALL(isend(rbuf, rsize * scount, sdtype, root,
                              MCA_COLL_BASE_TAG_ALLGATHER,
                              MCA_PML_BASE_SEND_STANDARD, comm,
                              &req));
     if (OMPI_SUCCESS != err) {
         goto exit;
     }

-    err = MCA_PML_CALL(recv(tmpbuf, size * scount, sdtype, 0,
+    err = MCA_PML_CALL(recv(tmpbuf, size * rcount, rdtype, root,
                             MCA_COLL_BASE_TAG_ALLGATHER, comm,
                             MPI_STATUS_IGNORE));
     if (OMPI_SUCCESS != err) {

However with this patch, I still have the problem. Suppose I start the server with three processes and the client with two, the client prints:

[audet_at_linux15 dyn_connect]$ mpiexec --universe univ1 -n 2 ./aclient '0.2.0:2000'
intercomm_flag = 1
intercomm_remote_size = 3
rem_rank_tbl[3] = { 0 1 2}
[linux15:26114] *** An error occurred in MPI_Allgather
[linux15:26114] *** on communicator
[linux15:26114] *** MPI_ERR_TRUNCATE: message truncated
[linux15:26114] *** MPI_ERRORS_ARE_FATAL (goodbye)
mpiexec noticed that job rank 0 with PID 26113 on node linux15 exited on signal 15 (Terminated).
[audet_at_linux15 dyn_connect]$

and aborts. The server on the other side simply hangs (as before).

Regards,

Martin

-----Original Message-----
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Jeff Squyres
Sent: March 14, 2008 19:45
To: Open MPI Users
Subject: Re: [OMPI users] RE : MPI_Comm_connect() fails

Yes, please let us know if this fixes it. We're working on a 1.2.6 release; we can definitely put this fix in there if it's correct. Thanks!
On Mar 13, 2008, at 4:07 PM, George Bosilca wrote:

> I dug into the sources and I think you correctly pinpointed the bug.
> It seems we have a mismatch between the local and remote sizes in
> the inter-communicator allgather in the 1.2 series (which explains
> the message truncation error when the local and remote groups have a
> different number of processes). Attached to this email you can find
> a patch that [hopefully] solves this problem. If you can, please test
> it and let me know if this solves your problem.
>
> Thanks,
> george.
>
>
> On Mar 13, 2008, at 1:11 PM, Audet, Martin wrote:
>
>>
>> Hi,
>>
>> After re-checking the MPI standard (www.mpi-forum.org and MPI - The
>> Complete Reference), I'm more and more convinced that my small
>> example programs establishing an intercommunicator with
>> MPI_Comm_connect()/MPI_Comm_accept() over an MPI port and
>> exchanging data over it with MPI_Allgather() are correct. Especially
>> calling MPI_Allgather() with recvcount=1 (its third argument)
>> instead of the total number of MPI_INT that will be received (e.g.
>> intercomm_remote_size in the examples) is both correct and
>> consistent with MPI_Allgather() behavior on an intracommunicator (e.g.
>> "normal" communica
[OMPI users] RE : users Digest, Vol 841, Issue 3
Hi Georges,

Thanks for your patch, but I'm not sure I got it correctly. The patch I got modifies a few arguments passed to isend()/irecv()/recv() in coll_basic_allgather.c. Here is the patch I applied:

Index: ompi/mca/coll/basic/coll_basic_allgather.c
===
--- ompi/mca/coll/basic/coll_basic_allgather.c (revision 17814)
+++ ompi/mca/coll/basic/coll_basic_allgather.c (working copy)
@@ -149,7 +149,7 @@
     }

     /* Do a send-recv between the two root procs. to avoid deadlock */
-    err = MCA_PML_CALL(isend(sbuf, scount, sdtype, 0,
+    err = MCA_PML_CALL(isend(sbuf, scount, sdtype, root,
                              MCA_COLL_BASE_TAG_ALLGATHER,
                              MCA_PML_BASE_SEND_STANDARD,
                              comm, &reqs[rsize]));
@@ -157,7 +157,7 @@
         return err;
     }

-    err = MCA_PML_CALL(irecv(rbuf, rcount, rdtype, 0,
+    err = MCA_PML_CALL(irecv(rbuf, rcount, rdtype, root,
                              MCA_COLL_BASE_TAG_ALLGATHER, comm,
                              &reqs[0]));
     if (OMPI_SUCCESS != err) {
@@ -186,14 +186,14 @@
         return err;
     }

-    err = MCA_PML_CALL(isend(rbuf, rsize * rcount, rdtype, 0,
+    err = MCA_PML_CALL(isend(rbuf, rsize * scount, sdtype, root,
                              MCA_COLL_BASE_TAG_ALLGATHER,
                              MCA_PML_BASE_SEND_STANDARD, comm,
                              &req));
     if (OMPI_SUCCESS != err) {
         goto exit;
     }

-    err = MCA_PML_CALL(recv(tmpbuf, size * scount, sdtype, 0,
+    err = MCA_PML_CALL(recv(tmpbuf, size * rcount, rdtype, root,
                             MCA_COLL_BASE_TAG_ALLGATHER, comm,
                             MPI_STATUS_IGNORE));
     if (OMPI_SUCCESS != err) {

However with this patch, I still have the problem. Suppose I start the server with three processes and the client with two, the client prints:

[audet@linux15 dyn_connect]$ mpiexec --universe univ1 -n 2 ./aclient '0.2.0:2000'
intercomm_flag = 1
intercomm_remote_size = 3
rem_rank_tbl[3] = { 0 1 2}
[linux15:26114] *** An error occurred in MPI_Allgather
[linux15:26114] *** on communicator
[linux15:26114] *** MPI_ERR_TRUNCATE: message truncated
[linux15:26114] *** MPI_ERRORS_ARE_FATAL (goodbye)
mpiexec noticed that job rank 0 with PID 26113 on node linux15 exited on signal 15 (Terminated).
[audet@linux15 dyn_connect]$

and aborts. The server on the other side simply hangs (as before).

Regards,

Martin
[OMPI users] RE : MPI_Comm_connect() fails
Hi,

After re-checking the MPI standard (www.mpi-forum.org and MPI - The Complete Reference), I'm more and more convinced that my small example programs, which establish an intercommunicator with MPI_Comm_connect()/MPI_Comm_accept() over an MPI port and exchange data over it with MPI_Allgather(), are correct. In particular, calling MPI_Allgather() with recvcount=1 (its third argument) instead of the total number of MPI_INT that will be received (e.g. intercomm_remote_size in the examples) is both correct and consistent with MPI_Allgather() behavior on an intracommunicator (e.g. a "normal" communicator).

MPI_Allgather(&comm_rank, 1, MPI_INT, rem_rank_tbl, 1, MPI_INT, intercomm);

Also the recvbuf argument (the fourth argument) of MPI_Allgather() in the examples should have a size of intercomm_remote_size (e.g. the size of the remote group), not the sum of the local and remote group sizes, in both the client and server processes. The standard says that for all-to-all types of operations over an intercommunicator, a process sends and receives data from the remote group only (anyway it is not possible to exchange data with processes of the local group over an intercommunicator).

So, for me there is no reason for stopping the process with an error message complaining about message truncation. There should be no truncation; the sendcount, sendtype, recvcount and recvtype arguments of MPI_Allgather() are correct and consistent. So again, for me the OpenMPI behavior with my example looks more and more like a bug...

Concerning George's comment about valgrind and TCP/IP, I totally agree: messages reported by valgrind are only a clue of a bug, especially in this context, not a proof of a bug. Another clue is that my small examples work perfectly with mpich2 ch3:sock.
Regards,

Martin Audet

--
Message: 4
List-Post: users@lists.open-mpi.org
Date: Thu, 13 Mar 2008 08:21:51 +0100
From: jody
Subject: Re: [OMPI users] RE : MPI_Comm_connect() fails
To: "Open MPI Users"
Message-ID: <9b0da5ce0803130021l4ead0f91qaf43e4ac7d332...@mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1

Hi,
I think the recvcount argument you pass to MPI_Allgather should not be 1 but instead the number of MPI_INTs your buffer rem_rank_tbl can contain. As it stands now, you tell MPI_Allgather that it may only receive 1 MPI_INT. Furthermore, I'm not sure, but I think your receive buffer should be large enough to contain messages from *all* processes, and not just from the "far side".

Jody

--
Message: 6
List-Post: users@lists.open-mpi.org
Date: Thu, 13 Mar 2008 09:06:47 -0500
From: George Bosilca
Subject: Re: [OMPI users] RE : MPI_Comm_connect() fails
To: Open MPI Users
Message-ID: <82e9ff28-fb87-4ffb-a492-dde472d5d...@eecs.utk.edu>
Content-Type: text/plain; charset="us-ascii"

I am not aware of any problems with the allreduce/allgather. But we are aware of the problem with valgrind reporting non-initialized values when used with TCP. It's a long story, but I can guarantee that this should not affect a correct MPI application.

george.

PS: For those who want to know the details: we have to send a header over TCP which contains some very basic information, including the size of the fragment. Unfortunately, we have a 2-byte gap in the header. As we never initialize these 2 unused bytes, but we send them over the wire, valgrind correctly detects the non-initialized data transfer.

On Mar 12, 2008, at 3:58 PM, Audet, Martin wrote:

> Hi again,
>
> Thanks Pak for the link and for suggesting to start an "orted" daemon;
> by doing so my client and server jobs were able to establish an
> intercommunicator between them.
> However I modified my programs to perform an MPI_Allgather() of a
> single "int" over the new intercommunicator to test communication a
> little bit, and I did encounter problems. I am now wondering if
> there is a problem in MPI_Allreduce() itself for intercommunicators.
> Note that the same program runs without problems with mpich2
> (ch3:sock).
>
> For example if I start orted as follows:
>
> orted --persistent --seed --scope public --universe univ1
>
> and then start the server with three processes:
>
> mpiexec --universe univ1 -n 3 ./aserver
>
> it prints:
>
> Server port = '0.2.0:2000'
>
> Now if I start the client with two processes as follows (using the
> server port):
>
> mpiexec --universe univ1 -n 2 ./aclient '0.2.0:2000'
>
> The server prints:
>
> intercomm_flag = 1
> intercomm_remote_size = 2
> rem_rank_tbl[2] = { 0 1}
>
> which is the correct output. The client then prints:
>
> intercomm_flag = 1
> intercomm_remote_size = 3
> rem_rank_tbl[3] = { 0 1 2}
> [linux15:3089
[OMPI users] RE : MPI_Comm_connect() fails
Hi again,

Thanks Pak for the link and for suggesting to start an "orted" daemon; by doing so my client and server jobs were able to establish an intercommunicator between them.

However I modified my programs to perform an MPI_Allgather() of a single "int" over the new intercommunicator to test communication a little bit, and I did encounter problems. I am now wondering if there is a problem in MPI_Allreduce() itself for intercommunicators. Note that the same program runs without problems with mpich2 (ch3:sock).

For example if I start orted as follows:

orted --persistent --seed --scope public --universe univ1

and then start the server with three processes:

mpiexec --universe univ1 -n 3 ./aserver

it prints:

Server port = '0.2.0:2000'

Now if I start the client with two processes as follows (using the server port):

mpiexec --universe univ1 -n 2 ./aclient '0.2.0:2000'

The server prints:

intercomm_flag = 1
intercomm_remote_size = 2
rem_rank_tbl[2] = { 0 1}

which is the correct output. The client then prints:

intercomm_flag = 1
intercomm_remote_size = 3
rem_rank_tbl[3] = { 0 1 2}
[linux15:30895] *** An error occurred in MPI_Allgather
[linux15:30895] *** on communicator
[linux15:30895] *** MPI_ERR_TRUNCATE: message truncated
[linux15:30895] *** MPI_ERRORS_ARE_FATAL (goodbye)
mpiexec noticed that job rank 0 with PID 30894 on node linux15 exited on signal 15 (Terminated).

As you can see the first messages are correct, but the client job terminates with an error (and the server hangs). After re-reading the documentation about MPI_Allgather() over an intercommunicator, I don't see anything wrong in my simple code.
Also, if I run the client and server processes under valgrind, I get a few messages like:

   ==29821== Syscall param writev(vector[...]) points to uninitialised byte(s)
   ==29821==    at 0x36235C2130: writev (in /lib64/libc-2.3.5.so)
   ==29821==    by 0x7885583: mca_btl_tcp_frag_send (in /home/publique/openmpi-1.2.5/lib/openmpi/mca_btl_tcp.so)
   ==29821==    by 0x788501B: mca_btl_tcp_endpoint_send (in /home/publique/openmpi-1.2.5/lib/openmpi/mca_btl_tcp.so)
   ==29821==    by 0x7467947: mca_pml_ob1_send_request_start_prepare (in /home/publique/openmpi-1.2.5/lib/openmpi/mca_pml_ob1.so)
   ==29821==    by 0x7461494: mca_pml_ob1_isend (in /home/publique/openmpi-1.2.5/lib/openmpi/mca_pml_ob1.so)
   ==29821==    by 0x798BF9D: mca_coll_basic_allgather_inter (in /home/publique/openmpi-1.2.5/lib/openmpi/mca_coll_basic.so)
   ==29821==    by 0x4A5069C: PMPI_Allgather (in /home/publique/openmpi-1.2.5/lib/libmpi.so.0.0.0)
   ==29821==    by 0x400EED: main (aserver.c:53)
   ==29821== Address 0x40d6cac is not stack'd, malloc'd or (recently) free'd

in both the MPI_Allgather() and MPI_Comm_disconnect() calls, for client and server, with valgrind always reporting that the addresses in question are "not stack'd, malloc'd or (recently) free'd". So is there a problem with MPI_Allgather() on intercommunicators, or am I doing something wrong?
Thanks, Martin

/* aserver.c */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>

int main(int argc, char **argv)
{
   int comm_rank, comm_size;
   char port_name[MPI_MAX_PORT_NAME];
   MPI_Comm intercomm;
   int ok_flag;
   int intercomm_flag;
   int intercomm_remote_size;
   int *rem_rank_tbl;
   int ii;

   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &comm_rank);
   MPI_Comm_size(MPI_COMM_WORLD, &comm_size);

   ok_flag = (comm_rank != 0) || (argc == 1);
   MPI_Bcast(&ok_flag, 1, MPI_INT, 0, MPI_COMM_WORLD);
   if (!ok_flag) {
      if (comm_rank == 0) {
         fprintf(stderr, "Usage: %s\n", argv[0]);
      }
      MPI_Abort(MPI_COMM_WORLD, 1);
   }

   MPI_Open_port(MPI_INFO_NULL, port_name);
   if (comm_rank == 0) {
      printf("Server port = '%s'\n", port_name);
   }
   MPI_Comm_accept(port_name, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &intercomm);
   MPI_Close_port(port_name);

   MPI_Comm_test_inter(intercomm, &intercomm_flag);
   if (comm_rank == 0) {
      printf("intercomm_flag = %d\n", intercomm_flag);
   }
   assert(intercomm_flag != 0);

   MPI_Comm_remote_size(intercomm, &intercomm_remote_size);
   if (comm_rank == 0) {
      printf("intercomm_remote_size = %d\n", intercomm_remote_size);
   }

   rem_rank_tbl = malloc(intercomm_remote_size*sizeof(*rem_rank_tbl));
   MPI_Allgather(&comm_rank, 1, MPI_INT, rem_rank_tbl, 1, MPI_INT, intercomm);
   if (comm_rank == 0) {
      printf("rem_rank_tbl[%d] = {", intercomm_remote_size);
      for (ii = 0; ii < intercomm_remote_size; ii++) {
         printf(" %d", rem_rank_tbl[ii]);
      }
      printf("}\n");
   }
   free(rem_rank_tbl);

   MPI_Comm_disconnect(&intercomm);
   MPI_Finalize();
   return 0;
}

/* aclient.c */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <assert.h>

int main(int argc, char **argv)
{
   int comm_rank, comm_size;
   int ok_flag;
   MPI_Comm intercomm;
   int intercomm_flag;
   int
[OMPI users] MPI_Comm_connect() fails.
Hi, I'm experimenting with the MPI-2 functions supporting the client/server model in MPI (e.g. server and client are independently created MPI jobs establishing an intercommunicator between them at run time; see section 5.4 "Establishing Communication" of the MPI-2 standard document), and it looks as if MPI_Comm_connect() always fails.

That is, if I compile simple client/server programs as follows (for the source, see below):

   mpicc aserver.c -o aserver
   mpicc aclient.c -o aclient

I first start the server with:

   mpiexec -n 1 ./aserver

it prints:

   Server port = '0.1.0:2000'

and then I start the client as follows (providing the port name printed by the server):

   mpiexec -n 1 ./aclient '0.1.0:2000'

I get the following error with the client (the server continues to run unperturbed):

   [linux15:27660] [0,1,0] ORTE_ERROR_LOG: Not found in file dss/dss_unpack.c at line 209
   [linux15:27660] [0,1,0] ORTE_ERROR_LOG: Not found in file communicator/comm_dyn.c at line 186
   [linux15:27660] *** An error occurred in MPI_Comm_connect
   [linux15:27660] *** on communicator MPI_COMM_WORLD
   [linux15:27660] *** MPI_ERR_INTERN: internal error
   [linux15:27660] *** MPI_ERRORS_ARE_FATAL (goodbye)

Note that both are started on the same machine (hostname linux15). The same programs seem to work fine with mpich2 (ch3:sock device), so my question is: am I doing something wrong or is there a bug in OpenMPI?

I use OpenMPI version 1.2.5 configured as follows:

   ./configure --prefix=/usr/local/openmpi-1.2.5 --disable-mpi-f77 --disable-mpi-f90 --disable-mpi-cxx --disable-cxx-exceptions --with-io-romio-flags=--with-file-system=ufs+nfs

on a Linux x86_64 machine running Fedora Core 4.
Thanks, Martin Audet

/* aserver.c */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
   int comm_rank, comm_size;
   char port_name[MPI_MAX_PORT_NAME];
   MPI_Comm intercomm;
   int ok_flag;

   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &comm_rank);
   MPI_Comm_size(MPI_COMM_WORLD, &comm_size);

   ok_flag = (comm_rank != 0) || (argc == 1);
   MPI_Bcast(&ok_flag, 1, MPI_INT, 0, MPI_COMM_WORLD);
   if (!ok_flag) {
      if (comm_rank == 0) {
         fprintf(stderr, "Usage: %s\n", argv[0]);
      }
      MPI_Abort(MPI_COMM_WORLD, 1);
   }

   MPI_Open_port(MPI_INFO_NULL, port_name);
   if (comm_rank == 0) {
      printf("Server port = '%s'\n", port_name);
   }
   MPI_Comm_accept(port_name, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &intercomm);
   MPI_Close_port(port_name);

   MPI_Comm_disconnect(&intercomm);
   MPI_Finalize();
   return 0;
}

/* aclient.c */
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
   int comm_rank, comm_size;
   int ok_flag;
   MPI_Comm intercomm;

   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &comm_rank);
   MPI_Comm_size(MPI_COMM_WORLD, &comm_size);

   ok_flag = (comm_rank != 0) || ((argc == 2) && argv[1] && (*argv[1] != '\0'));
   MPI_Bcast(&ok_flag, 1, MPI_INT, 0, MPI_COMM_WORLD);
   if (!ok_flag) {
      if (comm_rank == 0) {
         fprintf(stderr, "Usage: %s mpi_port\n", argv[0]);
      }
      MPI_Abort(MPI_COMM_WORLD, 1);
   }

   while (MPI_Comm_connect((comm_rank == 0) ? argv[1] : 0, MPI_INFO_NULL, 0,
                           MPI_COMM_WORLD, &intercomm) != MPI_SUCCESS) {
      if (comm_rank == 0) {
         printf("MPI_Comm_connect() failed, sleeping and retrying...\n");
      }
      sleep(1);
   }

   MPI_Comm_disconnect(&intercomm);
   MPI_Finalize();
   return 0;
}
Re: [OMPI users] Suggestion: adding OMPI_ versions macros in mpi.h
Thanks Bert for the reply, but having these macros in ompi/version.h only when a special option is given to configure is useless for what I would like to enable in OpenMPI with the present suggestion. This is because the whole idea is to make it possible to write portable, MPI-compliant C/C++ programs that can choose at compile time, based on the exact OpenMPI version, whether or not to use workarounds for eventual bugs in OpenMPI. Declaring the version macros I suggested would make it possible to detect at compilation whether the current OpenMPI version is affected by a specific bug, and to activate a workaround if possible (or terminate compilation with an #error preprocessor directive if no workaround exists). With the help of the existing OPEN_MPI macro, these checks could easily be skipped when using an MPI implementation other than OpenMPI. And this would be very convenient, since the application would adjust itself to the OpenMPI implementation without any user intervention.

What I am describing is a common practice. I have checks in my code that test, for example, ROMIO_VERSION to activate workarounds for known bugs, or __GNUC__ or __INTEL_COMPILER to activate features of newer gcc or icc compiler versions (like the "restrict" pointer qualifier). But to do similar things with OpenMPI, we need these OMPI_ version macros defined by default in mpi.h. They have to be always defined; otherwise they lift no burden from users.
Regards, Martin > Hello, > > you can build your ompi with --with-devel-headers and use the header > : > > #define OMPI_MAJOR_VERSION 1 > #define OMPI_MINOR_VERSION 1 > #define OMPI_RELEASE_VERSION 4 > #define OMPI_GREEK_VERSION "" > > Bert > > Audet, Martin wrote: > > Hi, > > > > I would like to suggest you to add macros indicating the version of the > > OpenMPI library in the C/C++ header file mpi.h analogous to the > > parameter constants in the Fortran header file: > > > > parameter (OMPI_MAJOR_VERSION=1) > > parameter (OMPI_MINOR_VERSION=1) > > parameter (OMPI_RELEASE_VERSION=4) > > parameter (OMPI_GREEK_VERSION="") > > parameter (OMPI_SVN_VERSION="r13362") > > > > This would be very handy if someone discover a bug XYZ and a workaround > > for it in OpenMPI versions before (and not including) 1.1.4 for example > > and wants his code to be portable on many OpenMPI versions and also on > > other MPI-2 implementations. In this situation he could do something > > like this in a common C header file: > > > > #ifdef OPEN_MPI > > > > /* true iff (x.y.z < u.v.w) */ > > #define DOTTED_LESS_THAN(x,y,z, u,v,w) \ > > (((x) < (u)) || (((x) == (u)) && (((y) < (v)) || (((y) == (v)) && > > ((z) < (w)) > > > > # if DOTTED_LESS_THAN(OMPI_MAJOR_VERSION, OMPI_MINOR_VERSION, > > OMPI_RELEASE_VERSION, 1,1,4) > > # define USE_MPI_WORKAROUND_XYZ > > # endif > > > > #endif > > > > > > And later in the C source code: > > > > #ifdef USE_MPI_WORKAROUND_XYZ > > /* use the workaround */ > > #else > > /* use the normal method */ > > #endif > > > > > > Thanks, > > > > Martin > >
[OMPI users] Suggestion: adding OMPI_ versions macros in mpi.h
Hi, I would like to suggest adding macros indicating the version of the OpenMPI library to the C/C++ header file mpi.h, analogous to the parameter constants in the Fortran header file:

   parameter (OMPI_MAJOR_VERSION=1)
   parameter (OMPI_MINOR_VERSION=1)
   parameter (OMPI_RELEASE_VERSION=4)
   parameter (OMPI_GREEK_VERSION="")
   parameter (OMPI_SVN_VERSION="r13362")

This would be very handy if someone discovers a bug XYZ, and a workaround for it, in OpenMPI versions before (and not including) 1.1.4 for example, and wants his code to be portable to many OpenMPI versions and also to other MPI-2 implementations. In this situation he could do something like this in a common C header file:

   #ifdef OPEN_MPI

   /* true iff (x.y.z < u.v.w) */
   #define DOTTED_LESS_THAN(x,y,z, u,v,w) \
      (((x) < (u)) || (((x) == (u)) && (((y) < (v)) || (((y) == (v)) && ((z) < (w))))))

   # if DOTTED_LESS_THAN(OMPI_MAJOR_VERSION, OMPI_MINOR_VERSION, OMPI_RELEASE_VERSION, 1,1,4)
   #  define USE_MPI_WORKAROUND_XYZ
   # endif

   #endif

And later in the C source code:

   #ifdef USE_MPI_WORKAROUND_XYZ
   /* use the workaround */
   #else
   /* use the normal method */
   #endif

Thanks,

Martin
[OMPI users] mpicc adds a nonexistent directory to the include path.
Hi, I sometimes use OpenMPI, and it looks like the mpicc wrapper passes gcc a nonexistent directory with the -I option. If I ask mpicc how it calls gcc, it prints the following:

   [audet@linux15 libdfem]$ mpicc -show
   gcc -I/usr/local/openmpi-1.1.2/include -I/usr/local/openmpi-1.1.2/include/openmpi -pthread -L/usr/local/openmpi-1.1.2/lib -lmpi -lorte -lopal -ldl -Wl,--export-dynamic -lnsl -lutil -lm -ldl
   [audet@linux15 libdfem]$ ls /usr/local/openmpi-1.1.2/include /usr/local/openmpi-1.1.2/include/openmpi
   ls: /usr/local/openmpi-1.1.2/include/openmpi: No such file or directory
   /usr/local/openmpi-1.1.2/include:
   mpi.h  mpif-common.h  mpif-config.h  mpif.h
   [audet@linux15 libdfem]$

The directory '/usr/local/openmpi-1.1.2/include/openmpi' doesn't exist. And this explains the annoying warnings I get when I compile my sources (I like to have no warnings):

   cc1plus: warning: /usr/local/openmpi-1.1.2/include/openmpi: No such file or directory

This happens with OpenMPI 1.1.2 configured as follows:

   ./configure --prefix=/usr/local/openmpi-1.1.2 --disable-mpi-f90 --disable-mpi-cxx --disable-cxx-exceptions --with-io-romio-flags=--with-file-system=ufs+nfs

Thanks, Martin Audet
[OMPI users] configure script not hapy with OpenPBS
Hi, When I tried to install OpenMPI on the front node of a cluster using the OpenPBS batch system (e.g. the --with-tm=/usr/open-pbs argument to configure), it didn't work and I got this error message:

   --- MCA component pls:tm (m4 configuration macro)
   checking for MCA component pls:tm compile mode... dso
   checking tm.h usability... yes
   checking tm.h presence... yes
   checking for tm.h... yes
   looking for library in lib
   checking for tm_init in -lpbs... no
   looking for library in lib64
   checking for tm_init in -lpbs... no
   checking tm.h usability... yes
   checking tm.h presence... yes
   checking for tm.h... yes
   looking for library in lib
   checking for tm_finalize in -ltorque... no
   looking for library in lib64
   checking for tm_finalize in -ltorque... no
   configure: error: TM support requested but not found.  Aborting

By looking in the very long configure script, I found two typos in variable names:

   "ompi_check_tm_hapy" is set at lines 68164 and 76084
   "ompi_check_loadleveler_hapy" is set at line 73086

where the correct names are obviously "ompi_check_tm_happy" and "ompi_check_loadleveler_happy" (e.g. "happy", not "hapy"), judging from the variables used around them. I corrected the variable names, but unfortunately it didn't fix my problem; configure stopped with the same error message (maybe you should also correct this in your "svn" repository, since it may be a "latent" bug).

I'm now questioning why the configure script didn't find the 'tm_init' symbol in libpbs.a, since the following command:

   nm /usr/open-pbs/lib/libpbs.a | grep -e '\<tm_init\>' -e '\<tm_finalize\>'

prints:

   0cd0 T tm_finalize
   1270 T tm_init

Is it possible that on an EM64T Linux system the configure script requires lib/libpbs.a or lib64/libpbs.a to be a 64-bit library in order to be happy? (lib64/libpbs.a doesn't exist, and lib/libpbs.a is a 32-bit library on our system, since the OpenPBS version we use is a bit old (2.3.x) and doesn't appear to be 64-bit clean.)

Martin Audet
[OMPI users] MPI_LONG_LONG_INT != MPI_LONG_LONG
Hi, The current and previous versions of OpenMPI define the MPI_LONG_LONG_INT and MPI_LONG_LONG constants as the addresses of two distinct global variables (&ompi_mpi_long_long_int and &ompi_mpi_long_long respectively), which makes the following expression true: MPI_LONG_LONG_INT != MPI_LONG_LONG. After consulting the MPI standards, I noticed the following:

- The optional datatype corresponding to the optional C/C++ "long long" type is MPI_LONG_LONG_INT, according to article 3.2.2 "Message data" (www.mpi-forum.org/docs/mpi-11-html/node32.html) and article 10.2 "Defined Constants for C and Fortran" (www.mpi-forum.org/docs/mpi-11-html/node169.html) of the MPI 1.1 standard.

- The MPI_LONG_LONG optional datatype appeared for the first time in section 9.5.2 "External Data Representation: ``external32''" of the MPI 2.0 standard (www.mpi-forum.org/docs/mpi-20-html/node200.htm). This paragraph states that with the external32 data representation, this datatype is eight (8) bytes long.

- However, the previous statement was recognized as an error in the MPI 2.0 errata document (www.mpi-forum.org/docs/errata-20-2.html). The MPI 2.0 document should have used MPI_LONG_LONG_INT instead of MPI_LONG_LONG. The errata also states the following: "In addition, the type MPI_LONG_LONG should be added as an optional type; it is a synonym for MPI_LONG_LONG_INT."

This means that the real optional datatype corresponding to the C/C++ "long long" type is MPI_LONG_LONG_INT, and that, since MPI_LONG_LONG was mentioned by mistake in the MPI 2.0 standard document, the MPI_LONG_LONG predefined datatype constant is also accepted as a synonym for MPI_LONG_LONG_INT. We should therefore have MPI_LONG_LONG_INT == MPI_LONG_LONG, which is not the case in OpenMPI. So please have a look at this issue. Note that the MPICH and MPICH2 implementations satisfy MPI_LONG_LONG_INT == MPI_LONG_LONG.
Regards,

Martin Audet          E: martin DOT audet AT imi cnrc-nrc gc ca
Research Officer      T: 450-641-5034
Industrial Material Institute / National Research Council
75 de Mortagne, Boucherville, QC, J4B 6Y4, Canada
[OMPI users] Incorrect behavior for attributes attached to MPI_COMM_SELF.
Hi, It looks like there is a problem in OpenMPI 1.0.2 with how MPI_COMM_SELF attribute delete callback functions are handled by MPI_Finalize(). The following C program registers a callback function, associated with the MPI_COMM_SELF communicator, to be called during the first steps of MPI_Finalize(). As shown in this example, this can be used to make sure that global MPI_Datatype variables associated with global datatypes are freed by calling MPI_Type_free() before program exit (thus preventing ugly memory leaks/outstanding allocations when run under valgrind, for example). This mechanism is used by the library I'm working on, as well as by the PETSc library.

The program works by taking advantage of MPI-2 Standard Section 4.8 "Allowing User Functions at Process Termination". As it says, the MPI_Finalize() function calls the delete callback associated with the MPI_COMM_SELF attribute "before any other parts of MPI are affected". It also says that "calling MPI_Finalized() will return false in any of these callback functions". Section 4.9 of the MPI-2 Standard, "Determining Whether MPI Has Finished", moreover says that it can be determined whether MPI is active by calling MPI_Finalized(). It also reaffirms that MPI is active in the callback functions invoked by MPI_Finalize(). I think that an "active" MPI library here means that basic MPI functions like MPI_Type_free() can be called. The following small program therefore seems to conform to the MPI standard. However, when I run it (compiled with OpenMPI 1.0.2 mpicc), I get the following message:

   *** An error occurred in MPI_Type_free
   *** after MPI was finalized
   *** MPI_ERRORS_ARE_FATAL (goodbye)

Note that this program works well with mpich2. Please have a look at this problem.

Thanks, Martin Audet

#include <mpi.h>
#include <assert.h>
#include <stddef.h>

static int attr_delete_function(MPI_Comm p_comm, int p_keyval,
                                void *p_attribute_val, void *p_extra_state)
{
   assert(p_attribute_val != NULL);

   /* Get a reference on the datatype received. */
   MPI_Datatype *const cur_datatype = (MPI_Datatype *)(p_attribute_val);

   /* Free it if non null. */
   if (*cur_datatype != MPI_DATATYPE_NULL) {
      MPI_Type_free(cur_datatype);
      assert(*cur_datatype == MPI_DATATYPE_NULL);
   }
   return MPI_SUCCESS;
}

/* If p_datatype refers to a non null MPI datatype, this function will register a callback  */
/* function to free p_datatype and set it to MPI_DATATYPE_NULL. This callback will be       */
/* called during the first steps of the MPI_Finalize() function, when the state of the MPI  */
/* library still allows MPI functions to be called. This is done by associating an          */
/* attribute to the MPI_COMM_SELF communicator, as allowed by the MPI 2 standard (sec 4.8). */
static void add_type_free_callback(MPI_Datatype *p_datatype)
{
   int keyval;

   assert(p_datatype != NULL);

   /* First create the keyval.                                               */
   /* No callback function will be called when MPI_COMM_SELF is duplicated,  */
   /* and attr_delete_function() will be called when MPI_COMM_SELF is        */
   /* freed (e.g. during MPI_Finalize()).                                    */
   /* Since many callbacks can be associated with MPI_COMM_SELF to free many */
   /* datatypes, a new keyval has to be created every time.                  */
   MPI_Keyval_create(MPI_NULL_COPY_FN, &attr_delete_function, &keyval, NULL);

   /* Then associate this keyval to MPI_COMM_SELF and make sure the pointer  */
   /* to the datatype p_datatype is passed to the callback. */
   MPI_Attr_put(MPI_COMM_SELF, keyval, p_datatype);

   /* Free the keyval because it is no longer needed. */
   MPI_Keyval_free(&keyval);
}

typedef struct {
   short ss;
   int   ii;
} glb_struct_t;

MPI_Datatype glb_dtype = MPI_DATATYPE_NULL;

static void calc_glb_dtype(void)
{
   const int NB_MEM = 3;
   static int          len_tbl[3]  = { 1, 1, 1 };
   static MPI_Aint     disp_tbl[3] = { offsetof(glb_struct_t, ss),
                                       offsetof(glb_struct_t, ii),
                                       sizeof(glb_struct_t) };
   static MPI_Datatype type_tbl[3] = { MPI_SHORT, MPI_INT, MPI_UB };

   MPI_Type_struct(NB_MEM, len_tbl, disp_tbl, type_tbl, &glb_dtype);
   MPI_Type_commit(&glb_dtype);

   add_type_free_callback(&glb_dtype);
}

int main(int argc, char *argv[])
{
   MPI_Init(&argc, &argv);
   calc_glb_dtype();
   MPI_Finalize();
   return 0;
}
[O-MPI users] const_cast<>(), Alltoallw() and Spawn_multiple()
Hi, I just tried OpenMPI 1.0.1, and this time I had far fewer warnings related to the C++ API than with version 1.0.0 (I compile with g++ -Wall). I nonetheless looked at the C++ headers and found that the remaining warnings were still related to the use of C-style casts. Some of them simply cast away the const type qualifier to call the C API MPI functions. Those casts could easily be converted to the const_cast<>() operator, which is specifically designed for this. I found, however, that some others were simply wrong and lead to faulty operations. Those casts are located in the Intracomm::Alltoallw() and Intracomm::Spawn_multiple() methods. In the first method, a pointer to a table of const MPI::Datatype objects is cast into a pointer to a table of MPI_Datatype types, and in the second one, a pointer to a table of const MPI::Info objects is cast into a pointer to a table of MPI_Info types. That is, it is assumed that MPI::Datatype and MPI::Info have respectively the same memory layout as the corresponding C types MPI_Datatype and MPI_Info. This assumption is incorrect in both cases, even though the MPI::Datatype class contains only a single data member of type MPI_Datatype and the MPI::Info class contains only a single data member of type MPI_Info. It is incorrect because the MPI::Datatype and MPI::Info classes are polymorphic; that is, each of them contains at least one virtual method. Since polymorphic classes need to access the virtual method table (pointers to members and an offset to adjust "this"), the C++ compiler needs to insert at least one more member. In all the implementations I've seen, this is done by adding a member pointing to the virtual table of the exact class (named "__vtbl").
The resulting classes are then larger than they appear (e.g. on my IA32 Linux machine sizeof(MPI::Datatype)==8 and sizeof(MPI::Info)==8, even though sizeof(MPI_Datatype)==4 and sizeof(MPI_Info)==4); the memory layout differs, and therefore corresponding pointers cannot be converted by simple type casts. A table of MPI::Datatype objects then has to be converted into a table of MPI_Datatype via a temporary table and a small loop. The same is true for MPI::Info and MPI_Info. I modified the errhandler.h, intracomm.h and intracomm_inln.h files to implement these corrections. As expected, this removes the warnings during compilation and should correct the conversion problems in the Intracomm::Alltoallw() and Intracomm::Spawn_multiple() methods. Below is the difference between my modified version of OpenMPI and the original one. Please consider this patch for your next release.

Thanks,

Martin Audet, Research Officer
E: martin.au...@imi.cnrc-nrc.gc.ca  T: 450-641-5034
Industrial Material Institute, National Research Council
75 de Mortagne, Boucherville, QC J4B 6Y4, Canada

diff --recursive --unified openmpi-1.0.1/ompi/mpi/cxx/errhandler.h openmpi-1.0.1ma/ompi/mpi/cxx/errhandler.h
--- openmpi-1.0.1/ompi/mpi/cxx/errhandler.h	2005-11-11 14:21:36.0 -0500
+++ openmpi-1.0.1ma/ompi/mpi/cxx/errhandler.h	2005-12-14 15:29:56.0 -0500
@@ -124,7 +124,7 @@
 #if ! 0 /* OMPI_ENABLE_MPI_PROFILING */
     // $%%@#%# AIX/POE 2.3.0.0 makes us put in this cast here
     (void)MPI_Errhandler_create((MPI_Handler_function*) &ompi_mpi_cxx_throw_excptn_fctn,
-                                (MPI_Errhandler *) &mpi_errhandler);
+                                const_cast<MPI_Errhandler *>(&mpi_errhandler));
 #else
     pmpi_errhandler.init();
 #endif
@@ -134,7 +134,7 @@
   //this is called from MPI::Finalize
   inline void free() const {
 #if ! 0 /* OMPI_ENABLE_MPI_PROFILING */
-    (void)MPI_Errhandler_free((MPI_Errhandler *) &mpi_errhandler);
+    (void)MPI_Errhandler_free(const_cast<MPI_Errhandler *>(&mpi_errhandler));
 #else
     pmpi_errhandler.free();
 #endif
diff --recursive --unified openmpi-1.0.1/ompi/mpi/cxx/intracomm.h openmpi-1.0.1ma/ompi/mpi/cxx/intracomm.h
--- openmpi-1.0.1/ompi/mpi/cxx/intracomm.h	2005-11-11 14:21:36.0 -0500
+++ openmpi-1.0.1ma/ompi/mpi/cxx/intracomm.h	2005-12-14 16:09:29.0 -0500
@@ -228,6 +228,10 @@
   PMPI::Intracomm pmpi_comm;
 #endif
 
+  // Convert an array of p_nbr Info objects into an array of MPI_Info.
+  // A pointer to the allocated array is returned and must eventually be deleted.
+  static inline MPI_Info *convert_info_to_mpi_info(int p_nbr, const Info p_info_tbl[]);
+
 public:
   // JGS see above about friend decls
 #if ! 0 /* OMPI_ENABLE_MPI_PROFILING */
   static Op* current_op;
diff --recursive --unified openmpi-1.0.1/ompi/mpi/cxx/intracomm_inln.h openmpi-1.0.1ma/ompi/mpi/cxx/intracomm_inln.h
--- openmpi-1.0.1/ompi/mpi/cxx/intracomm_inln.h	2005-11-30 06:06:07.0 -0500
+++ openmpi-1.0.1ma/ompi/mpi/cxx/intracomm_inln.h	2005-12-14 16:09:35.0 -0500
@@ -144,13 +144,26 @@
                      void *recvbuf, const int recvcounts[],
                      const int rdispls[], const Datatype recvtypes[]) const
 {
+  const int comm_size = Get_size();
+  MPI_Datatype *const data_type_tbl
[O-MPI users] MPI_Offset and C++ interface
Hi, I just compiled my library with version 1.0 of OpenMPI, and I had two problems.

First, the MPI_Offset datatype is defined as a preprocessor macro in mpi.h, as follows:

   /* Type of MPI_Offset */
   #define MPI_Offset long long

This generates a syntax error when MPI_Offset is used in C++ for what Stroustrup calls a value construction (e.g. type ( expr_list ), cf. section 6.2 of The C++ Programming Language). For example the following code:

   MPI_Offset ofs, size;
   int nbr;
   // compute ofs, size and nbr.
   ofs += MPI_Offset(nbr)*size;

cannot compile if MPI_Offset is defined as it currently is. The obvious solution is to define MPI_Offset as a typedef, as follows:

   /* Type of MPI_Offset */
   typedef long long MPI_Offset;

Note that a similar typedef is used for MPI_Aint:

   typedef long MPI_Aint;

The second problem is related to the C++ interface: it uses direct C-style type casts that remove constness. Since ISO C++ has the const_cast operator especially for this situation, the compiler generates TONS of warnings (I compile my code with -Wall and many other warnings activated), and this is really annoying. The solution is to replace the C-style casts with the const_cast operator. For example the MPI::Comm::Send method, defined in openmpi/ompi/mpi/cxx/comm_inln.h as follows:

   inline void
   MPI::Comm::Send(const void *buf, int count,
                   const MPI::Datatype & datatype, int dest, int tag) const
   {
     (void)MPI_Send((void *)buf, count, datatype, dest, tag, mpi_comm);
   }

becomes:

   inline void
   MPI::Comm::Send(const void *buf, int count,
                   const MPI::Datatype & datatype, int dest, int tag) const
   {
     (void)MPI_Send(const_cast<void *>(buf), count, datatype, dest, tag, mpi_comm);
   }

This fixes the annoying warning problem, because the const_cast operator is the intended method for removing constness.

Martin Audet          E: matin.au...@imi.cnrc-nrc.gc.ca
Research Officer      T: 450-641-5034
Industrial Material Institute
National Research Council of Canada
75 de Mortagne, Boucherville, QC, J4B 6Y4