[OMPI users] TCP btl misbehaves if btl_tcp_port_min_v4 is not set.
Hello all, (this _might_ be related to https://svn.open-mpi.org/trac/ompi/ticket/1505) I just compiled and installed 1.3.3 in a CentOS 5 environment and we noticed the processes would deadlock as soon as they started using TCP communications. The test program is one that has been running on other clusters for years with no problems. Furthermore, using local cores doesn't deadlock the process, whereas forcing inter-node communications (-bynode scheduling) immediately causes the problem.

Symptoms:
- processes don't crash or die; they use 100% CPU in system space (as opposed to user space)
- stracing one of the processes shows it is freewheeling in a polling loop
- executing with --mca btl_base_verbose 30 shows weird port assignments; either they are wrong or they should be interpreted as an offset from the default btl_tcp_port_min_v4 (1024)
- the error "mca_btl_tcp_endpoint_complete_connect] connect() to failed: No route to host (113)" _may_ be seen. We noticed it only showed up if we had vmnet interfaces up and running on certain nodes.

Note that setting oob_tcp_listen_mode=listen_thread, oob_tcp_if_include=eth0 and btl_tcp_if_include=eth0 was one of our first reactions to this, to no avail.

Workaround we found: while keeping the above-mentioned MCA parameters, we added btl_tcp_port_min_v4=2000 due to some firewall rules (which we had obviously disabled as part of the troubleshooting process) and noticed everything started working correctly from there on. This seems to work, but I can find no logical explanation, as the code seems to be clean in that respect.
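For readers landing here from a search, the workaround described above boils down to the following (parameter names as reported in this thread for the 1.3 series; the application name and process count are placeholders):

```shell
# Command-line form of the workaround from the report above:
mpirun --mca oob_tcp_listen_mode listen_thread \
       --mca oob_tcp_if_include eth0 \
       --mca btl_tcp_if_include eth0 \
       --mca btl_tcp_port_min_v4 2000 \
       -np 8 -bynode ./my_mpi_app

# Or persistently, in a per-user MCA parameter file
# ($HOME/.openmpi/mca-params.conf):
#   oob_tcp_listen_mode = listen_thread
#   oob_tcp_if_include  = eth0
#   btl_tcp_if_include  = eth0
#   btl_tcp_port_min_v4 = 2000
```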
Some pasting for people searching frantically for a solution:

[cluster-srv2:20379] btl: tcp: attempting to connect() to address 10.194.32.113 on port 2052
[cluster-srv2:20381] btl: tcp: attempting to connect() to address 10.194.32.113 on port 3076
[cluster-srv2:20377] btl: tcp: attempting to connect() to address 10.194.32.113 on port 260
[cluster-srv2:20383] btl: tcp: attempting to connect() to address 10.194.32.113 on port 3588
[cluster-srv1:19900] btl: tcp: attempting to connect() to address 10.194.32.117 on port 1540
[cluster-srv2:20377] btl: tcp: attempting to connect() to address 10.194.32.117 on port 2052
[cluster-srv2:20383] btl: tcp: attempting to connect() to address 10.194.32.117 on port 3076
[cluster-srv1:19894] btl: tcp: attempting to connect() to address 10.194.32.117 on port 516
[cluster-srv2:20379] btl: tcp: attempting to connect() to address 10.194.32.117 on port 3588
[cluster-srv1:19898] btl: tcp: attempting to connect() to address 10.194.32.117 on port 1028
[cluster-srv2:20381] btl: tcp: attempting to connect() to address 10.194.32.117 on port 2564
[cluster-srv1:19896] btl: tcp: attempting to connect() to address 10.194.32.117 on port 4
[cluster-srv3:13665] btl: tcp: attempting to connect() to address 10.194.32.115 on port 1028
[cluster-srv3:13663] btl: tcp: attempting to connect() to address 10.194.32.115 on port 4
[cluster-srv2][[44096,1],9][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
[cluster-srv2][[44096,1],13][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 10.194.32.117 failed: No route to host (113)
connect() to 10.194.32.117 failed: No route to host (113)
[cluster-srv3][[44096,1],20][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 10.194.32.115 failed: No route to host (113)

Cheers! Eric Thibodeau
Re: [OMPI users] Can 2 IB HCAs give twice the bandwidth?
Jeff Squyres wrote: On Oct 18, 2008, at 9:19 PM, Mostyn Lewis wrote:

Can OpenMPI do like Scali and MVAPICH2 and utilize 2 IB HCAs per machine to approach double the bandwidth on simple tests such as IMB PingPong?

Yes. OMPI will automatically (and aggressively) use as many active ports as you have. So you shouldn't need to list devices+ports -- OMPI will simply use all ports that it finds in the active state. If your ports are on physically separate IB networks, then each IB network will require a different subnet ID so that OMPI can compute reachability properly.

Does this apply to all fabrics, or at which level is this implemented in OMPI? (i.e.: multiple GigE NICs... but I doubt it applies given the restricted intricacies of the IP implementation) Eric
[OMPI users] Tuned Collective MCA params
Hello all, I am currently profiling a simple case where I replace multiple S/R (send/receive) calls with Allgather calls, and it would _seem_ the simple S/R calls are faster. Now, *before* I come to any conclusion on this, one of the pieces I am missing is more detail on how/if/when the tuned coll MCA component is selected. In other words, can I assume the tuned versions are used by default? I skimmed through the well-documented source code, but before I can even start to analyze the replacement's impact (on a small cluster), I need to know how and when the tuned coll MCA component is used/selected. Thanks, Eric
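For completeness, a quick way to inspect which coll components exist and what their priorities are is ompi_info (the component and parameter names below are from the 1.3-era tuned component and may differ between releases):

```shell
# List the collective components and their selection priorities; the
# available component with the highest priority wins for a communicator.
ompi_info --param coll all | grep -i priority

# Show the tuned component's parameters specifically:
ompi_info --param coll tuned

# Force the tuned component and watch the selection happen:
mpirun --mca coll_tuned_priority 100 \
       --mca coll_base_verbose 10 \
       -np 4 ./my_collective_benchmark
```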
Re: [OMPI users] Need help resolving No route to host error with OpenMPI 1.1.2
Simply to keep track of what's going on: I checked the build environment for openmpi and the system's settings; they were built using gcc 3.4.4 with -Os, which is reputed to be unstable and problematic with this compiler version. I've asked Prasanna to rebuild using -O2, but this could be a bit lengthy since the entire system (or at least all libs openmpi links to) needs to be rebuilt. Eric

Eric Thibodeau wrote: Prasanna, please send me your /etc/make.conf and the contents of /var/db/pkg/sys-cluster/openmpi-1.2.7/. You can package this with the following command line:

tar -cjf data.tbz /etc/make.conf /var/db/pkg/sys-cluster/openmpi-1.2.7/

And simply send me the data.tbz file. Thanks, Eric

Prasanna Ranganathan wrote: Hi, I did make sure at the beginning that only eth0 was activated on all the nodes. Nevertheless, I am currently verifying the NIC configuration on all the nodes and making sure things are as expected. While trying different things, I did come across this peculiar error which I had detailed in one of my previous mails in this thread. I am testing the helloWorld program in the following trivial case:

mpirun -np 1 -host localhost /main/mpiHelloWorld

Which works fine.
But,

mpirun -np 1 -host localhost --debug-daemons /main/mpiHelloWorld

always fails as follows:

Daemon [0,0,1] checking in as pid 2059 on host localhost
[idx1:02059] [0,0,1] orted: received launch callback
idx1 is node 0 of 1 ranks sum to 0
[idx1:02059] [0,0,1] orted_recv_pls: received message from [0,0,0]
[idx1:02059] [0,0,1] orted_recv_pls: received exit
[idx1:02059] *** Process received signal ***
[idx1:02059] Signal: Segmentation fault (11)
[idx1:02059] Signal code: (128)
[idx1:02059] Failing at address: (nil)
[idx1:02059] [ 0] /lib/libpthread.so.0 [0x2afa8c597f30]
[idx1:02059] [ 1] /usr/lib64/libopen-rte.so.0(orte_pls_base_close+0x18) [0x2afa8be8e2a2]
[idx1:02059] [ 2] /usr/lib64/libopen-rte.so.0(orte_system_finalize+0x70) [0x2afa8be795ac]
[idx1:02059] [ 3] /usr/lib64/libopen-rte.so.0(orte_finalize+0x20) [0x2afa8be7675c]
[idx1:02059] [ 4] orted(main+0x8a6) [0x4024ae]
[idx1:02059] *** End of error message ***

The failure happens with more verbose output when using the -d flag. Does this point to some bug in OpenMPI or am I missing something here? I have attached ompi_info output on this node. Regards, Prasanna.

___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] MPI_sendrecv = MPI_Send+ MPI_RECV ?
Sorry about that, I had misinterpreted your original post as being the pair of send-receive. The example you give below does seem correct indeed, which means you might have to show us the code that doesn't work. Note that I am in no way a Fortran expert; I'm more versed in C. The only hint I'd give a C programmer in this case is: make sure your receiving structures are indeed large enough (i.e.: you send 3d but eventually receive 4d... did you allocate for 3d or 4d for receiving the converted array?). Eric

Enrico Barausse wrote: sorry, I hadn't changed the subject. I'm reposting: Hi, I think it's correct. What I want to do is to send a 3d array from process 1 to process 0 (=root):

call MPI_Send(toroot,3,MPI_DOUBLE_PRECISION,root,n,MPI_COMM_WORLD

In some other part of the code process 0 acts on the 3d array and turns it into a 4d one and sends it back to process 1, which receives it with

call MPI_RECV(tonode,4,MPI_DOUBLE_PRECISION,root,n,MPI_COMM_WORLD,status,ierr)

In practice, what I do is basically given by this simple code (which doesn't give the segmentation fault, unfortunately):

a=(/1,2,3,4,5/)
call MPI_INIT(ierr)
call MPI_COMM_RANK(MPI_COMM_WORLD, id, ierr)
call MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr)
if(numprocs/=2) stop
if(id==0) then
   do k=1,5
      a=a+1
      call MPI_SEND(a,5,MPI_INTEGER,1,k,MPI_COMM_WORLD,ierr)
      call MPI_RECV(b,4,MPI_INTEGER,1,k,MPI_COMM_WORLD,status,ierr)
   end do
else
   do k=1,5
      call MPI_RECV(a,5,MPI_INTEGER,0,k,MPI_COMM_WORLD,status,ierr)
      b=a(1:4)
      call MPI_SEND(b,4,MPI_INTEGER,0,k,MPI_COMM_WORLD,ierr)
   end do
end if
Re: [OMPI users] MPI_sendrecv = MPI_Send+ MPI_RECV ?
Enrico Barausse wrote: Hello, I apologize in advance if my question is naive, but I started to use open-mpi only one week ago. I have a complicated Fortran 90 code which is giving me a segmentation fault (address not mapped). I tracked down the problem to the following lines:

call MPI_Send(toroot,3,MPI_DOUBLE_PRECISION,root,n,MPI_COMM_WORLD
call MPI_RECV(tonode,4,MPI_DOUBLE_PRECISION,root,n,MPI_COMM_WORLD,status,ierr)

Well, for starters, your receive count doesn't match the send count (4 vs 3). Is this a typo?

The MPI_Send is executed by a process (say 1) which sends the array toroot to another process (say 0). Process 0 successfully receives the array toroot (I print out its components and they are correct), does some calculations on it and sends back an array tonode to process 1. Nevertheless, the MPI_Send routine above never returns control to process 1 (although the array toroot seems to have been transmitted alright) and gives a segmentation fault (Signal code: Address not mapped (1)). Now, if I replace the two lines above with

call MPI_sendrecv(toroot,3,MPI_DOUBLE_PRECISION,root,n,tonode,4,MPI_DOUBLE_PRECISION,root,n,MPI_COMM_WORLD,status,ierr)

I get no errors and the code works perfectly (I tested it vs the serial version from which I started). But, and here is my question, shouldn't MPI_sendrecv be equivalent to MPI_Send followed by MPI_RECV? Thank you in advance for helping with this. Cheers, Enrico
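One general answer to the subject line's question, for archive readers: MPI_Send followed by MPI_Recv is *not* semantically identical to MPI_Sendrecv. MPI_Send is allowed to block until a matching receive is posted, so when both sides of an exchange send before they receive, the program can deadlock for messages too large for the eager protocol, whereas MPI_Sendrecv progresses the send and the receive concurrently. A minimal C sketch of the safe exchange pattern (illustrative only, not the poster's code; requires an MPI installation and mpirun -np 2 to run):

```c
/* Sketch: two ranks exchange one double each. If both ranks called
 * MPI_Send before MPI_Recv, each could block waiting for the other's
 * receive to be posted. MPI_Sendrecv avoids that ordering hazard by
 * progressing both operations at once. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, peer;
    double out, in = 0.0;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    peer = 1 - rank;              /* assumes exactly 2 ranks */
    out  = (double)(rank + 1);

    /* Safe: send to peer and receive from peer in one call. */
    MPI_Sendrecv(&out, 1, MPI_DOUBLE, peer, 0,
                 &in,  1, MPI_DOUBLE, peer, 0,
                 MPI_COMM_WORLD, &status);

    printf("rank %d received %f\n", rank, in);
    MPI_Finalize();
    return 0;
}
```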
Re: [OMPI users] Need help resolving No route to host error with OpenMPI 1.1.2
Prasanna, please send me your /etc/make.conf and the contents of /var/db/pkg/sys-cluster/openmpi-1.2.7/. You can package this with the following command line:

tar -cjf data.tbz /etc/make.conf /var/db/pkg/sys-cluster/openmpi-1.2.7/

And simply send me the data.tbz file. Thanks, Eric

Prasanna Ranganathan wrote: Hi, I did make sure at the beginning that only eth0 was activated on all the nodes. Nevertheless, I am currently verifying the NIC configuration on all the nodes and making sure things are as expected. While trying different things, I did come across this peculiar error which I had detailed in one of my previous mails in this thread. I am testing the helloWorld program in the following trivial case:

mpirun -np 1 -host localhost /main/mpiHelloWorld

Which works fine. But,

mpirun -np 1 -host localhost --debug-daemons /main/mpiHelloWorld

always fails as follows:

Daemon [0,0,1] checking in as pid 2059 on host localhost
[idx1:02059] [0,0,1] orted: received launch callback
idx1 is node 0 of 1 ranks sum to 0
[idx1:02059] [0,0,1] orted_recv_pls: received message from [0,0,0]
[idx1:02059] [0,0,1] orted_recv_pls: received exit
[idx1:02059] *** Process received signal ***
[idx1:02059] Signal: Segmentation fault (11)
[idx1:02059] Signal code: (128)
[idx1:02059] Failing at address: (nil)
[idx1:02059] [ 0] /lib/libpthread.so.0 [0x2afa8c597f30]
[idx1:02059] [ 1] /usr/lib64/libopen-rte.so.0(orte_pls_base_close+0x18) [0x2afa8be8e2a2]
[idx1:02059] [ 2] /usr/lib64/libopen-rte.so.0(orte_system_finalize+0x70) [0x2afa8be795ac]
[idx1:02059] [ 3] /usr/lib64/libopen-rte.so.0(orte_finalize+0x20) [0x2afa8be7675c]
[idx1:02059] [ 4] orted(main+0x8a6) [0x4024ae]
[idx1:02059] *** End of error message ***

The failure happens with more verbose output when using the -d flag. Does this point to some bug in OpenMPI or am I missing something here? I have attached ompi_info output on this node. Regards, Prasanna.
Re: [OMPI users] Need help resolving No route to host error with OpenMPI 1.1.2
Prasanna, I opened up a bug report to enable better control over the threading options (http://bugs.gentoo.org/show_bug.cgi?id=237435). In the meantime, if your helloWorld isn't too fluffy, could you send it over (off list if you prefer) so I can take a look at it; the segmentation fault is probably hinting at another problem. Also, could you send the output of ompi_info now that you've recompiled openmpi with USE=-threads; I want to make sure the option went through as I hope it should. Simply attach the file named out.txt after running the following command:

ompi_info > out.txt

...RTF files tend to make my eyes cross over ;) Thanks, Eric

Prasanna Ranganathan wrote: Hi, I have tried the following to no avail. On 499 machines running openMPI 1.2.7:

mpirun -np 499 -bynode -hostfile nodelist /main/mpiHelloWorld ...

with different combinations of the following parameters:

-mca btl_base_verbose 1
-mca btl_base_debug 2
-mca oob_base_verbose 1
-mca oob_tcp_debug 1
-mca oob_tcp_listen_mode listen_thread
-mca btl_tcp_endpoint_cache 65536
-mca oob_tcp_peer_retries 120

I still get the No route to host error messages. Also, I tried with -mca pls_rsh_num_concurrent 499 --debug-daemons and did not get any additional useful debug output other than the error messages. I did notice one strange thing though.
The following is always successful (at least in all my attempts):

mpirun -np 100 -bynode -hostfile nodelist /main/mpiHelloWorld

but

mpirun -np 100 -bynode -hostfile nodelist /main/mpiHelloWorld --debug-daemons

prints these error messages at the end from each of the nodes:

[idx2:04064] [0,0,1] orted_recv_pls: received message from [0,0,0]
[idx2:04064] [0,0,1] orted_recv_pls: received exit
[idx2:04064] *** Process received signal ***
[idx2:04064] Signal: Segmentation fault (11)
[idx2:04064] Signal code: (128)
[idx2:04064] Failing at address: (nil)
[idx2:04064] [ 0] /lib/libpthread.so.0 [0x2b92cc729f30]
[idx2:04064] [ 1] /usr/lib64/libopen-rte.so.0(orte_pls_base_close+0x18) [0x2b92cc0202a2]
[idx2:04064] [ 2] /usr/lib64/libopen-rte.so.0(orte_system_finalize+0x70) [0x2b92cc00b5ac]
[idx2:04064] [ 3] /usr/lib64/libopen-rte.so.0(orte_finalize+0x20) [0x2b92cc00875c]
[idx2:04064] [ 4] /usr/bin/orted(main+0x8a6) [0x4024ae]
[idx2:04064] *** End of error message ***

I am not sure if this points to the actual cause for these issues. Is it to do with the openMPI 1.2.7 having posix enabled in the current configuration on these nodes? Thanks again for your continued help. Regards, Prasanna.

Message: 2 Date: Thu, 11 Sep 2008 12:16:50 -0400 From: Jeff Squyres, Subject: Re: [OMPI users] Need help resolving No route to host error with OpenMPI 1.1.2, To: Open MPI Users, Message-ID: <7110e2d0-eb89-4293-a241-8487174b4...@cisco.com>, Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes

On Sep 10, 2008, at 9:29 PM, Prasanna Ranganathan wrote: I have upgraded to 1.2.7 and am still noticing the issue.

FWIW, we didn't change anything with regards to OOB and TCP from 1.2.6 -> 1.2.7, but it's still good to be at the latest version. Try running with this MCA parameter:

mpirun --mca oob_tcp_listen_mode listen_thread ...

Sorry; I forgot that we did not enable that option by default in the v1.2 series.
Re: [OMPI users] Need help resolving No route to host error with OpenMPI 1.1.2
Jeff Squyres wrote: On Sep 11, 2008, at 3:27 PM, Eric Thibodeau wrote:

Ok, added to the information from the README, I'm thinking none of the 3 configure options have an impact on the said 'threaded TCP listener' and the MCA option you suggested should still work, is this correct?

It should default to --with-threads=posix, which you'll need for the threaded listener (it just means that the system supports posix threads). You can either specify that explicitly or trust configure to get it right (you can examine the output of configure to check that it got it right -- but I'm sure it did).

On that matter, since we're modifying the package to correct this, how would I go about enabling `oob_tcp_listen_mode listen_thread` by default at compile time?

You can't at compile time, sorry. There are just too many MCA parameters for us to offer a configure parameter for each one of them. But you can set the global config file to set this MCA param value by default: http://www.open-mpi.org/faq/?category=tuning#setting-mca-params

Thanks, we're adding this as a default parameter to the openmpi package if the threads option was selected. Eric
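Concretely, the "global config file" route Jeff points to is a plain key = value text file (the paths below are the stock Open MPI defaults; a packaged install such as Gentoo's may place the system-wide file elsewhere under its prefix):

```shell
# System-wide default (adjust the prefix to your installation):
echo "oob_tcp_listen_mode = listen_thread" >> /usr/etc/openmpi-mca-params.conf

# Per-user alternative:
mkdir -p "$HOME/.openmpi"
echo "oob_tcp_listen_mode = listen_thread" >> "$HOME/.openmpi/mca-params.conf"
```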
Re: [OMPI users] Need help resolving No route to host error with OpenMPI 1.1.2
Jeff Squyres wrote: On Sep 11, 2008, at 2:38 PM, Eric Thibodeau wrote:

In short: which of the 3 options is the one known to be unstable in the following:

--enable-mpi-threads      Enable threads for MPI applications (default: disabled)
--enable-progress-threads Enable threads asynchronous communication progress (default: disabled)
--with-threads            Set thread type (solaris / posix)

You shouldn't need to specify any of these.

In long (rationale): just to make sure we don't contradict each other, you're suggesting the use of 'listen_thread' but, at the same time, I'm telling Prasanna to _disable_ the threads USE flag, which translates into the following logic (in the package):

Heh; yes, it's a bit confusing -- I apologize.

Don't; I forgot about the README, which is more explicit about the options and the fact that --with-threads=x was directly linked to the 2 other options; my bad.

The "threads" that I'm saying don't work is the MPI multi-threaded support (i.e., MPI_THREAD_MULTIPLE) and support for progress threads within MPI's progression engine. What *does* work is a tiny threaded TCP listener for incoming connections. Since the processing for each TCP connection takes a little time, we found that for scalability reasons it was good to have a tiny thread that does nothing but block on TCP accept(), get the connection, and then hand it off to the main back-end thread for processing. This allows our accept() rate to be quite high, even if the actual processing is slower. *This* is the "listen_thread" mode, and it turns out to be quite necessary for running at scale because our initial wireup coordination occurs over TCP -- there's a flood of incoming TCP connections back to the starter. With the threaded TCP listener, the accept rate is high enough to not cause timeouts for the incoming TCP flood.
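Jeff's description of the listener can be sketched outside of Open MPI with plain sockets and pthreads. Everything below (names, structure, the hand-off being a simple counter) is illustrative only, not Open MPI's actual code:

```c
/* Hypothetical sketch of the "listen_thread" idea: a dedicated thread
 * blocks on accept() and only hands the new socket off, so the accept
 * rate stays high even when per-connection processing is slow. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <pthread.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

static int handed_off = 0;                  /* connections passed along */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

struct listener_args { int listen_fd; int expected; };

static void *listen_thread(void *p)         /* accept + hand off, nothing else */
{
    struct listener_args *a = p;
    for (int i = 0; i < a->expected; i++) {
        int conn = accept(a->listen_fd, NULL, NULL);
        if (conn < 0)
            break;
        pthread_mutex_lock(&lock);          /* "hand off": here we just count; */
        handed_off++;                       /* a real implementation would     */
        pthread_mutex_unlock(&lock);        /* enqueue conn for a worker       */
        close(conn);
    }
    return NULL;
}

/* Create a loopback listener, run the accept thread, connect to it once;
 * returns the number of connections the listener handed off. */
int run_listener_demo(void)
{
    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;                      /* let the kernel pick a port */
    bind(lfd, (struct sockaddr *)&addr, sizeof addr);
    socklen_t len = sizeof addr;
    getsockname(lfd, (struct sockaddr *)&addr, &len);
    listen(lfd, 16);

    struct listener_args args = { lfd, 1 };
    pthread_t tid;
    pthread_create(&tid, NULL, listen_thread, &args);

    int cfd = socket(AF_INET, SOCK_STREAM, 0);  /* the "incoming" connection */
    connect(cfd, (struct sockaddr *)&addr, sizeof addr);
    close(cfd);

    pthread_join(tid, NULL);
    close(lfd);
    return handed_off;
}
```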
Ok, added to the information from the README, I'm thinking none of the 3 configure options have an impact on the said 'threaded TCP listener' and the MCA option you suggested should still work, is this correct?

Hope that made sense...

It did; I just want to make sure we're not disabling the listener thread. On that matter, since we're modifying the package to correct this, how would I go about enabling `oob_tcp_listen_mode listen_thread` by default at compile time? Many thanks, Eric
Re: [OMPI users] Need help resolving No route to host error with OpenMPI 1.1.2
Jeff, in short: which of the 3 options is the one known to be unstable in the following:

--enable-mpi-threads      Enable threads for MPI applications (default: disabled)
--enable-progress-threads Enable threads asynchronous communication progress (default: disabled)
--with-threads            Set thread type (solaris / posix)

?

In long (rationale): just to make sure we don't contradict each other, you're suggesting the use of 'listen_thread' but, at the same time, I'm telling Prasanna to _disable_ the threads USE flag, which translates into the following logic (in the package):

if use threads; then
    myconf="${myconf} --enable-mpi-threads --with-progress-threads --with-threads=posix"
fi

The decision was made based on the configure --help information (most probably from the 1.1 series), which led to arbitrarily enabling/disabling all that has to do with threads using a single keyword. Now, based on: https://svn.open-mpi.org/trac/ompi/wiki/ThreadSafetySupport

So, is it only --enable-mpi-threads that is unstable in the "*thread*" options? Thanks, Eric

Jeff Squyres wrote: On Sep 10, 2008, at 9:29 PM, Prasanna Ranganathan wrote: I have upgraded to 1.2.7 and am still noticing the issue.

FWIW, we didn't change anything with regards to OOB and TCP from 1.2.6 -> 1.2.7, but it's still good to be at the latest version. Try running with this MCA parameter:

mpirun --mca oob_tcp_listen_mode listen_thread ...

Sorry; I forgot that we did not enable that option by default in the v1.2 series.
Re: [OMPI users] Need help resolving No route to host error with OpenMPI 1.1.2
Jeff Squyres wrote: I'm not sure what USE=-threads means, but I would discourage the use of threads in the v1.2 series; our thread support is pretty much broken in the 1.2 series.

That's exactly what it means, hence the following BFW I had originally inserted in the package to this effect:

ewarn
ewarn "WARNING: use of threads is still disabled by default in"
ewarn "upstream builds."
ewarn "You may stop now and set USE=-threads"
ewarn
epause 5

...ok, so it's maybe not that B and F, but it's still there to be noticed and logged ;)

On Sep 10, 2008, at 7:52 PM, Eric Thibodeau wrote: Prasanna, also make sure you try with USE=-threads ...as the ebuild states, it's _experimental_ ;) Keep your eye on: https://svn.open-mpi.org/trac/ompi/wiki/ThreadSafetySupport Eric

Prasanna Ranganathan wrote: Hi, I have upgraded my openMPI to 1.2.6 (we have Gentoo and emerge showed 1.2.6-r1 to be the latest stable version of openMPI). I do still get the following error message when running my test helloWorld program:

[10.12.77.21][0,1,95][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[10.12.16.13][0,1,408][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[10.12.77.15][0,1,89][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[10.12.77.22][0,1,96][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113

Again, this error does not happen with every run of the test program and occurs only certain times. How do I take care of this? Regards, Prasanna.
On 9/9/08 9:00 AM, "users-requ...@open-mpi.org" <users-requ...@open-mpi.org> wrote:

Message: 1 Date: Mon, 8 Sep 2008 16:43:33 -0400 From: Jeff Squyres <jsquy...@cisco.com> Subject: Re: [OMPI users] Need help resolving No route to host error with OpenMPI 1.1.2 To: Open MPI Users <us...@open-mpi.org> Message-ID: <af302d68-0d30-469e-afd3-566ff9628...@cisco.com> Content-Type: text/plain; charset=WINDOWS-1252; format=flowed; delsp=yes

Are you able to upgrade to Open MPI v1.2.7? There were *many* bug fixes and changes in the 1.2 series compared to the 1.1 series; some, in particular, were dealing with TCP socket timeouts (which are important when dealing with large numbers of MPI processes).

On Sep 8, 2008, at 4:36 PM, Prasanna Ranganathan wrote: Hi, I am trying to run a test mpiHelloWorld program that simply initializes the MPI environment on all the nodes, prints the hostname and rank of each node in the MPI process group and exits. I am using MPI 1.1.2 and am running 997 processes on 499 nodes (nodes have 2 dual-core CPUs). I get the following error messages when I run my program as follows:

mpirun -np 997 -bynode -hostfile nodelist /main/mpiHelloWorld
.
.
.
[0,1,380][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
[0,1,142][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
[0,1,140][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
[0,1,390][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
connect() failed with errno=113
connect() failed with errno=113
connect() failed with errno=113
[0,1,138][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[0,1,384][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
[0,1,144][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[0,1,388][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[0,1,386][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[0,1,139][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
connect() failed with errno=113
.
.

The main thing is that I get these error messages around 3-4 times out of 10 attempts, with the rest all completing successfully. I have looked into the FAQs in detail and also checked the tcp btl settings but am not able to figure it out. All the 499 nodes have only eth0 active and I get the error even when I run the following:

mpirun -np 997 -bynode -hostfile nodelist --mca btl_tcp_if_include eth0 /main/mpiHelloWorld

I have attached the output of ompi_info -all. The following is the output of /sbin/ifconfig on the node where I start the mpi process (it is one of the 499 nodes):

eth0  Link encap:Ethernet  HWaddr 00:03:25:44:8F:D6
      inet addr:10.12.1.11  Bcast:10.12.255.255  Mask:255.255.0.0
      UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
      RX packets:1978724556 errors:17 dropped:0 overruns:0 frame:17
      TX packets:1767028063 errors:0 dropped:0 overruns:0 carrier:0
      collisions:0 txqueue
Re: [OMPI users] Need help resolving No route to host error with OpenMPI 1.1.2
Prasanna Ranganathan wrote: Hi Eric, Thanks a lot for the reply. I am currently working on upgrading to 1.2.7. I do not quite follow your directions; what do you refer to when you say "try with USE=-threads..."?

I am referring to the USE variable, which is used to set global package specificities. If you want to disable threads only for openmpi, edit /etc/portage/package.use and add the following line to it:

sys-cluster/openmpi -threads

And re-emerge openmpi; this will disable threads.

Kindly excuse if it is a silly question and pardon my ignorance :D

It is related to using Gentoo; if you're new to it, I suggest you give the documentation a shot: http://www.gentoo.org/doc/en/index.xml?catid=gentoo Regards, Prasanna. Eric
Re: [OMPI users] Need help resolving No route to host error with OpenMPI 1.1.2
Prasanna, also make sure you try with USE=-threads ...as the ebuild states, it's _experimental_ ;) Keep your eye on: https://svn.open-mpi.org/trac/ompi/wiki/ThreadSafetySupport Eric

Prasanna Ranganathan wrote: Hi, I have upgraded my openMPI to 1.2.6 (we have Gentoo and emerge showed 1.2.6-r1 to be the latest stable version of openMPI). I do still get the following error message when running my test helloWorld program:

[10.12.77.21][0,1,95][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[10.12.16.13][0,1,408][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[10.12.77.15][0,1,89][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[10.12.77.22][0,1,96][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113

Again, this error does not happen with every run of the test program and occurs only certain times. How do I take care of this? Regards, Prasanna.

On 9/9/08 9:00 AM, "users-requ...@open-mpi.org" wrote:

Message: 1 Date: Mon, 8 Sep 2008 16:43:33 -0400 From: Jeff Squyres Subject: Re: [OMPI users] Need help resolving No route to host error with OpenMPI 1.1.2 To: Open MPI Users Content-Type: text/plain; charset=WINDOWS-1252; format=flowed; delsp=yes

Are you able to upgrade to Open MPI v1.2.7? There were *many* bug fixes and changes in the 1.2 series compared to the 1.1 series; some, in particular, were dealing with TCP socket timeouts (which are important when dealing with large numbers of MPI processes).

On Sep 8, 2008, at 4:36 PM, Prasanna Ranganathan wrote: Hi, I am trying to run a test mpiHelloWorld program that simply initializes the MPI environment on all the nodes, prints the hostname and rank of each node in the MPI process group and exits. I am using MPI 1.1.2 and am running 997 processes on 499 nodes (nodes have 2 dual-core CPUs).
I get the following error messages when I run my program as follows:

mpirun -np 997 -bynode -hostfile nodelist /main/mpiHelloWorld
.
.
.

[0,1,380][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
[0,1,142][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
[0,1,140][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
[0,1,390][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
connect() failed with errno=113
connect() failed with errno=113
connect() failed with errno=113
[0,1,138][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[0,1,384][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
[0,1,144][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[0,1,388][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[0,1,386][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[0,1,139][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
connect() failed with errno=113
.
.

The main thing is that I get these error messages around 3-4 times out of 10 attempts, with the rest all completing successfully. I have looked into the FAQs in detail and also checked the tcp btl settings but am not able to figure it out. All the 499 nodes have only eth0 active and I get the error even when I run the following:

mpirun -np 997 -bynode -hostfile nodelist --mca btl_tcp_if_include eth0 /main/mpiHelloWorld

I have attached the output of ompi_info -all.
The following is the output of /sbin/ifconfig on the node where I start the mpi process (it is one of the 499 nodes):

eth0  Link encap:Ethernet  HWaddr 00:03:25:44:8F:D6
      inet addr:10.12.1.11  Bcast:10.12.255.255  Mask:255.255.0.0
      UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
      RX packets:1978724556 errors:17 dropped:0 overruns:0 frame:17
      TX packets:1767028063 errors:0 dropped:0 overruns:0 carrier:0
      collisions:0 txqueuelen:1000
      RX bytes:580938897359 (554026.5 Mb)  TX bytes:689318600552 (657385.4 Mb)
      Interrupt:22 Base address:0xc000

lo    Link encap:Local Loopback
      inet addr:127.0.0.1  Mask:255.0.0.0
      UP LOOPBACK RUNNING  MTU:16436  Metric:1
      RX packets:70560 errors:0 dropped:0 overruns:0 frame:0
      TX packets:70560 errors:0 dropped:0 overruns:0 carrier:0
      collisions:0 txqueuelen:0
      RX bytes:339687635 (323.9 Mb)  TX bytes:339687635 (323.9 Mb)

Kindly help. Regards, Prasanna.
Re: [OMPI users] Need help resolving No route to host error with OpenMPI 1.1.2
Prasanna Ranganathan wrote: Hi, I have upgraded my openMPI to 1.2.6 (we have Gentoo and emerge showed 1.2.6-r1 to be the latest stable version of openMPI).

Prasanna, do a sync, 1.2.7 is in portage, and report back. Eric

I do still get the following error message when running my test helloWorld program:

[10.12.77.21][0,1,95][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[10.12.16.13][0,1,408][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[10.12.77.15][0,1,89][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[10.12.77.22][0,1,96][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113

Again, this error does not happen with every run of the test program and occurs only certain times. How do I take care of this? Regards, Prasanna.

On 9/9/08 9:00 AM, "users-requ...@open-mpi.org" wrote:

Message: 1 Date: Mon, 8 Sep 2008 16:43:33 -0400 From: Jeff Squyres Subject: Re: [OMPI users] Need help resolving No route to host error with OpenMPI 1.1.2 To: Open MPI Users Content-Type: text/plain; charset=WINDOWS-1252; format=flowed; delsp=yes

Are you able to upgrade to Open MPI v1.2.7? There were *many* bug fixes and changes in the 1.2 series compared to the 1.1 series; some, in particular, were dealing with TCP socket timeouts (which are important when dealing with large numbers of MPI processes).

On Sep 8, 2008, at 4:36 PM, Prasanna Ranganathan wrote: Hi, I am trying to run a test mpiHelloWorld program that simply initializes the MPI environment on all the nodes, prints the hostname and rank of each node in the MPI process group and exits. I am using MPI 1.1.2 and am running 997 processes on 499 nodes (nodes have 2 dual-core CPUs). I get the following error messages when I run my program as follows:

mpirun -np 997 -bynode -hostfile nodelist /main/mpiHelloWorld
.
.
.
[0,1,380][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[0,1,142][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[0,1,140][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[0,1,390][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[0,1,138][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[0,1,384][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[0,1,144][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[0,1,388][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[0,1,386][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[0,1,139][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
. .
The main thing is that I get these error messages around 3-4 times out of 10 attempts, with the rest all completing successfully. I have looked into the FAQs in detail and also checked the tcp btl settings but am not able to figure it out. All 499 nodes have only eth0 active and I get the error even when I run the following: mpirun -np 997 -bynode -hostfile nodelist --mca btl_tcp_if_include eth0 /main/mpiHelloWorld I have attached the output of ompi_info --all.
The following is the output of /sbin/ifconfig on the node where I start the mpi process (it is one of the 499 nodes):
eth0 Link encap:Ethernet HWaddr 00:03:25:44:8F:D6
     inet addr:10.12.1.11 Bcast:10.12.255.255 Mask:255.255.0.0
     UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
     RX packets:1978724556 errors:17 dropped:0 overruns:0 frame:17
     TX packets:1767028063 errors:0 dropped:0 overruns:0 carrier:0
     collisions:0 txqueuelen:1000
     RX bytes:580938897359 (554026.5 Mb) TX bytes:689318600552 (657385.4 Mb)
     Interrupt:22 Base address:0xc000
lo   Link encap:Local Loopback
     inet addr:127.0.0.1 Mask:255.0.0.0
     UP LOOPBACK RUNNING MTU:16436 Metric:1
     RX packets:70560 errors:0 dropped:0 overruns:0 frame:0
     TX packets:70560 errors:0 dropped:0 overruns:0 carrier:0
     collisions:0 txqueuelen:0
     RX bytes:339687635 (323.9 Mb) TX bytes:339687635 (323.9 Mb)
Kindly help. Regards, Prasanna. ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] Configure fails with icc 10.1.008
Jeff, Thanks... at 23h30 coffee is far off... I saw the proper section of config.log showing exactly that (hello world not working). For everyone else's benefit, ICC (up to 10.1.008) is _not_ compatible with GCC 4.2... (guess I'll have to retro back to the 4.1 series...) Eric Jeff Squyres wrote: This is not an Open MPI problem; Open MPI is simply reporting that your C++ compiler is not working. OMPI tests a trivial C++ program that uses the STL to ensure that your C++ compiler is working. It's essentially: #include <string> int main() { std::string foo = "Hello, world"; return 0; } You should probably check with Intel support for more details. On Dec 6, 2007, at 11:25 PM, Eric Thibodeau wrote: Hello all, I am unable to get past ./configure as ICC fails on C++ tests (see attached ompi-output.tar.gz). Configure was called without and then with sourcing `/opt/intel/cc/10.1.xxx/bin/iccvars.sh` as per one of the invocation options in icc's doc. I was unable to find the relevant (well... intelligible for me, that is ;P) cause of the failure in config.log. Any help would be appreciated. Thanks, Eric Thibodeau
[OMPI users] Configure fails with icc 10.1.008
Hello all, I am unable to get past ./configure as ICC fails on C++ tests (see attached ompi-output.tar.gz). Configure was called without and then with sourcing `/opt/intel/cc/10.1.xxx/bin/iccvars.sh` as per one of the invocation options in icc's doc. I was unable to find the relevant (well... intelligible for me, that is ;P) cause of the failure in config.log. Any help would be appreciated. Thanks, Eric Thibodeau ompi-output.tar.gz Description: application/gzip
Re: [OMPI users] Performance of MPI_Isend() worse than MPI_Send() and even MPI_Ssend()
George, For completeness' sake, from what I understand here, the only way to get "true" communication and computation overlap is to have an "MPI broker" thread which would take care of all communications in the form of blocking MPI calls. It is that thread which you call asynchronously and then let it manage the communications in the background... correct? Eric On October 15, 2007, George Bosilca wrote: > Eric, > > No, there is no documentation about this in Open MPI. However, what I > described here is not related to Open MPI; it's a general problem > with most/all MPI libraries. There are multiple scenarios where non- > blocking communications can improve the overall performance of a > parallel application. But, in general, the reason is related to > overlapping communications with computations, or communications with > communications. > > The problem is that using non-blocking will increase the critical > path compared with blocking, which usually never helps improve > performance. Now I'll explain the real reason behind that. The REAL > problem is that usually an MPI library cannot make progress while the > application is not in an MPI call. Therefore, as soon as the MPI > library returns after posting the non-blocking send, no progress is > possible on that send until the user goes back into the MPI library. If > you compare this with the case of a blocking send, there the library > does not return until the data is pushed onto the network buffers, i.e. > the library is the one in control until the send is completed. > > Thanks, > george. > > On Oct 15, 2007, at 2:23 PM, Eric Thibodeau wrote: > > > Hello George, > > > > What you're saying here is very interesting. I am presently > > profiling communication patterns for Parallel Genetic Algorithms > > and could not figure out why the async versions tended to be worse > > than the sync counterpart (imho, that was counter-intuitive).
What > > you're basically saying here is that the async communications > > actually add some synchronization overhead that can only be > > compensated if the application overlaps computation with the async > > communications? Is there some "official" reference/documentation for > > this behaviour from OpenMPI (I know the MPI standard doesn't define > > the actual implementation of the communications and therefore lets > > the implementer do as he pleases). > > > > Thanks, > > > > Eric > > > > On October 15, 2007, George Bosilca wrote: > >> Your conclusion is not necessarily/always true. The MPI_Isend is just > >> the non-blocking version of the send operation. As one can imagine, an > >> MPI_Isend + MPI_Wait increases the execution path [inside the MPI > >> library] compared with any blocking point-to-point communication, > >> leading to worse performance. The main interest of the MPI_Isend > >> operation is the possible overlap of computation with communications, > >> or the possible overlap between multiple communications. > >> > >> However, depending on the size of the message this might not be true. > >> For large messages, in order to keep the memory usage on the receiver > >> at a reasonable level, a rendezvous protocol is used. The sender > >> [after sending a small packet] waits until the receiver confirms the > >> message exchange (i.e. the corresponding receive operation has been > >> posted) to send the large data. Using MPI_Isend can lead to longer > >> execution times, as the real transfer will be delayed until the > >> program enters the next MPI call. > >> > >> In general, using non-blocking operations can improve the performance > >> of the application, if and only if the application is carefully > >> crafted. > >> > >> george. > >> > >> On Oct 14, 2007, at 2:38 PM, Jeremias Spiegel wrote: > >> > >>> Hi, > >>> I'm working with Open-Mpi on an infiniband-cluster and have some > >>> strange > >>> effect when using MPI_Isend().
To my understanding this should > >>> always be > >>> quicker than MPI_Send() and MPI_Ssend(), yet in my program both > >>> MPI_Send() > >>> and MPI_Ssend() reproducibly perform quicker than MPI_Isend(). Is there > >>> something > >>> obvious I'm missing? > >>> > >>> Regards, > >>> Jeremias > >>> ___ > >>> users mailing list > >>> us...@open-mpi.org > >>> http://www.open-mpi.org/mailman/listinfo.cgi/users > >> > >> > > > > > > > > -- > > Eric Thibodeau > > Neural Bucket Solutions Inc. > > T. (514) 736-1436 > > C. (514) 710-0517 > > -- Eric Thibodeau Neural Bucket Solutions Inc. T. (514) 736-1436 C. (514) 710-0517
Re: [OMPI users] "Address not mapped" error on user defined MPI_OP function
hehe... don't we all love it when a problem "fixes" itself. I was missing a line in my type creation to realign the elements correctly: // Displacement is RELATIVE to its first structure element! for(i=2; i >= 0; i--) Displacement[i] -= Displacement[0]; I'm attaching the functional code so that others can maybe see this one as an example ;) On Wednesday, April 4, 2007 at 11:47, Eric Thibodeau wrote: > Hello all, > > First off, please excuse the attached code as I may be naïve in my > attempts to implement my own MPI_OP. > > I am attempting to create my own MPI_OP to use with MPI_Allreduce. I > have been able to find very few examples on the net of creating MPI_OPs. > My present references are "MPI The Complete Reference, Volume 1, 2nd edition" > and some rather good slides I found at > http://www.mpi-hd.mpg.de/personalhomes/stiff/MPI/ . I am attaching my "proof > of concept" code which fails with: > > [kyron:14074] *** Process received signal *** > [kyron:14074] Signal: Segmentation fault (11) > [kyron:14074] Signal code: Address not mapped (1) > [kyron:14074] Failing at address: 0x801da600 > [kyron:14074] [ 0] [0x6ffa6440] > [kyron:14074] [ 1] > /home/kyron/openmpi_i686/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_allreduce_intra_recursivedoubling+0x700) > [0x6fbb0dd0] > [kyron:14074] [ 2] > /home/kyron/openmpi_i686/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_allreduce_intra_dec_fixed+0xb2) > [0x6fbae9a2] > [kyron:14074] [ 3] > /home/kyron/openmpi_i686/lib/libmpi.so.0(PMPI_Allreduce+0x1a6) [0x6ff61e86] > [kyron:14074] [ 4] AllReduceTest(main+0x180) [0x8048ee8] > [kyron:14074] [ 5] /lib/libc.so.6(__libc_start_main+0xe3) [0x6fcbd823] > [kyron:14074] *** End of error message *** > > > Eric Thibodeau > -- Eric Thibodeau Neural Bucket Solutions Inc.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <mpi.h>

#define V_LEN 10 // Vector length
#define E_CNT 10 // Element count

MPI_Op MPI_MySum;        // Custom sum operation
MPI_Datatype MPI_MyType; // We need this MPI datatype to make MPI aware of our custom structure
int i, j, true = 1;
int totalnodes, mynode;

typedef struct CustomType_t {
    float feat[V_LEN]; // Some vector of float
    float distc;       // An independent float value
    int number;        // A counter of a different type
} CustomType;

CustomType *SharedStruct;

void construct_MyType(void) {
    int i;
    CustomType p;
    int BlockLengths[3] = {V_LEN, 1, 1};
    MPI_Aint Displacement[3];
    MPI_Datatype types[3] = {MPI_FLOAT, MPI_FLOAT, MPI_INT};
    /* Compute relative displacements w/r to the Type's beginning address
     * using the portable technique */
    MPI_Get_address(&p.feat[0], &Displacement[0]);
    MPI_Get_address(&p.distc, &Displacement[1]);
    MPI_Get_address(&p.number, &Displacement[2]);
    // Displacement is RELATIVE to its first structure element!
    for (i = 2; i >= 0; i--)
        Displacement[i] -= Displacement[0];
    // It is good practice to include this in case
    // the compiler pads your data structure
    /*
    BlockLengths[3] = 1;
    types[3] = MPI_UB;
    Displacement[3] = sizeof(CustomType);
    */
    MPI_Type_create_struct(3, BlockLengths, Displacement, types, &MPI_MyType);
    MPI_Type_commit(&MPI_MyType); // important!!
    return;
}

void MySum(CustomType *cin, CustomType *cinout, int *len, MPI_Datatype *dptr) {
    int i, j;
    // Some sanity checking
    printf("\nIn MySum, Node %d with len=%d\n", mynode, *len);
    if (*dptr != MPI_MyType) {
        printf("Invalid datatype\n");
        MPI_Abort(MPI_COMM_WORLD, 3);
    }
    for (i = 0; i < *len; i++) {
        cinout[i].distc += cin[i].distc;
        cinout[i].number += cin[i].number;
        for (j = 0; j < V_LEN; j++)
            cinout[i].feat[j] += cin[i].feat[j];
    }
}

void PrintStruct(void) {
    // We print the result from all nodes:
    printf("Node %d has the following in SharedStruct:\n", mynode);
    for (i = 0; i < E_CNT; i++) {
        printf("D:%2.1f #:%d Vect:", SharedStruct[i].distc, SharedStruct[i].number);
        for (j = 0; j < V_LEN; j++)
            printf("%f,", SharedStruct[i].feat[j]);
        printf("\n");
    }
    printf("= Node %d =\n", mynode);
}

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &totalnodes);
    MPI_Comm_rank(MPI_COMM_WORLD, &mynode);
    // Create the MPI_MyType Type
    construct_MyType();
    // Create the MPI_MySum Operator
    MPI_Op_create((MPI_User_function *)MySum, true, &MPI_MySum);
    SharedStruct = (CustomType *)malloc(E_CNT * sizeof(CustomType));
    // The distc and number parts of the structure never get used at the moment...
    SharedStruct[0].distc = mynode + 1.0;
    SharedStruct[0].number = mynode;
    for (i = 0; i < V_LEN; i++)
        SharedStruct[0].feat[i] = mynode + i;
    // To speed up the process we replicate the first element using memcpy:
    for (i = 1; i < E_CNT; i++)
        memcpy((void *)&SharedStruct[i], (void *)SharedStruct, sizeof(CustomType));
    // Print before:
    PrintStruct();
    // We add the content of all nodes _on_ all nodes:
    MPI_Allreduce(MPI_IN_PLACE, SharedStruct, E_CNT, MPI_MyType, MPI_MySum, MPI_COMM_WORLD);
    // Print after:
    PrintStruct();
    return 0;
}
Re: [OMPI users] "Address not mapped" error on user defined MPI_OP function
I completely forgot to mention which version of OpenMPI I am using; I'll gladly post additional info if required:
kyron@kyron ~/openmpi-1.2 $ ompi_info | head
Open MPI: 1.2
Open MPI SVN revision: r14027
Open RTE: 1.2
Open RTE SVN revision: r14027
OPAL: 1.2
OPAL SVN revision: r14027
Prefix: /home/kyron/openmpi_i686
Configured architecture: i686-pc-linux-gnu
Configured by: kyron
Configured on: Wed Apr 4 10:21:34 EDT 2007
On Wednesday, April 4, 2007 at 11:47, Eric Thibodeau wrote: > Hello all, > > First off, please excuse the attached code as I may be naïve in my > attempts to implement my own MPI_OP. > > I am attempting to create my own MPI_OP to use with MPI_Allreduce. I > have been able to find very few examples on the net of creating MPI_OPs. > My present references are "MPI The Complete Reference, Volume 1, 2nd edition" > and some rather good slides I found at > http://www.mpi-hd.mpg.de/personalhomes/stiff/MPI/ . I am attaching my "proof > of concept" code which fails with: > > [kyron:14074] *** Process received signal *** > [kyron:14074] Signal: Segmentation fault (11) > [kyron:14074] Signal code: Address not mapped (1) > [kyron:14074] Failing at address: 0x801da600 > [kyron:14074] [ 0] [0x6ffa6440] > [kyron:14074] [ 1] > /home/kyron/openmpi_i686/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_allreduce_intra_recursivedoubling+0x700) > [0x6fbb0dd0] > [kyron:14074] [ 2] > /home/kyron/openmpi_i686/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_allreduce_intra_dec_fixed+0xb2) > [0x6fbae9a2] > [kyron:14074] [ 3] > /home/kyron/openmpi_i686/lib/libmpi.so.0(PMPI_Allreduce+0x1a6) [0x6ff61e86] > [kyron:14074] [ 4] AllReduceTest(main+0x180) [0x8048ee8] > [kyron:14074] [ 5] /lib/libc.so.6(__libc_start_main+0xe3) [0x6fcbd823] > [kyron:14074] *** End of error message *** > > > Eric Thibodeau
[OMPI users] "Address not mapped" error on user defined MPI_OP function
Hello all, First off, please excuse the attached code as I may be naïve in my attempts to implement my own MPI_OP. I am attempting to create my own MPI_OP to use with MPI_Allreduce. I have been able to find very few examples on the net of creating MPI_OPs. My present references are "MPI The Complete Reference, Volume 1, 2nd edition" and some rather good slides I found at http://www.mpi-hd.mpg.de/personalhomes/stiff/MPI/ . I am attaching my "proof of concept" code which fails with:
[kyron:14074] *** Process received signal ***
[kyron:14074] Signal: Segmentation fault (11)
[kyron:14074] Signal code: Address not mapped (1)
[kyron:14074] Failing at address: 0x801da600
[kyron:14074] [ 0] [0x6ffa6440]
[kyron:14074] [ 1] /home/kyron/openmpi_i686/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_allreduce_intra_recursivedoubling+0x700) [0x6fbb0dd0]
[kyron:14074] [ 2] /home/kyron/openmpi_i686/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_allreduce_intra_dec_fixed+0xb2) [0x6fbae9a2]
[kyron:14074] [ 3] /home/kyron/openmpi_i686/lib/libmpi.so.0(PMPI_Allreduce+0x1a6) [0x6ff61e86]
[kyron:14074] [ 4] AllReduceTest(main+0x180) [0x8048ee8]
[kyron:14074] [ 5] /lib/libc.so.6(__libc_start_main+0xe3) [0x6fcbd823]
[kyron:14074] *** End of error message ***
Eric Thibodeau

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <mpi.h>

#define V_LEN 10 // Vector length
#define E_CNT 10 // Element count

MPI_Op MPI_MySum;        // Custom sum operation
MPI_Datatype MPI_MyType; // We need this MPI datatype to make MPI aware of our custom structure
int i, j, true = 1;
int totalnodes, mynode;

typedef struct CustomType_t {
    float feat[V_LEN]; // Some vector of float
    float distc;       // An independent float value
    int number;        // A counter of a different type
} CustomType;

CustomType *SharedStruct;

void construct_MyType(void) {
    CustomType p;
    int BlockLengths[3] = {V_LEN, 1, 1};
    MPI_Aint Displacement[3];
    MPI_Datatype types[3] = {MPI_FLOAT, MPI_FLOAT, MPI_INT};
    /* Compute displacements w/r to the Type's beginning address
     * using the portable technique */
    MPI_Get_address(&p.feat[0], &Displacement[0]);
    MPI_Get_address(&p.distc, &Displacement[1]);
    MPI_Get_address(&p.number, &Displacement[2]);
    // It is good practice to include this in case
    // the compiler pads your data structure
    /*
    BlockLengths[3] = 1;
    types[3] = MPI_UB;
    Displacement[3] = sizeof(CustomType);
    */
    MPI_Type_create_struct(3, BlockLengths, Displacement, types, &MPI_MyType);
    MPI_Type_commit(&MPI_MyType); // important!!
    return;
}

void MySum(CustomType *cin, CustomType *cinout, int *len, MPI_Datatype *dptr) {
    int i, j;
    // Some sanity checking
    printf("\nIn MySum, Node %d with len=%d\n", mynode, *len);
    if (*dptr != MPI_MyType) {
        printf("Invalid datatype\n");
        MPI_Abort(MPI_COMM_WORLD, 3);
    }
    for (i = 0; i < *len; i++) {
        cinout[i].distc += cin[i].distc;
        cinout[i].number += cin[i].number;
        for (j = 0; j < V_LEN; j++)
            cinout[i].feat[j] += cin[i].feat[j];
    }
}

void PrintStruct(void) {
    // We print the result from all nodes:
    printf("Node %d has the following in SharedStruct:\n", mynode);
    for (i = 0; i < E_CNT; i++) {
        printf("D:%2.1f #:%d Vect:", SharedStruct[i].distc, SharedStruct[i].number);
        for (j = 0; j < V_LEN; j++)
            printf("%f,", SharedStruct[i].feat[j]);
        printf("\n");
    }
    printf("= Node %d =\n", mynode);
}

main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &totalnodes);
    MPI_Comm_rank(MPI_COMM_WORLD, &mynode);
    // Create the MPI_MyType Type
    construct_MyType();
    // Create the MPI_MySum Operator
    MPI_Op_create((MPI_User_function *)MySum, true, &MPI_MySum);
    SharedStruct = (CustomType *)malloc(E_CNT * sizeof(CustomType));
    // The distc and number parts of the structure never get used at the moment...
    SharedStruct[0].distc = mynode + 1.0;
    SharedStruct[0].number = mynode;
    for (i = 0; i < V_LEN; i++)
        SharedStruct[0].feat[i] = mynode + i;
    // To speed up the process we replicate the first element using memcpy:
    for (i = 1; i < E_CNT; i++)
        memcpy((void *)&SharedStruct[i], (void *)SharedStruct, sizeof(CustomType));
    // Print before:
    PrintStruct();
    // We add the content of all nodes _on_ all nodes:
    MPI_Allreduce(MPI_IN_PLACE, SharedStruct, E_CNT, MPI_MyType, MPI_MySum, MPI_COMM_WORLD);
    // Print after:
    PrintStruct();
}
Re: [OMPI users] Compiling HPCC with OpenMPI
Hi Jeff, I had noticed that the library name switched, but thanks for pointing it out still ;) As for the compilation route, I chose to use mpicc as the preferred approach and indeed let the wrapper do the work. FWIW, I got HPCC running; now to find a nice way to sort through all the data ;) Eric On Monday, February 26, 2007 at 06:53, Jeff Squyres wrote: > Note that George listed the v1.2 OMPI libraries (-lopen-rte and > -lopen-pal) -- the v1.1.x names are slightly different (-lorte and > -lopal). We had to change the back-end library names between v1.1 and > v1.2 because someone else out in the Linux community uses "libopal". > > I typically prefer using "mpicc" as CC and LINKER and therefore > letting the OMPI wrapper handle everything for exactly this reason. > > > On Feb 21, 2007, at 12:39 PM, Eric Thibodeau wrote: > > > Hi George, > > > > Would you say this is preferred to changing the default CC + LINKER? > > Eric > > On Wednesday, February 21, 2007 at 12:04, George Bosilca wrote: > >> You should use something like this > >> MPdir = /usr/local/mpi > >> MPinc = -I$(MPdir)/include > >> MPlib = -L$(MPdir)/lib -lmpi -lopen-rte -lopen-pal > >> > >> george. > > > > _______ > > users mailing list > > us...@open-mpi.org > > http://www.open-mpi.org/mailman/listinfo.cgi/users > > -- Eric Thibodeau Neural Bucket Solutions Inc. T. (514) 736-1436 C. (514) 710-0517
Re: [OMPI users] Compiling HPCC with OpenMPI
Thanks Laurent, I will try your proposed settings. Note that I didn't want to use CC= and LINKER= since I don't know the probable impacts on the rest of the benchmarks... hmm... though this IS a clustering benchmark. Also note that I wasn't trying to compile for MPICH; I merely copied the lines from a "clean" config as a reference ;) Eric On Wednesday, February 21, 2007 at 11:48, Laurent Nguyen wrote: > Hello, > > I believe that you are trying to use MPICH, not Open MPI (libmpich.a). > Personally, I've compiled HPCC on IBM AIX with OpenMPI with these lines: > > # -- > # - Message Passing library (MPI) -- > # -- > # MPinc tells the C compiler where to find the Message Passing library > # header files, MPlib is defined to be the name of the library to be > # used. The variable MPdir is only used for defining MPinc and MPlib. > # > MPdir= > MPinc= > MPlib= > ... > CC = mpicc > > LINKER = mpicc > > But in my environment variable $PATH, I have the directory where the OpenMPI > executables are: //openmpi/bin > > I hope I could help you... > > Regards > > > ** > NGUYEN Anh-Khai Laurent - Ingénieur de Recherche > Equipe Support Utilisateur > > Email:laurent.ngu...@idris.fr > Tél :01.69.35.85.66 > Adresse :IDRIS - Institut du Développement et des Ressources en >Informatique Scientifique >CNRS >Batiment 506 >BP 167 >F - 91403 ORSAY Cedex > Site Web :http://www.idris.fr > ** > > Eric Thibodeau wrote: > > Hello all, > > > > As we all know, compiling for OpenMPI is not just a matter of adding -lmpi > > (http://www.open-mpi.org/faq/?category=mpi-apps). I have tried many > > different approaches on configuring the 3 crucial MPI lines in the HPCC > > Makefiles with no success. There seems to be no correct way to get mpicc > > --showme:* to return the correct info, and forcing the correct paths/info > > seems to be incorrect (ie, what OpenMPI lib do I point to here: MPlib = > > $(MPdir)/lib/libmpich.a) > > > > Any help would be greatly appreciated!
> > > > Excerpt from the Makefile: > > > > # -- > > # - Message Passing library (MPI) -- > > # -- > > # MPinc tells the C compiler where to find the Message Passing library > > # header files, MPlib is defined to be the name of the library to be > > # used. The variable MPdir is only used for defining MPinc and MPlib. > > # > > MPdir= /usr/local/mpi > > MPinc= -I$(MPdir)/include > > MPlib= $(MPdir)/lib/libmpich.a > > > -- Eric Thibodeau Neural Bucket Solutions Inc. T. (514) 736-1436 C. (514) 710-0517
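Putting Jeff's and Laurent's advice together, the simplest HPCC setup leaves the three MP* variables empty and uses the wrapper compiler for both compiling and linking. A sketch of the resulting Makefile section follows (the explicit-flags variant uses the v1.2 library names from Jeff's note; the MPdir path is illustrative):

```make
# Option 1 (preferred): let the mpicc wrapper supply include paths
# and libraries, so the MP* variables can stay empty.
MPdir  =
MPinc  =
MPlib  =
CC     = mpicc
LINKER = mpicc

# Option 2: explicit flags, only if the wrapper cannot be used
# (library names below are the Open MPI v1.2 ones):
# MPdir = /usr/local/mpi
# MPinc = -I$(MPdir)/include
# MPlib = -L$(MPdir)/lib -lmpi -lopen-rte -lopen-pal
```

With option 1, mpicc must of course be on $PATH, as Laurent points out.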
Re: [OMPI users] compiling mpptest using OpenMPI
Hi Jeff, I just tried with 1.2b4r13690 and the problem is still present. The only notable difference is that CTRL-C gave me "orterun: killing job..." but it stuck there until I hit CTRL-\ ... if it has any bearing on the issue. Again, the command line was: orterun -np 11 ./perftest-1.3c/mpptest -max_run_time 1800 -bisect -size 0 4096 1 -gnuplot -fname HyperTransport/Global_bisect_0_4096_1.gpl (only difference is that I had 11 procs instead of 9 available) On Friday, February 16, 2007 at 06:50, Jeff Squyres wrote: > Could you try one of the later nightly 1.2 tarballs? We just fixed a > shared memory race condition, for example: > > http://www.open-mpi.org/nightly/v1.2/ > > > On Feb 16, 2007, at 12:12 AM, Eric Thibodeau wrote: > > > Hello devs, > > > > Thought I would let you know there seems to be a problem with > > 1.2b3r13112 when running the "bisection" test on a Tyan VX50 > > machine (the 8 DualCore model with 32 Gigs of RAM). > > > > OpenMPI was compiled with (as seen from config.log): > > configure:116866: running /bin/sh './configure' CFLAGS="-O3 - > > DNDEBUG -finline-functions -fno-strict-aliasing -pthread" > > CPPFLAGS=" " FFLAGS="" LDFLAGS=" " --enable-shared --disable- > > static --prefix=/export/livia/home/parallel/eric/openmpi_x86_64 -- > > with-mpi=open_mpi --cache-file=/dev/null --srcdir=. > > > > MPPTEST (1.3c) was compiled with: > > ./configure --with-mpi=$HOME/openmpi_`uname -m` > > > > ...which, for some reason, works fine on that system that doesn't > > have any other MPI implementation (ie: doesn't have LAM-MPI as per > > this thread). > > > > Then I ran a few tests but this one ran for over its allowed time > > (1800 seconds and was going over 50 minutes...) and was up to 16 Gigs > > of RAM: > > > > orterun -np 9 ./perftest-1.3c/mpptest -max_run_time 1800 -bisect - > > size 0 4096 1 -gnuplot -fname HyperTransport/ > > Global_bisect_0_4096_1.gpl > > > > I had to CTRL-\ the process as CTRL-C wasn't sufficient.
2 mpptest > > processes and 1 orterun process were using 100% CPU out of the 16 > > cores. > > > > If any of this can be indicative of an OpenMPI bug and if I can > > help in tracking it down, don't hesitate to ask for details. > > > > And, finally, Anthony, thanks for the MPICC and --with-mpich > > pointers, I will try those to simplify the build process! > > > > Eric > > > > On Thursday, February 15, 2007 at 19:51, Anthony Chan wrote: > >> > >> As long as mpicc is working, try configuring mpptest as > >> > >> mpptest/configure MPICC=/bin/mpicc > >> > >> or > >> > >> mpptest/configure --with-mpich= > >> > >> A.Chan > >> > >> On Thu, 15 Feb 2007, Eric Thibodeau wrote: > >> > >>> Hi Jeff, > >>> > >>> Thanks for your response, I eventually figured it out; here is the > >>> only way I got mpptest to compile: > >>> > >>> export LD_LIBRARY_PATH="$HOME/openmpi_`uname -m`/lib" > >>> CC="$HOME/openmpi_`uname -m`/bin/mpicc" ./configure --with- > >>> mpi="$HOME/openmpi_`uname -m`" > >>> > >>> And, yes, I know I should use the mpicc wrapper and all (I do > >>> RTFM :P ) but > >>> mpptest is less than cooperative and hasn't been updated lately > >>> AFAIK. > >>> > >>> I'll keep you posted on some results as I get them (testing > >>> TCP/IP as well as the HyperTransport on a Tyan Beast). Up to now, > >>> LAM-MPI > >>> seems less efficient at async communications and shows no > >>> improvements > >>> with persistent communications under TCP/IP. OpenMPI, on the > >>> other hand, > >>> seems more efficient using persistent communications when in a > >>> HyperTransport (shmem) environment... I know I am crossing many test > >>> boundaries but I will post some PNGs of my results (as well as how > >>> I got to > >>> them ;) > >>> > >>> Eric > >>> > >>> On Thu, 15 Feb 2007, Jeff Squyres wrote: > >>> > >>>> I think you want to add $HOME/openmpi_`uname -m`/lib to your > >>>> LD_LIBRARY_PATH.
This should allow executables created by mpicc > >>>> (or > >>>> any derivation thereof, such as extracting flags vi
Re: [OMPI users] compiling mpptest using OpenMPI
Hello devs, Thought I would let you know there seems to be a problem with 1.2b3r13112 when running the "bisection" test on a Tyan VX50 machine (the 8 DualCore model with 32 Gigs of RAM). OpenMPI was compiled with (as seen from config.log): configure:116866: running /bin/sh './configure' CFLAGS="-O3 -DNDEBUG -finline-functions -fno-strict-aliasing -pthread" CPPFLAGS=" " FFLAGS="" LDFLAGS=" " --enable-shared --disable-static --prefix=/export/livia/home/parallel/eric/openmpi_x86_64 --with-mpi=open_mpi --cache-file=/dev/null --srcdir=. MPPTEST (1.3c) was compiled with: ./configure --with-mpi=$HOME/openmpi_`uname -m` ...which, for some reason, works fine on that system that doesn't have any other MPI implementation (ie: doesn't have LAM-MPI as per this thread). Then I ran a few tests but this one ran for over its allowed time (1800 seconds and was going over 50 minutes...) and was up to 16 Gigs of RAM: orterun -np 9 ./perftest-1.3c/mpptest -max_run_time 1800 -bisect -size 0 4096 1 -gnuplot -fname HyperTransport/Global_bisect_0_4096_1.gpl I had to CTRL-\ the process as CTRL-C wasn't sufficient. 2 mpptest processes and 1 orterun process were using 100% CPU out of the 16 cores. If any of this can be indicative of an OpenMPI bug and if I can help in tracking it down, don't hesitate to ask for details. And, finally, Anthony, thanks for the MPICC and --with-mpich pointers, I will try those to simplify the build process!
Eric On Thursday, February 15, 2007 at 19:51, Anthony Chan wrote: > > As long as mpicc is working, try configuring mpptest as > > mpptest/configure MPICC=/bin/mpicc > > or > > mpptest/configure --with-mpich= > > A.Chan > > On Thu, 15 Feb 2007, Eric Thibodeau wrote: > > > Hi Jeff, > > > > Thanks for your response, I eventually figured it out; here is the > > only way I got mpptest to compile: > > > > export LD_LIBRARY_PATH="$HOME/openmpi_`uname -m`/lib" > > CC="$HOME/openmpi_`uname -m`/bin/mpicc" ./configure > > --with-mpi="$HOME/openmpi_`uname -m`" > > > > And, yes, I know I should use the mpicc wrapper and all (I do RTFM :P ) but > > mpptest is less than cooperative and hasn't been updated lately AFAIK. > > > > I'll keep you posted on some results as I get them (testing > > TCP/IP as well as the HyperTransport on a Tyan Beast). Up to now, LAM-MPI > > seems less efficient at async communications and shows no improvements > > with persistent communications under TCP/IP. OpenMPI, on the other hand, > > seems more efficient using persistent communications when in a > > HyperTransport (shmem) environment... I know I am crossing many test > > boundaries but I will post some PNGs of my results (as well as how I got to > > them ;) > > > > Eric > > > > On Thu, 15 Feb 2007, Jeff Squyres wrote: > > > > > I think you want to add $HOME/openmpi_`uname -m`/lib to your > > > LD_LIBRARY_PATH. This should allow executables created by mpicc (or > > > any derivation thereof, such as extracting flags via showme) to find > > > the Right shared libraries. > > > > > > Let us know if that works for you. > > > > > > FWIW, we do recommend using the wrapper compilers over extracting the > > > flags via --showme whenever possible (it's just simpler and should do > > > what you need). > > > > > > > > > On Feb 15, 2007, at 3:38 PM, Eric Thibodeau wrote: > > > > > > > Hello all, > > > > > > > > > > > > I have been attempting to compile mpptest on my nodes in vain.
Here > > > > is my current setup: > > > > > > > > > > > > Openmpi is in "$HOME/openmpi_`uname -m`" which translates to "/ > > > > export/home/eric/openmpi_i686/". I tried the following approaches > > > > (you can see some of these were out of desperation): > > > > > > > > > > > > CFLAGS=`mpicc --showme:compile` LDFLAGS=`mpicc --showme:link` ./ > > > > configure > > > > > > > > > > > > Configure fails on: > > > > > > > > checking whether the C compiler works... configure: error: cannot > > > > run C compiled programs. > > > > > > > > > > > > The log shows that: > > > > > > > > ./a.out: error while loading shared libraries: liborte.so.0: cannot > > > > open shared object file: No such file or directory > > > > > > > > > > > > > > > > CC="/export/home/eric/openmpi_
Re: [OMPI users] x86_64 head with x86 diskless nodes, Node execution fails with SEGV_MAPERR
Thanks, now it all makes more sense to me. I'll try the hard way: multiple builds for multiple envs ;) Eric On Sunday, July 16, 2006 at 18:21, Brian Barrett wrote: > On Jul 16, 2006, at 4:13 PM, Eric Thibodeau wrote: > > Now that I have that out of the way, I'd like to know how I am > > supposed to compile my apps so that they can run on a heterogeneous > > network with mpi. Here is an example: > > > > kyron@headless ~/1_Files/1_ETS/1_Maitrise/MGL810/Devoir2 $ mpicc -L/usr/X/lib -lm -lX11 -O3 mandelbrot-mpi.c -o mandelbrot-mpi > > > > kyron@headless ~/1_Files/1_ETS/1_Maitrise/MGL810/Devoir2 $ mpirun --hostfile hostlist -np 3 ~/1_Files/1_ETS/1_Maitrise/MGL810/Devoir2/mandelbrot-mpi > > > > -- > > > > Could not execute the executable "/home/kyron/1_Files/1_ETS/1_Maitrise/MGL810/Devoir2/mandelbrot-mpi": Exec format error > > > > This could mean that your PATH or executable name is wrong, or that > > you do not > > have the necessary permissions. Please ensure that the executable > > is able to be > > found and executed. > > > > -- > > > > As can be seen with the uname -a that was run previously, I have 2 > > "local nodes" on the x86_64 and two i686 nodes. I tried to find > > examples in the Doc on how to compile applications correctly for > > such a setup without compromising performance but I came up short of > > an example. > From the sound of it, you have a heterogeneous configuration -- some > nodes are x86_64 and some are x86. Because of this, you either have > to compile your application twice, once for each platform, or compile > your application for the lowest common denominator. My guess would > be that it is easier and more foolproof if you compiled everything in 32-bit > mode. If you run in a mixed mode, using application schemas (see > the mpirun man page) will be the easiest way to make things work. > > Brian > -- Eric Thibodeau Neural Bucket Solutions Inc. T. (514) 736-1436 C. (514) 710-0517
Re: [OMPI users] x86_64 head with x86 diskless nodes, Node execution fails with SEGV_MAPERR
/me blushes in shame, it would seem that all I needed to do since the beginning was to run a make distclean. I apparently had some old compiled files lying around. Now I get: kyron@headless ~/1_Files/1_ETS/1_Maitrise/MGL810/Devoir2 $ mpirun --hostfile hostlist -np 4 uname -a Linux headless 2.6.17-ck1-r1 #1 SMP Tue Jul 11 16:39:18 EDT 2006 x86_64 AMD Opteron(tm) Processor 244 GNU/Linux Linux headless 2.6.17-ck1-r1 #1 SMP Tue Jul 11 16:39:18 EDT 2006 x86_64 AMD Opteron(tm) Processor 244 GNU/Linux Linux node0 2.6.16-gentoo-r7 #5 Tue Jul 11 12:30:41 EDT 2006 i686 AMD Athlon(TM) XP 2500+ GNU/Linux Linux node1 2.6.16-gentoo-r7 #5 Tue Jul 11 12:30:41 EDT 2006 i686 AMD Athlon(TM) XP 2500+ GNU/Linux Which is correct. Sorry for the misfire, I hadn't thought of cleaning up the compilation dir... Now that I have that out of the way, I'd like to know how I am supposed to compile my apps so that they can run on a heterogeneous network with MPI. Here is an example: kyron@headless ~/1_Files/1_ETS/1_Maitrise/MGL810/Devoir2 $ mpicc -L/usr/X/lib -lm -lX11 -O3 mandelbrot-mpi.c -o mandelbrot-mpi kyron@headless ~/1_Files/1_ETS/1_Maitrise/MGL810/Devoir2 $ mpirun --hostfile hostlist -np 3 ~/1_Files/1_ETS/1_Maitrise/MGL810/Devoir2/mandelbrot-mpi -- Could not execute the executable "/home/kyron/1_Files/1_ETS/1_Maitrise/MGL810/Devoir2/mandelbrot-mpi": Exec format error This could mean that your PATH or executable name is wrong, or that you do not have the necessary permissions. Please ensure that the executable is able to be found and executed. -- As can be seen with the uname -a that was run previously, I have 2 "local nodes" on the x86_64 and two i686 nodes. I tried to find examples in the Doc on how to compile applications correctly for such a setup without compromising performance but I came short of an example. 
Thanks, Eric PS: I know..maybe I should start another thread ;) Le dimanche 16 juillet 2006 14:31, Brian Barrett a écrit : > On Jul 15, 2006, at 2:58 PM, Eric Thibodeau wrote: > > But, for some reason, on the Athlon node (in their image on the > > server I should say) OpenMPI still doesn't seem to be built > > correctly since it crashes as follows: > > > > > > kyron@node0 ~ $ mpirun -np 1 uptime > > > > Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR) > > > > Failing at addr:(nil) > > > > [0] func:/home/kyron/openmpi_i686/lib/libopal.so.0 [0xb7f6258f] > > > > [1] func:[0xe440] > > > > [2] func:/home/kyron/openmpi_i686/lib/liborte.so.0(orte_init_stage1 > > +0x1d7) [0xb7fa0227] > > > > [3] func:/home/kyron/openmpi_i686/lib/liborte.so.0(orte_system_init > > +0x23) [0xb7fa3683] > > > > [4] func:/home/kyron/openmpi_i686/lib/liborte.so.0(orte_init+0x5f) > > [0xb7f9ff7f] > > > > [5] func:mpirun(orterun+0x255) [0x804a015] > > > > [6] func:mpirun(main+0x22) [0x8049db6] > > > > [7] func:/lib/tls/libc.so.6(__libc_start_main+0xdb) [0xb7de8f0b] > > > > [8] func:mpirun [0x8049d11] > > > > *** End of error message *** > > > > Segmentation fault > > > > > > The crash happens both in the chrooted env and on the nodes. I > > configured both systems to have Linux and POSIX threads, though I > > see openmpi is calling the POSIX version (a message on the mailling > > list had hinted on keeping the Linux threads around...I have to > > anyways since sone apps like Matlab extensions still depend on > > this...). The following is the output for the libc info. > > That's interesting... We regularly build Open MPI on 32 bit Linux > machines (and in 32 bit mode on Opteron machines) without too much > issue. It looks like we're jumping into a NULL pointer, which > generally means that a ORTE framework failed to initialize itself > properly. It would be useful if you could rebuild with debugging > symbols (just add -g to CFLAGS when configuring) and run mpirun in > gdb. 
If we can determine where the error is occurring, that would > definitely help in debugging your problem. > > Brian > > -- Eric Thibodeau Neural Bucket Solutions Inc. T. (514) 736-1436 C. (514) 710-0517
[OMPI users] x86_64 head with x86 diskless nodes, Node execution fails with SEGV_MAPERR
Hello all, I've been trying to set up a small test cluster with a dual Opteron head and Athlon nodes. My environment in both cases is Gentoo and the nodes boot off PXE using an image built and stored on the master node. I chroot into the node's environment using: linux32 chroot ${ROOT} /bin/bash to cross over the 64/32bit barrier. My user's home directory is loop-mounted into that environment and NFS exported to the nodes. I build OpenMPI in the following way: In the build folder of OpenMPI-1.1: ./configure --cache-file=config_`uname -m`.cache --enable-pretty-print-stacktrace --prefix=$HOME/openmpi_`uname -m` make -j4 && make install I perform this exact same command in the Opteron and chrooted environment for the Athlon machines. This then gives me the following folders in my $HOME: /home/kyron/openmpi_i686 /home/kyron/openmpi_x86_64 But, for some reason, on the Athlon node (in their image on the server I should say) OpenMPI still doesn't seem to be built correctly since it crashes as follows: kyron@node0 ~ $ mpirun -np 1 uptime Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR) Failing at addr:(nil) [0] func:/home/kyron/openmpi_i686/lib/libopal.so.0 [0xb7f6258f] [1] func:[0xe440] [2] func:/home/kyron/openmpi_i686/lib/liborte.so.0(orte_init_stage1+0x1d7) [0xb7fa0227] [3] func:/home/kyron/openmpi_i686/lib/liborte.so.0(orte_system_init+0x23) [0xb7fa3683] [4] func:/home/kyron/openmpi_i686/lib/liborte.so.0(orte_init+0x5f) [0xb7f9ff7f] [5] func:mpirun(orterun+0x255) [0x804a015] [6] func:mpirun(main+0x22) [0x8049db6] [7] func:/lib/tls/libc.so.6(__libc_start_main+0xdb) [0xb7de8f0b] [8] func:mpirun [0x8049d11] *** End of error message *** Segmentation fault The crash happens both in the chrooted env and on the nodes. 
I configured both systems to have Linux and POSIX threads, though I see openmpi is calling the POSIX version (a message on the mailing list had hinted at keeping the Linux threads around...I have to anyways since some apps like Matlab extensions still depend on this...). The following is the output for the libc info. kyron@headless ~ $ /lib/tls/libc.so.6 GNU C Library stable release version 2.3.6, by Roland McGrath et al. Copyright (C) 2005 Free Software Foundation, Inc. This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. Compiled by GNU CC version 4.1.1 (Gentoo 4.1.1). Compiled on a Linux 2.6.11 system on 2006-07-14. Available extensions: GNU libio by Per Bothner crypt add-on version 2.1 by Michael Glad and others Native POSIX Threads Library by Ulrich Drepper et al The C stubs add-on version 2.1.2. GNU Libidn by Simon Josefsson BIND-8.2.3-T5B NIS(YP)/NIS+ NSS modules 0.19 by Thorsten Kukuk Thread-local storage support included. For bug reporting instructions, please see: <http://www.gnu.org/software/libc/bugs.html>. I am attaching the config.log and ompi_info for both platforms. Before sending this e-mail I tried compiling OpenMPI on one of the nodes (booted off the image) and I am getting the exact same problem (so chroot vs local build doesn't seem to be a factor). The attached file contains: config.log.x86_64 <--config log for the Opteron build (works locally) config.log_node0<--config log for the Athlon build (on the node) ompi_info.i686 <--ompi_info on the Athlon node ompi_info.x86_64<--ompi_info on the Opteron Master Thanks, -- Eric Thibodeau Neural Bucket Solutions Inc. T. (514) 736-1436 C. (514) 710-0517 ENV_info.tbz Description: application/tbz
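Since both builds above install under $HOME/openmpi_`uname -m`, one way to keep a single shell setup working across the mixed x86_64/i686 nodes is to resolve the prefix at login time on each node. A minimal sketch, assuming the per-architecture prefixes from the configure line quoted above:

```shell
# Pick the Open MPI build matching this node's architecture
# (x86_64 on the head, i686 on the Athlon nodes):
OMPI_PREFIX="$HOME/openmpi_$(uname -m)"
export PATH="$OMPI_PREFIX/bin:$PATH"
export LD_LIBRARY_PATH="$OMPI_PREFIX/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
echo "$OMPI_PREFIX"
```

Dropped into a shell startup file sourced on every node, this sidesteps the problem of a single hard-coded --prefix being wrong for half the cluster.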
Re: [OMPI users] Tutorial
www.clustermonkey.net is a very good place to start, click on the "Columns" section in the "Main Menu" in the left pane. Le mardi 11 juillet 2006 07:25, Tony Power a écrit : > Hi! > Where can I find a introductory tutorial on open-mpi? > Thank you ;) > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users > -- Eric Thibodeau Neural Bucket Solutions Inc. T. (514) 736-1436 C. (514) 710-0517
Re: [OMPI users] MPI_Recv, is it possible to switch on/off aggresive mode during runtime?
Although it will add some overhead, have you tried using MPI_Probe before calling MPI_Recv? I am curious to know if the Probe is less CPU intensive than a direct call to MPI_Recv. An example of how I use it: MPI_Probe(MPI_ANY_SOURCE,MPI_ANY_TAG,MPI_COMM_WORLD,&status); MPI_Recv(DispBuff,height,MPI_UNSIGNED_LONG,status.MPI_SOURCE,status.MPI_TAG,MPI_COMM_WORLD,&status); (This is used to receive known data from an unknown source) Eric Le mercredi 5 juillet 2006 10:54, Marcin Skoczylas a écrit : > Dear open-mpi users, > > I saw some posts ago almost the same question as I have, but it didn't > give me a satisfactory answer. > I have setup like this: > > GUI program on some machine (f.e. laptop) > Head listening on tcpip socket for commands from GUI. > Workers waiting for commands from Head / processing the data. > > And now it's problematic. For passing the commands from Head I'm using: > while(true) > { > MPI_Recv... > > do whatever head said (process small portion of the data, return > result to head, wait for another commands) > } > > So in the idle time workers are stuck in MPI_Recv and have 100% CPU > usage, even if they are just waiting for the commands from Head. > Normally, I would not prefer to have this situation as I sometimes have > to share the cluster with others. I would prefer not to stop whole mpi > program, but just go into 'idle' mode, and thus make it run again soon. > Also I would like to have this aggressive MPI_Recv approach switched on > when I'm alone on the cluster. So is it possible somehow to switch this > mode on/off during runtime? Thank you in advance! > > greetings, Marcin > > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users > -- Eric Thibodeau Neural Bucket Solutions Inc. T. (514) 736-1436 C. (514) 710-0517
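A blocking MPI_Probe will typically busy-poll just like MPI_Recv, so for Marcin's "idle mode" a common pattern is the nonblocking MPI_Iprobe combined with a short sleep. This is a hedged sketch only: the buffer name, count, and tag conventions are placeholders borrowed from the thread, and the 1 ms sleep is an arbitrary latency/CPU trade-off to tune.

```c
/* Low-CPU idle receive: poll with MPI_Iprobe and sleep between polls
 * instead of spinning inside a blocking MPI_Recv. */
#include <mpi.h>
#include <unistd.h>   /* usleep */

void receive_when_idle(unsigned long *DispBuff, int height)
{
    MPI_Status status;
    int flag = 0;

    /* Check for a pending message without blocking; if none, yield
     * the CPU briefly and check again. */
    while (1) {
        MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD,
                   &flag, &status);
        if (flag)
            break;
        usleep(1000);   /* ~1 ms of added latency for near-0% idle CPU */
    }

    /* A matching message is now queued, so this MPI_Recv returns
     * promptly instead of spinning. */
    MPI_Recv(DispBuff, height, MPI_UNSIGNED_LONG, status.MPI_SOURCE,
             status.MPI_TAG, MPI_COMM_WORLD, &status);
}
```

Switching "aggressive" mode back on at runtime is then just a matter of skipping the usleep (or shrinking it to zero) when the worker knows it has the cluster to itself.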
Re: [OMPI users] Can I install OpenMPI on a machine where I have mpich2
The only reference I have at the moment (technical article in french). http://www.manitou.uqam.ca/manitou.dll?lire+recherche+_DEFAUT+format+html+expression+%2340786002:2 I strongly recommend scanning IEEE on the subject though and checking out the Beowulf mailing list. Eric Le lundi 3 juillet 2006 23:40, Manal Helal a écrit : > Hi Eric > > Thank you very much for your reply. > > I am a PhD student, and I do need this comparison for academic purposes; > a fairly generic one will do, and I guess after running on both, I might > have my own application/hardware specific points to add, > > Thanks again, I appreciate it, > > Manal > > On Mon, 2006-07-03 at 23:17 -0400, Eric Thibodeau wrote: > > See comments below: > > > > Le lundi 3 juillet 2006 23:01, Manal Helal a écrit : > > > Hi > > > > > > I am having problems running a multi-threaded applications using MPICH > > > 2, and considering moving to OpenMPI. I already have mpich2 installed, > > > and don't want to uninstall as yet. Can I have both installed and works > > > fine on the same machine? > > Yes, simply run the configure script with something like: > > > > ./configure --prefix=$HOME/openmpi-`uname -m` > > > > You will then be able to compile applications with: > > > > ~/openmpi-i686/bin/mpicc app.c -o app > > > > And run them with: > > > > ~/openmpi-i686/bin/mpirun -np 3 app > > > > > Also, I searched for a comparison of features of mpich vs lammpi vs > > > openmpi and didn't find any so far. Will you please help me find one? > > > > Comparison is only relevant on your hardware with you application. Any > > other comparison are mostly for academic purposes and grand assignments ;) > > > > > Thank you for your help in advance, > > > > > > Regards, > > > > > > Manal > > > -- Eric Thibodeau Neural Bucket Solutions Inc. T. (514) 736-1436 C. (514) 710-0517
Re: [OMPI users] Re : OpenMPI 1.1: Signal:10, info.si_errno:0(Unknown, error: 0), si_code:1(BUS_ADRALN)
I am actually running the released 1.1. I can send you my code, if you want, and you could try running it off a single node with -np 4 or 5 (oversubscribing) and see if you get a BUS_ADRALN error off one node. The only restriction to compiling the code is that X libs be available (display is not required for the execution though it's more fun :P) Eric Le mercredi 28 juin 2006 13:02, Terry D. Dontje a écrit : > Well, I've been using the trunk and not 1.1. I also just built > 1.1.1a1r10538 and ran > it with no bus error. Though you are running 1.1b5r10421 so we're not > running the > same thing, as of yet. > > I have a cluster of two v440 that have 4 cpus each running Solaris 10. > The tests I > am running are np=2 one process on each node. > > --td > > Eric Thibodeau wrote: > > >Terry, > > > > I was about to comment on this. Could you tell me the specs of your > > machine? As you will notice in "my thread", I am running into problems on > > Sparc SMP systems where the CPU boards' RTCs are in a doubtful state. > > Are you running 1.1 on SMP machines? If so, on how many procs and what > > hardware/OS version is this running off? > > > >ET > > > >Le mercredi 28 juin 2006 10:35, Terry D. Dontje a écrit : > > > > > >>Frank, > >> > >>Can you set your limit coredumpsize to non-zero rerun the program > >>and then get the stack via dbx? > >> > >>So, I have a similar case of BUS_ADRALN on SPARC systems with an > >>older version (June 21st) of the trunk. I've since run using the latest > >>trunk and the > >>bus went away. I am now going to try this out with v1.1 to see if I get > >>similar > >>results. Your stack would help me try and determine if this is an > >>OpenMPI issue > >>or possibly some type of platform problem. > >> > >>There is another thread with Eric Thibodeau that I am unsure if it is > >>the same issue > >>as either of our situation. > >> > >>--td > >> > >> > >[...snip...] > > > > > -- Eric Thibodeau Neural Bucket Solutions Inc. T. (514) 736-1436 C. 
(514) 710-0517
Re: [OMPI users] users Digest, Vol 317, Issue 4
The problem was resolved in the 1.1 series...so I didn't push any further. Thanks! Le mercredi 28 juin 2006 09:21, openmpi-user a écrit : > Hi Eric (and all), > > don't know if this really messes things up, but you have set up lam-mpi > in your path-variables, too: > > [enterprise:24786] pls:rsh: reset LD_LIBRARY_PATH: > /export/lca/home/lca0/etudiants/ac38820/openmpi_sun4u/lib:/export/lca/appl/Forte/SUNWspro/WS6U2/lib:/usr/local/lib:*/usr/local/lam-mpi/7.1.1/lib*:/opt/sfw/lib > > > Yours, > Frank > > users-requ...@open-mpi.org wrote: > Send users mailing list submissions to > > us...@open-mpi.org > > > > To subscribe or unsubscribe via the World Wide Web, visit > > http://www.open-mpi.org/mailman/listinfo.cgi/users > > or, via email, send a message with subject or body 'help' to > > users-requ...@open-mpi.org > > > > You can reach the person managing the list at > > users-ow...@open-mpi.org > > > > When replying, please edit your Subject line so it is more specific > > than "Re: Contents of users digest..." > > > > > > Today's Topics: > > > >1. Re: Installing OpenMPI on a solaris (Jeff Squyres (jsquyres)) > > > > > > -- > > > > Message: 1 > > Date: Wed, 28 Jun 2006 08:56:36 -0400 > > From: "Jeff Squyres \(jsquyres\)" <jsquy...@cisco.com> > > Subject: Re: [OMPI users] Installing OpenMPI on a solaris > > To: "Open MPI Users" <us...@open-mpi.org> > > Message-ID: > > <c835b9c9cb0f1c4e9da48913c9e8f8afae9...@xmb-rtp-215.amer.cisco.com> > > Content-Type: text/plain; charset="iso-8859-1" > > > > Bummer! :-( > > > > Just to be sure -- you had a clean config.cache file before you ran > > configure, right? (e.g., the file didn't exist -- just to be sure it > > didn't get potentially erroneous values from a previous run of configure) > > Also, FWIW, it's not necessary to specify --enable-ltdl-convenience; that > > should be automatic. 
> > > > If you had a clean configure, we *suspect* that this might be due to > > alignment issues on Solaris 64 bit platforms, but thought that we might > > have had a pretty good handle on it in 1.1. Obviously we didn't solve > > everything. Bonk. > > > > Did you get a corefile, perchance? If you could send a stack trace, that > > would be most helpful. > > > > > > > > > > From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On > > Behalf Of Eric Thibodeau > > Sent: Tuesday, June 20, 2006 8:36 PM > > To: us...@open-mpi.org > > Subject: Re: [OMPI users] Installing OpenMPI on a solaris > > > > > > > > Hello Brian (and all), > > > > > > > > Well, the joy was short lived. On a 12 CPU Enterprise machine and on a > > 4 CPU one, I seem to be able to start up to 4 processes. Above 4, I seem to > > inevitably get BUS_ADRALN (Bus collisions?). Below are some traces of the > > failling runs as well as a detailed (mpirun -d) of one of these situations > > and ompi_info output. Obviously, don't hesitate to ask if more information > > is requred. 
> > > > > > > > Buid version: openmpi-1.1b5r10421 > > > > Config parameters: > > > > Open MPI config.status 1.1b5 > > > > configured by ./configure, generated by GNU Autoconf 2.59, > > > > with options \"'--cache-file=config.cache' 'CFLAGS=-mcpu=v9' > > 'CXXFLAGS=-mcpu=v9' 'FFLAGS=-mcpu=v9' > > '--prefix=/export/lca/home/lca0/etudiants/ac38820/openmp > > > > i_sun4u' --enable-ltdl-convenience\" > > > > > > > > The traces: > > > > sshd@enterprise ~/1_Files/1_ETS/1_Maitrise/MGL810/Devoir2 $ > > ~/openmpi_sun4u/bin/mpirun -np 10 mandelbrot-mpi 100 400 400 > > > > Signal:10 info.si_errno:0(Error 0) si_code:1(BUS_ADRALN) > > > > Failing at addr:2f4f04 > > > > *** End of error message *** > > > > sshd@enterprise ~/1_Files/1_ETS/1_Maitrise/MGL810/Devoir2 $ > > ~/openmpi_sun4u/bin/mpirun -np 8 mandelbrot-mpi 100 400 400 > > > > Signal:10 info.si_errno:0(Error 0) si_code:1(BUS_ADRALN) > > > > Failing at addr:2b354c > > > > *** End of error message *** > > > > sshd@enterprise ~/1_Files/1_ETS/1_Maitrise/MGL810/Devoir2 $ > > ~/openmpi
Re: [OMPI users] Installing OpenMPI on a solaris
Yeah bummers, but something tells me it might not be OpenMPI's fault. Here's why: 1- The tech that takes care of these machines told me that he gets RTC errors on bootup (the cpu boards are apparently "out of sync" since the clocks aren't set correctly). 2- There is also a possibility that the prior admin did not put in a "stable" firmware version. So if any Sun guru can help out by telling me which command to use, or point me to a quick HOWTO for resolving these clock issues, it would be greatly appreciated (our analyst is overloaded and he would not be able to justify the 3 days of reading up docs just to satisfy my running parallel code problems ;P) 3- I realised that the OS is not booted in 64-bit O_o!! (not that this has to do with OpenMPI bombing): Jun 21 07:45:15 unknown genunix: [ID 540533 kern.notice] ^MSunOS Release 5.8 Version Generic_108528-29 32-bit Jun 21 07:45:15 unknown NOTICE: 64-bit OS installed, but the 32-bit OS is the default Jun 21 07:45:15 unknown Booting the 32-bit OS ... 4- LAM-MPI 7.1.1 also bombs, but it does so at a much higher processor count (OpenMPI bombs at 5, LAM-MPI bombs around 10, but it varies). As for the questions regarding the OpenMPI build, I just recently built 1.1 with the same basic configure options with the exact same results (clean cache). So, I guess this one is on pause until I have the confirmation that the clocks on the processor boards are set correctly. There is one thing that bothers me though, one of the machines has only 1 processor board (4 procs) and I still get the error on that machine if I go over 4 processes...how can a board be out of sync with itself?? Eric PS: I am at liberty of providing the source code if anyone wants it. Le mercredi 28 juin 2006 08:56, Jeff Squyres (jsquyres) a écrit : > Bummer! :-( > > Just to be sure -- you had a clean config.cache file before you ran > configure, right? 
(e.g., the file didn't exist -- just to be sure it didn't > get potentially erroneous values from a previous run of configure) Also, > FWIW, it's not necessary to specify --enable-ltdl-convenience; that should be > automatic. > > If you had a clean configure, we *suspect* that this might be due to > alignment issues on Solaris 64 bit platforms, but thought that we might have > had a pretty good handle on it in 1.1. Obviously we didn't solve everything. > Bonk. > > Did you get a corefile, perchance? If you could send a stack trace, that > would be most helpful. > > [...snip...]
Re: [OMPI users] Installing OpenMPI on a solaris
onent v1.1) MCA timer: solaris (MCA v1.0, API v1.0, Component v1.1) MCA allocator: basic (MCA v1.0, API v1.0, Component v1.0) MCA allocator: bucket (MCA v1.0, API v1.0, Component v1.0) MCA coll: basic (MCA v1.0, API v1.0, Component v1.1) MCA coll: hierarch (MCA v1.0, API v1.0, Component v1.1) MCA coll: self (MCA v1.0, API v1.0, Component v1.1) MCA coll: sm (MCA v1.0, API v1.0, Component v1.1) MCA coll: tuned (MCA v1.0, API v1.0, Component v1.1) MCA io: romio (MCA v1.0, API v1.0, Component v1.1) MCA mpool: sm (MCA v1.0, API v1.0, Component v1.1) MCA pml: dr (MCA v1.0, API v1.0, Component v1.1) MCA pml: ob1 (MCA v1.0, API v1.0, Component v1.1) MCA bml: r2 (MCA v1.0, API v1.0, Component v1.1) MCA rcache: rb (MCA v1.0, API v1.0, Component v1.1) MCA btl: self (MCA v1.0, API v1.0, Component v1.1) MCA btl: sm (MCA v1.0, API v1.0, Component v1.1) MCA btl: tcp (MCA v1.0, API v1.0, Component v1.0) MCA topo: unity (MCA v1.0, API v1.0, Component v1.1) MCA osc: pt2pt (MCA v1.0, API v1.0, Component v1.0) MCA gpr: null (MCA v1.0, API v1.0, Component v1.1) MCA gpr: proxy (MCA v1.0, API v1.0, Component v1.1) MCA gpr: replica (MCA v1.0, API v1.0, Component v1.1) MCA iof: proxy (MCA v1.0, API v1.0, Component v1.1) MCA iof: svc (MCA v1.0, API v1.0, Component v1.1) MCA ns: proxy (MCA v1.0, API v1.0, Component v1.1) MCA ns: replica (MCA v1.0, API v1.0, Component v1.1) MCA oob: tcp (MCA v1.0, API v1.0, Component v1.0) MCA ras: dash_host (MCA v1.0, API v1.0, Component v1.1) MCA ras: hostfile (MCA v1.0, API v1.0, Component v1.1) MCA ras: localhost (MCA v1.0, API v1.0, Component v1.1) MCA rds: hostfile (MCA v1.0, API v1.0, Component v1.1) MCA rds: resfile (MCA v1.0, API v1.0, Component v1.1) MCA rmaps: round_robin (MCA v1.0, API v1.0, Component v1.1) MCA rmgr: proxy (MCA v1.0, API v1.0, Component v1.1) MCA rmgr: urm (MCA v1.0, API v1.0, Component v1.1) MCA rml: oob (MCA v1.0, API v1.0, Component v1.1) MCA pls: fork (MCA v1.0, API v1.0, Component v1.1) MCA pls: rsh (MCA v1.0, API v1.0, 
Component v1.1) MCA sds: env (MCA v1.0, API v1.0, Component v1.1) MCA sds: pipe (MCA v1.0, API v1.0, Component v1.1) MCA sds: seed (MCA v1.0, API v1.0, Component v1.1) MCA sds: singleton (MCA v1.0, API v1.0, Component v1.1) Le mardi 20 juin 2006 17:06, Eric Thibodeau a écrit : > Thanks for the pointer, it WORKS!! (yay) > > Le mardi 20 juin 2006 12:21, Brian Barrett a écrit : > > On Jun 19, 2006, at 12:15 PM, Eric Thibodeau wrote: > > > > > I checked the thread with the same title as this e-mail and tried > > > compiling openmpi-1.1b4r10418 with: > > > > > > ./configure CFLAGS="-mv8plus" CXXFLAGS="-mv8plus" FFLAGS="-mv8plus" > > > FCFLAGS="-mv8plus" --prefix=$HOME/openmpi-SUN-`uname -r` --enable- > > > pretty-print-stacktrace > > I put the incorrect flags in the error message - can you try again with: > > > > > >./configure CFLAGS=-mcpu=v9 CXXFLAGS=-mcpu=v9 FFLAGS=-mcpu=v9 > > FCFLAGS=-mcpu=v9 --prefix=$HOME/openmpi-SUN-`uname -r` --enable- > > pretty-print-stacktrace > > > > > > and see if that helps? By the way, I'm not sure if Solaris has the > > required support for the pretty-print stack trace feature. It likely > > will print what signal caused the error, but will not actually print > > the stack trace. It's enabled by default on Solaris, with this > > limited functionality (the option exists for platforms that have > > broken half-support for GNU libc's stack trace feature, and for users > > that don't like us registering a signal handler to do the work). > > > > Brian > > > > > -- Eric Thibodeau Neural Bucket Solutions Inc. T. (514) 736-1436 C. (514) 710-0517
Re: [OMPI users] pls:rsh: execv failed with errno=2
Hello Jeff, Firstly, don't worry about jumping in late, I'll send you a skid rope ;) Secondly, thanks for your nice little articles on clustermonkey.net (good refresher on MPI). And finally, down to my issues, thanks for clearing up the --prefix LD_LIBRARY_PATH and all. The ebuild I made/mangled for Openmpi under Gentoo was modified by some of the devs to follow some of the lib vs lib64 reqs. I might change them to be identical (only $PREFIX/lib) across platforms since multi-arch MPI will be hell to get working with a changing LD_LIBRARY_PATH. After some recommendations, I tried openmpi-1.1b3r10389 on the AMD64 arch and got my MPI app running on that single dual Opteron node, I still have to figure out the --prefix/PATH/LD_LIBRARY_PATH mess to get the app to spawn across that dual Opteron node and 2 single Athlon nodes (cross arch with the varying LD_LIBRARY_PATH). But that's another issue for the moment (a bit of fiddling on my side to get orte to be recognized on the nodes) As for the sparc-sun-solaris2.8 , I tried compiling openmpi-1.1b3r10389 but it bombs with both gcc and the SUN cc: Making all in asm source='asm.c' object='asm.lo' libtool=yes \ DEPDIR=.deps depmode=none /bin/bash ../.././config/depcomp \ /bin/bash ../../libtool --tag=CC --mode=compile /export/lca/appl/Forte/SUNWspro/WS6U2/bin/cc -DHAVE_CONFIG_H -I. -I. -I../../opal/include -I../../orte/include -I../../ompi/include -I../../ompi/include -I../.. -O -DNDEBUG -mt -c -o asm.lo asm.c /export/lca/appl/Forte/SUNWspro/WS6U2/bin/cc -DHAVE_CONFIG_H -I. -I. -I../../opal/include -I../../orte/include -I../../ompi/include -I../../ompi/include -I../.. -O -DNDEBUG -mt -c asm.c -KPIC -DPIC -o .libs/asm.o "../../opal/include/opal/sys/atomic.h", line 486: #error: Atomic arithmetic on pointers not supported cc: acomp failed for asm.c *** Error code 1 I was told by one of the system's admins that the SUN Enterprise machine (12 proc) has "special" considerations when using semaphores (it's hardware implemented O_o! 
), I'm only mentioning this due to the error message (Atomic arithmetic ...) So, I got half my problem resolved with the upgrade, any suggestions for compiling OpenMPI on this _old_ but very educational SMP machine? Eric Le vendredi 16 juin 2006 17:32, Jeff Squyres (jsquyres) a écrit : > Sorry for jumping in late... > > The /lib vs. /lib64 thing as part of --prefix was definitely broken until > recently. This behavior has been fixed in the 1.1 series. Specifically, > OMPI will take the prefix that you provided and append the basename of the > local $libdir. So if you configured OMPI with something like: > > shell$ ./configure --libdir=/some/path/lib64 ... > > And then you run: > > shell$ mpirun --prefix /some/path ... > > Then OMPI will add /some/path/lib64 to the remote LD_LIBRARY_PATH. The > previous behavior would always add "/lib" to the remote LD_LIBRARY_PATH, > regardless of what the local $libdir was (i.e., it ignored the basename of > your $libdir). > > If you have a situation more complicated than this (e.g., your $libdir is > different than your prefix by more than just the basename), then --prefix is > not the solution for you. Instead, you'll need to set your $PATH and > $LD_LIBRARY_PATH properly on all nodes (e.g., in your shell startup files). > Specifically, --prefix is meant to be an easy workaround for common > configurations where $libdir is a subdirectory under $prefix. > > Another random note: invoking mpirun with an absolute path (e.g., > /path/to/bin/mpirun) is exactly the same as specifying --prefix /path/to -- > so you don't have to do both. > > [..SNIP..]
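The 1.1-series --prefix behavior Jeff describes boils down to one line of path arithmetic, sketched here with his own example paths (`/some/path` and `/some/path/lib64` are illustrative, not real installs):

```shell
# Open MPI 1.1 appends the basename of the *local* $libdir to the
# prefix given on the command line when building the remote
# LD_LIBRARY_PATH:
libdir='/some/path/lib64'   # what OMPI was configured with (--libdir)
prefix='/some/path'         # what you pass to: mpirun --prefix /some/path
remote_ldpath="${prefix}/$(basename "$libdir")"
echo "$remote_ldpath"       # -> /some/path/lib64
```

This also makes Jeff's caveat concrete: if $libdir differs from $prefix by more than its last path component, no choice of --prefix can reproduce it remotely, and the environment has to be set in the shell startup files instead.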
Re: [OMPI users] pls:rsh: execv failed with errno=2
Hello, I don't want to get too much off topic in this reply but you're bringing out a point here. I am unable to run mpi apps on the AMD64 platform with the regular exporting of $LD_LIBRARY_PATH and $PATH, this is why I have no choice but to revert to using the --prefix approach. Here are a few execution examples to demonstrate my point: kyron@headless ~ $ /usr/lib64/openmpi/1.0.2-gcc-4.1/bin/mpirun --prefix /usr/lib64/openmpi/1.0.2-gcc-4.1/ -np 2 ./a.out ./a.out: error while loading shared libraries: libmpi.so.0: cannot open shared object file: No such file or directory kyron@headless ~ $ /usr/lib64/openmpi/1.0.2-gcc-4.1/bin/mpirun --prefix /usr/lib64/openmpi/1.0.2-gcc-4.1/lib64/ -np 2 ./a.out [headless:10827] pls:rsh: execv failed with errno=2 [headless:10827] ERROR: A daemon on node localhost failed to start as expected. [headless:10827] ERROR: There may be more information available from [headless:10827] ERROR: the remote shell (see above). [headless:10827] ERROR: The daemon exited unexpectedly with status 255. kyron@headless ~ $ cat opmpi64.sh #!/bin/bash MPI_BASE='/usr/lib64/openmpi/1.0.2-gcc-4.1' export PATH=$PATH:${MPI_BASE}/bin LD_LIBRARY_PATH=${MPI_BASE}/lib64 kyron@headless ~ $ . opmpi64.sh kyron@headless ~ $ mpirun -np 2 ./a.out ./a.out: error while loading shared libraries: libmpi.so.0: cannot open shared object file: No such file or directory kyron@headless ~ $ Eric Le vendredi 16 juin 2006 10:31, Pak Lui a écrit : > Hi, I noticed your prefix set to the lib dir, can you try without the > lib64 part and rerun? > > Eric Thibodeau wrote: > > Hello everyone, > > > > Well, first off, I hope this problem I am reporting is of some validity, > > I tried finding similar situations off Google and the mailing list but > > came up with only one reference [1] which seems invalid in my case since > > all executions are local (naïve assumptions that it makes a difference > > on the calling stack). 
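One detail worth flagging in the opmpi64.sh script quoted above: LD_LIBRARY_PATH is assigned but never exported, so the variable stays local to the sourcing shell and is invisible to child processes such as mpirun, orted, and a.out, which may well explain the "libmpi.so.0: cannot open shared object file" failure even after sourcing the script. A corrected sketch (the MPI_BASE path is the one from the script, adjust to the actual install):

```shell
#!/bin/bash
# Hypothetical install prefix, taken from the script in the thread:
MPI_BASE='/usr/lib64/openmpi/1.0.2-gcc-4.1'
export PATH="${PATH}:${MPI_BASE}/bin"
# 'export' is the fix: without it the assignment never reaches
# child processes (mpirun, orted, the MPI app itself).
export LD_LIBRARY_PATH="${MPI_BASE}/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
# A child process must now see the library path:
sh -c 'echo "child sees: $LD_LIBRARY_PATH"'
```

With the variable exported, the plain `mpirun -np 2 ./a.out` invocation should no longer need the --prefix workaround on the local node.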
I am trying to run asimple HelloWorld using > > OpenMPI 1.0.2 on an AMD64 machine and a Sun Enterprise (12 procs) > > machine. In both cases I get the following error: > > > > pls:rsh: execv failed with errno=2 > > > > Here is the mpirun -d trace when running my HelloWorld (on AMD64): > > > > kyron@headless ~ $ mpirun -d --prefix > > /usr/lib64/openmpi/1.0.2-gcc-4.1/lib64/ -np 4 ./hello > > > > [headless:10461] procdir: (null) > > > > [headless:10461] jobdir: (null) > > > > [headless:10461] unidir: > > /tmp/openmpi-sessions-kyron@headless_0/default-universe > > > > [headless:10461] top: openmpi-sessions-kyron@headless_0 > > > > [headless:10461] tmp: /tmp > > > > [headless:10461] [0,0,0] setting up session dir with > > > > [headless:10461] tmpdir /tmp > > > > [headless:10461] universe default-universe-10461 > > > > [headless:10461] user kyron > > > > [headless:10461] host headless > > > > [headless:10461] jobid 0 > > > > [headless:10461] procid 0 > > > > [headless:10461] procdir: > > /tmp/openmpi-sessions-kyron@headless_0/default-universe-10461/0/0 > > > > [headless:10461] jobdir: > > /tmp/openmpi-sessions-kyron@headless_0/default-universe-10461/0 > > > > [headless:10461] unidir: > > /tmp/openmpi-sessions-kyron@headless_0/default-universe-10461 > > > > [headless:10461] top: openmpi-sessions-kyron@headless_0 > > > > [headless:10461] tmp: /tmp > > > > [headless:10461] [0,0,0] contact_file > > /tmp/openmpi-sessions-kyron@headless_0/default-universe-10461/universe-setup.txt > > > > [headless:10461] [0,0,0] wrote setup file > > > > [headless:10461] spawn: in job_state_callback(jobid = 1, state = 0x1) > > > > [headless:10461] pls:rsh: local csh: 0, local bash: 1 > > > > [headless:10461] pls:rsh: assuming same remote shell as local shell > > > > [headless:10461] pls:rsh: remote csh: 0, remote bash: 1 > > > > [headless:10461] pls:rsh: final template argv: > > > > [headless:10461] pls:rsh: /usr/bin/ssh orted --debug > > --bootproxy 1 --name --num_procs 2 --vpid_start 0 
--nodename > > --universe kyron@headless:default-universe-10461 --nsreplica > > "0.0.0;tcp://142.137.135.124:37657;tcp://192.168.1.1:37657" --gprreplica > > "0.0.0;tcp://142.137.135.124:37657;tcp://192.168.1.1:37657" > > --mpi-call-yield 0 > > > > [headless:10461] pls:rsh: launching on node localhost > > > > [headless:10461] pls:rsh: oversubscribed -- setting mpi_yield_when_idle > > to 1 (1 4) > > > > [headless:10461] pls:rsh: localhost is a LOCAL node > >