Re: [OMPI users] mpirun example program fails on multiple nodes - unable to launch specified application on client node
For small ad hoc COWs I'd vote for sshfs too. It may well be as slow as a dog, but it actually has some security, unlike NFS, and is a doddle to make work with no superuser access on the server, unlike NFS. On Thu, 2009-11-05 at 17:53 -0500, Jeff Squyres wrote: > On Nov 5, 2009, at 5:34 PM, Douglas Guptill wrote: > > I am currently using sshfs to mount both OpenMPI and my application on the "other" computers/nodes. The advantage to this is that I have only one copy of OpenMPI and my application. There may be a performance penalty, but I haven't seen it yet. > For a small number of nodes (where small <=32 or sometimes even <=64), I find that simple NFS works just fine. If your apps aren't IO intensive, that can greatly simplify installation and deployment of both Open MPI and your MPI applications IMNSHO. > But -- every app is different. :-) YMMV.
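For reference, the sshfs arrangement described above can be reproduced roughly as follows; the hostname "headnode" and the paths are illustrative, and sshfs/FUSE must be installed on the client:

# On each client node, as an ordinary user, mount the head node's
# Open MPI install and application directory over ssh (no superuser
# access needed on the server, as noted above):
mkdir -p ~/openmpi ~/work
sshfs headnode:/home/user/openmpi ~/openmpi
sshfs headnode:/home/user/work ~/work
# To unmount later:
fusermount -u ~/openmpi
fusermount -u ~/work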
Re: [OMPI users] mpirun example program fails on multiple nodes - unable to launch specified application on client node
On Nov 5, 2009, at 5:34 PM, Douglas Guptill wrote: I am currently using sshfs to mount both OpenMPI and my application on the "other" computers/nodes. The advantage to this is that I have only one copy of OpenMPI and my application. There may be a performance penalty, but I haven't seen it yet. For a small number of nodes (where small <=32 or sometimes even <=64), I find that simple NFS works just fine. If your apps aren't IO intensive, that can greatly simplify installation and deployment of both Open MPI and your MPI applications IMNSHO. But -- every app is different. :-) YMMV. -- Jeff Squyres jsquy...@cisco.com
Re: [OMPI users] mpirun example program fails on multiple nodes - unable to launch specified application on client node
On Thu, Nov 05, 2009 at 03:15:33PM -0600, Qing Pang wrote: > Thank you Jeff! That solves the problem. :-) You are a lifesaver! > So does that mean I always need to copy my application to all the > nodes? Or should I give the pathname of my executable in a different > way to avoid this? Do I need a network file system for that? I am currently using sshfs to mount both OpenMPI and my application on the "other" computers/nodes. The advantage to this is that I have only one copy of OpenMPI and my application. There may be a performance penalty, but I haven't seen it yet. Douglas.
Re: [OMPI users] mpirun example program fails on multiple nodes - unable to launch specified application on client node
On Nov 5, 2009, at 4:15 PM, Qing Pang wrote: Thank you Jeff! That solves the problem. :-) You are a lifesaver! So does that mean I always need to copy my application to all the nodes? Or should I give the pathname of my executable in a different way to avoid this? Do I need a network file system for that? Your executable needs to be available on all nodes, yes, whether you have copied it out there or whether you use a network filesystem. For a small number of nodes, using a network filesystem is likely much more convenient. See http://www.open-mpi.org/faq/?category=running#do-i-need-a-common-filesystem -- Jeff Squyres jsquy...@cisco.com
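As an illustration of the network-filesystem route Jeff recommends, a minimal NFS export on Ubuntu might look like the sketch below; the exported path, subnet, and the hostname "master" are illustrative:

# On the master node (as root): export the home directory
apt-get install nfs-kernel-server
echo '/home/gordon 192.168.0.0/24(rw,sync,no_subtree_check)' >> /etc/exports
exportfs -ra
# On each client node (as root): mount it at the identical path
apt-get install nfs-common
mount -t nfs master:/home/gordon /home/gordon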
Re: [OMPI users] mpirun example program fails on multiple nodes - unable to launch specified application on client node
Thank you Jeff! That solves the problem. :-) You are a lifesaver! So does that mean I always need to copy my application to all the nodes? Or should I give the pathname of my executable in a different way to avoid this? Do I need a network file system for that? Jeff Squyres wrote: The short version of the answer is to check to see that the executable is in the same location on both nodes (apparently: /home/gordon/Desktop/openmpi-1.3.3/examples/hello_c.out). Open MPI is complaining that it can't find that specific executable on the .194 node. See below for more detail. On Nov 5, 2009, at 3:19 PM, qing pang wrote: 1) I'm trying to run Open MPI with the following setup: 1 PC (as master node) and 1 notebook (as client node) connected to an ethernet router through ethernet cable. Both running Ubuntu 8.10. There are no other connections. - Is this setup OK to run Open MPI? Yes. 2) Prerequisites SSH has been set up so that the master node can access the client node through passwordless ssh. I do notice that it takes 10~15 seconds between me entering the 'ssh' command and getting onto the client node. --- Could this be too slow for Open MPI to run properly? Nope -- should be OK. I do not have programs like a network file system, network time protocol, resource management, a scheduler, etc. installed. --- Does Open MPI need any prerequisites other than passwordless ssh? Not in this case, no. 3) Open MPI is installed on both nodes - downloaded from open-mpi.org, and configured/built with default settings. 4) PATH and LD_LIBRARY_PATH On both nodes, PATH is /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games, which is the default setting in Ubuntu. LD_LIBRARY_PATH is set in ~/.bashrc - I added one line at the end of the file, 'export LD_LIBRARY_PATH=usr/local/lib:usr/lib' So when I echo them on both nodes, I get: >echo $PATH >/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games >echo $LD_LIBRARY_PATH >usr/local/lib:usr/lib But, if I do >ssh 'echo $LD_LIBRARY_PATH' nothing comes back, while >ssh 'echo $PATH' comes back with the right path. Is that a problem? No. 5) Problem: I compiled the example hello_c using >mpicc hello_c.c -o hello_c.out and ran it on both nodes locally; everything works fine. But when I tried to run it on 2 nodes (-np 2) >mpirun -machinefile machine.linux -np 2 $(pwd)/hello_c.out I got the following error: gordon@gordon-desktop:~/Desktop/openmpi-1.3.3/examples$ mpirun --machinefile machine.linux -np 2 $(pwd)/hello_c.out -- mpirun was unable to launch the specified application as it could not access or execute an executable: Executable: /home/gordon/Desktop/openmpi-1.3.3/examples/hello_c.out Node: 192.168.0.194 You are giving an absolute pathname in the mpirun command line: mpirun -machinefile machine.linux -np 2 $(pwd)/hello_c.out Hence, it's looking for exactly /home/gordon/Desktop/openmpi-1.3.3/examples/hello_c.out on both nodes. If the executable is in a different directory on the other node, that's where you're probably running into the problem.
Re: [OMPI users] mpirun example program fails on multiple nodes - unable to launch specified application on client node
The short version of the answer is to check to see that the executable is in the same location on both nodes (apparently: /home/gordon/Desktop/openmpi-1.3.3/examples/hello_c.out). Open MPI is complaining that it can't find that specific executable on the .194 node. See below for more detail. On Nov 5, 2009, at 3:19 PM, qing pang wrote: 1) I'm trying to run Open MPI with the following setup: 1 PC (as master node) and 1 notebook (as client node) connected to an ethernet router through ethernet cable. Both running Ubuntu 8.10. There are no other connections. - Is this setup OK to run Open MPI? Yes. 2) Prerequisites SSH has been set up so that the master node can access the client node through passwordless ssh. I do notice that it takes 10~15 seconds between me entering the 'ssh' command and getting onto the client node. --- Could this be too slow for Open MPI to run properly? Nope -- should be OK. I do not have programs like a network file system, network time protocol, resource management, a scheduler, etc. installed. --- Does Open MPI need any prerequisites other than passwordless ssh? Not in this case, no. 3) Open MPI is installed on both nodes - downloaded from open-mpi.org, and configured/built with default settings. 4) PATH and LD_LIBRARY_PATH On both nodes, PATH is /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games, which is the default setting in Ubuntu. LD_LIBRARY_PATH is set in ~/.bashrc - I added one line at the end of the file, 'export LD_LIBRARY_PATH=usr/local/lib:usr/lib' So when I echo them on both nodes, I get: >echo $PATH >/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games >echo $LD_LIBRARY_PATH >usr/local/lib:usr/lib But, if I do >ssh 'echo $LD_LIBRARY_PATH' nothing comes back, while >ssh 'echo $PATH' comes back with the right path. Is that a problem? No. 5) Problem: I compiled the example hello_c using >mpicc hello_c.c -o hello_c.out and ran it on both nodes locally; everything works fine. But when I tried to run it on 2 nodes (-np 2) >mpirun -machinefile machine.linux -np 2 $(pwd)/hello_c.out I got the following error: gordon@gordon-desktop:~/Desktop/openmpi-1.3.3/examples$ mpirun --machinefile machine.linux -np 2 $(pwd)/hello_c.out -- mpirun was unable to launch the specified application as it could not access or execute an executable: Executable: /home/gordon/Desktop/openmpi-1.3.3/examples/hello_c.out Node: 192.168.0.194 You are giving an absolute pathname in the mpirun command line: mpirun -machinefile machine.linux -np 2 $(pwd)/hello_c.out Hence, it's looking for exactly /home/gordon/Desktop/openmpi-1.3.3/examples/hello_c.out on both nodes. If the executable is in a different directory on the other node, that's where you're probably running into the problem. -- Jeff Squyres jsquy...@cisco.com
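To make the absolute-pathname point concrete: without a shared filesystem, the fix is simply to put the binary at the identical path on the other node and re-run, e.g.:

# Create the directory and copy the executable to the same absolute
# path on the .194 node:
ssh 192.168.0.194 mkdir -p /home/gordon/Desktop/openmpi-1.3.3/examples
scp /home/gordon/Desktop/openmpi-1.3.3/examples/hello_c.out \
    192.168.0.194:/home/gordon/Desktop/openmpi-1.3.3/examples/
# After that, the original command line works unchanged:
mpirun -machinefile machine.linux -np 2 $(pwd)/hello_c.out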
[OMPI users] mpirun example program fails on multiple nodes - unable to launch specified application on client node
Dear Sir/Madam, I'm having a problem running the example program. Please kindly help --- I've been fooling with it for days, kind of getting lost. - MPIRUN fails on the example hello program - unable to launch the specified application on the client node - 1) I'm trying to run Open MPI with the following setup: 1 PC (as master node) and 1 notebook (as client node) connected to an ethernet router through ethernet cable. Both running Ubuntu 8.10. There are no other connections. - Is this setup OK to run Open MPI? 2) Prerequisites SSH has been set up so that the master node can access the client node through passwordless ssh. I do notice that it takes 10~15 seconds between me entering the 'ssh' command and getting onto the client node. --- Could this be too slow for Open MPI to run properly? I do not have programs like a network file system, network time protocol, resource management, a scheduler, etc. installed. --- Does Open MPI need any prerequisites other than passwordless ssh? 3) Open MPI is installed on both nodes - downloaded from open-mpi.org, and configured/built with default settings. 4) PATH and LD_LIBRARY_PATH On both nodes, PATH is /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games, which is the default setting in Ubuntu. LD_LIBRARY_PATH is set in ~/.bashrc - I added one line at the end of the file, 'export LD_LIBRARY_PATH=usr/local/lib:usr/lib' So when I echo them on both nodes, I get: >echo $PATH >/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games >echo $LD_LIBRARY_PATH >usr/local/lib:usr/lib But, if I do >ssh 'echo $LD_LIBRARY_PATH' nothing comes back, while >ssh 'echo $PATH' comes back with the right path. Is that a problem? 5) Problem: I compiled the example hello_c using >mpicc hello_c.c -o hello_c.out and ran it on both nodes locally; everything works fine. But when I tried to run it on 2 nodes (-np 2) >mpirun -machinefile machine.linux -np 2 $(pwd)/hello_c.out I got the following error: gordon@gordon-desktop:~/Desktop/openmpi-1.3.3/examples$ mpirun --machinefile machine.linux -np 2 $(pwd)/hello_c.out -- mpirun was unable to launch the specified application as it could not access or execute an executable: Executable: /home/gordon/Desktop/openmpi-1.3.3/examples/hello_c.out Node: 192.168.0.194 while attempting to start process rank 1. -- Sometimes I get one other error message after that: -- [gordon-desktop:30748] [[25975,0],0]-[[25975,1],0] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104) -- 6) Information attached: ifconfig_masternode - output of ifconfig on masternode ifconfig_slavenode - output of ifconfig on slavenode ompi_info.txt - output of ompi_info -all config.log - OpenMPI logfile machine.linux - the machinefile used in the mpirun command -- Sincerely, Qing Pang (601) 979 0270 mpirun_info.tar.gz Description: application/gzip
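The machine.linux file itself is only in the attachment; for a two-node setup like this one, an Open MPI machinefile typically contains one host per line, e.g. (the .193 address for the master is illustrative):

# machine.linux -- one host per line; slots = processes to place there
192.168.0.193 slots=1
192.168.0.194 slots=1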
Re: [OMPI users] Openmpi on Heterogeneous environment
I have had issues running across platforms, i.e. Mac OS X and Linux (Ubuntu), and haven't got it resolved. Check firewalls in case that's blocking any communication. > Dear Open-mpi users, > > I have installed openmpi on 2 different machines with different > architectures (INTEL and x86_64) separately (command: ./configure > --enable-heterogeneous). Compiled executables of the same code for these 2 > architectures. Kept these executables on the individual machines. Prepared a hostfile > containing the names of those 2 machines. > Now, when I want to execute the code (giving command - ./mpirun -hostfile > machines executable), it doesn't work, giving this error message: > > MPI_ABORT was invoked on rank 2 in communicator MPI_COMM_WORLD > with errorcode 1. > > NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes. > You may or may not see output from other processes, depending on > exactly when Open MPI kills them. > -- > -- > mpirun has exited due to process rank 2 with PID 1712 on > node studpc1.xxx..xx exiting without calling "finalize". This may > have caused other processes in the application to be > terminated by signals sent by mpirun (as reported here) > > When I keep only one machine name in the hostfile, the execution works > perfectly. > > Will anybody please guide me on running the program in a heterogeneous > environment using mpirun? > > Thanking you, > > Sincerely, > Yogesh
[OMPI users] Openmpi on Heterogeneous environment
Dear Open-mpi users, I have installed openmpi on 2 different machines with different architectures (INTEL and x86_64) separately (command: ./configure --enable-heterogeneous). I compiled executables of the same code for these 2 architectures and kept these executables on the individual machines. I prepared a hostfile containing the names of those 2 machines. Now, when I want to execute the code (giving the command - ./mpirun -hostfile machines executable), it doesn't work, giving this error message: MPI_ABORT was invoked on rank 2 in communicator MPI_COMM_WORLD with errorcode 1. NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes. You may or may not see output from other processes, depending on exactly when Open MPI kills them. -- -- mpirun has exited due to process rank 2 with PID 1712 on node studpc1.xxx..xx exiting without calling "finalize". This may have caused other processes in the application to be terminated by signals sent by mpirun (as reported here) When I keep only one machine name in the hostfile, the execution works perfectly. Will anybody please guide me on running the program in a heterogeneous environment using mpirun? Thanking you, Sincerely, Yogesh
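For reference, the usual shape of such a heterogeneous setup is sketched below with illustrative paths (the configure flag comes from the poster's own command): Open MPI is built with heterogeneous support on every architecture, the application is compiled natively on each machine, and the binary is installed at the same absolute path everywhere before launching:

# On each machine, build Open MPI with heterogeneous support:
./configure --enable-heterogeneous --prefix=/opt/openmpi
make all install
# Compile the application natively on each machine and place it at an
# identical absolute path, then launch with the hostfile:
mpirun -hostfile machines /home/yogesh/bin/executable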
Re: [OMPI users] Help: Firewall problems
Technically the MPI spec may not put a requirement on TCP/IP; however, Open MPI's runtime environment needs some way to launch jobs and pass data around in a standard way, and it currently uses TCP/IP. That being said, there have been rumblings for some time about using other protocols, but that has not yet come into being. --td *Subject:* [OMPI users] Help: Firewall problems *From:* Lee Amy (/openlinuxsource_at_[hidden]/) *Date:* 2009-11-05 11:28:50 Hi, I recall that MPI does not depend on TCP/IP, but why do the default iptables rules prevent MPI programs from running? After I stop iptables, the programs run well. I use Ethernet as the connection. Could anyone give me tips on fixing this problem? Thank you very much. Amy
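Short of disabling iptables entirely, a common workaround is to trust traffic from the cluster's own subnet, since Open MPI's TCP transport and runtime pick ports dynamically; a minimal sketch, assuming the nodes live on the illustrative subnet 192.168.1.0/24:

# On each node (as root): accept all traffic from the cluster subnet
iptables -I INPUT -s 192.168.1.0/24 -j ACCEPT
# Persisting the rule is distribution-specific; on RHEL/CentOS:
service iptables save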
Re: [OMPI users] Segmentation fault whilst running RAxML-MPI
FWIW, I think Intel released 11.1.059 earlier today (I've been trying to download it all morning). I doubt it's an issue in this case, but I thought I'd mention it as a public service announcement. ;-) Seg faults are *usually* an application issue (never say "never", but they *usually* are). You might want to first contact the RAxML team to see if there are any known issues with their software and Open MPI 1.3.3...? (Sorry, I'm totally unfamiliar with RAxML) On Nov 5, 2009, at 12:30 PM, Nick Holway wrote: Dear all, I'm trying to run RAxML 7.0.4 on my 64-bit Rocks 5.1 cluster (i.e. CentOS 5.2). I compiled Open MPI 1.3.3 using the Intel compilers v 11.1.056 using ./configure CC=icc CXX=icpc F77=ifort FC=ifort --with-sge --prefix=/usr/prog/mpi/openmpi/1.3.3/x86_64-no-mem-man --with-memory-manager=none. When I run RAxML in a qlogin session using /usr/prog/mpi/openmpi/1.3.3/x86_64-no-mem-man/bin/mpirun -np 8 /usr/prog/bioinformatics/RAxML/7.0.4/x86_64/RAxML-7.0.4/raxmlHPC-MPI -f a -x 12345 -p12345 -# 10 -m GTRGAMMA -s /users/holwani1/jay/ornodko-1582 -n mpitest39 I get the following output: This is the RAxML MPI Worker Process Number: 1 This is the RAxML MPI Worker Process Number: 3 This is the RAxML MPI Master process This is the RAxML MPI Worker Process Number: 7 This is the RAxML MPI Worker Process Number: 4 This is the RAxML MPI Worker Process Number: 5 This is the RAxML MPI Worker Process Number: 2 This is the RAxML MPI Worker Process Number: 6 IMPORTANT WARNING: Alignment column 1695 contains only undetermined values which will be treated as missing data IMPORTANT WARNING: Sequences A4_H10 and A3ii_E11 are exactly identical IMPORTANT WARNING: Sequences A2_A08 and A9_C10 are exactly identical IMPORTANT WARNING: Sequences A3ii_B03 and A3ii_C06 are exactly identical IMPORTANT WARNING: Sequences A9_D08 and A9_F10 are exactly identical IMPORTANT WARNING: Sequences A3ii_F07 and A9_C08 are exactly identical IMPORTANT WARNING: Sequences A6_F05 and A6_F11 are exactly identical IMPORTANT WARNING Found 6 sequences that are exactly identical to other sequences in the alignment. Normally they should be excluded from the analysis. IMPORTANT WARNING Found 1 column that contains only undetermined values which will be treated as missing data. Normally these columns should be excluded from the analysis.
An alignment file with undetermined columns and sequence duplicates removed has already been printed to file /users/holwani1/jay/ornodko-1582.reduced You are using RAxML version 7.0.4 released by Alexandros Stamatakis in April 2008 Alignment has 1280 distinct alignment patterns Proportion of gaps and completely undetermined characters in this alignment: 0.124198 RAxML rapid bootstrapping and subsequent ML search Executing 10 rapid bootstrap inferences and thereafter a thorough ML search All free model parameters will be estimated by RAxML GAMMA model of rate heteorgeneity, ML estimate of alpha-parameter GAMMA Model parameters will be estimated up to an accuracy of 0.10 Log Likelihood units Partition: 0 Name: No Name Provided DataType: DNA Substitution Matrix: GTR Empirical Base Frequencies: pi(A): 0.261129 pi(C): 0.228570 pi(G): 0.315946 pi(T): 0.194354 Switching from GAMMA to CAT for rapid Bootstrap, final ML search will be conducted under the GAMMA model you specified Bootstrap[10]: Time 44.442728 bootstrap likelihood -inf, best rearrangement setting 5 Bootstrap[0]: Time 44.814948 bootstrap likelihood -inf, best rearrangement setting 5 Bootstrap[6]: Time 46.470371 bootstrap likelihood -inf, best rearrangement setting 6 [compute-0-11:08698] *** Process received signal *** [compute-0-11:08698] Signal: Segmentation fault (11) [compute-0-11:08698] Signal code: Address not mapped (1) [compute-0-11:08698] Failing at address: 0x408 [compute-0-11:08698] [ 0] /lib64/libpthread.so.0 [0x3fb580de80] [compute-0-11:08698] [ 1] /usr/prog/bioinformatics/RAxML/7.0.4/x86_64/RAxML-7.0.4/raxmlHPC-MPI(hookup+0) [0x413ca0] [compute-0-11:08698] [ 2] /usr/prog/bioinformatics/RAxML/7.0.4/x86_64/RAxML-7.0.4/raxmlHPC-MPI(restoreTL+0xd9) [0x442c09] [compute-0-11:08698] [ 3] /usr/prog/bioinformatics/RAxML/7.0.4/x86_64/RAxML-7.0.4/raxmlHPC-MPI [0x42c968] [compute-0-11:08698] [ 4] /usr/prog/bioinformatics/RAxML/7.0.4/x86_64/RAxML-7.0.4/raxmlHPC-MPI(doAllInOne+0x91a) [0x42b21a] [compute-0-11:08698] [ 5] /usr/prog/bioinformatics/RAxML/7.0.4/x86_64/RAxML-7.0.4/raxmlHPC-MPI(main+0xc25) [0x4063f5] [compute-0-11:08698] [ 6] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3fb501d8b4] [compute-0-11:08698] [ 7] /usr/prog/bioinformatics/RAxML/7.0.4/x86_64/RAxML-7.0.4/raxmlHPC-MPI [0x405719] [compute-0-11:08698] *** End of error message *** Bootstrap[1]: Time 8.400332 bootstrap likelihood -inf, best rearrangement setting 5 -- mpirun noticed that process rank 1 with PID 8698 on node compute-0-11.local exited on signal 11 (Segmentation fault).
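As a general (not RAxML-specific) next step for a segfault like this, one can let the failing rank dump core and inspect the backtrace; a minimal sketch, assuming core files are permitted on the compute nodes and the limit propagates to the remote ranks:

# In the shell that launches the job, allow core dumps:
ulimit -c unlimited
# Re-run the mpirun command above; after the crash, load the core in gdb:
gdb /usr/prog/bioinformatics/RAxML/7.0.4/x86_64/RAxML-7.0.4/raxmlHPC-MPI core
# Typing 'bt' at the gdb prompt shows where hookup/restoreTL faulted.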
[OMPI users] Segmentation fault whilst running RAxML-MPI
Dear all, I'm trying to run RAxML 7.0.4 on my 64-bit Rocks 5.1 cluster (i.e. CentOS 5.2). I compiled Open MPI 1.3.3 using the Intel compilers v 11.1.056 using ./configure CC=icc CXX=icpc F77=ifort FC=ifort --with-sge --prefix=/usr/prog/mpi/openmpi/1.3.3/x86_64-no-mem-man --with-memory-manager=none. When I run RAxML in a qlogin session using /usr/prog/mpi/openmpi/1.3.3/x86_64-no-mem-man/bin/mpirun -np 8 /usr/prog/bioinformatics/RAxML/7.0.4/x86_64/RAxML-7.0.4/raxmlHPC-MPI -f a -x 12345 -p12345 -# 10 -m GTRGAMMA -s /users/holwani1/jay/ornodko-1582 -n mpitest39 I get the following output: This is the RAxML MPI Worker Process Number: 1 This is the RAxML MPI Worker Process Number: 3 This is the RAxML MPI Master process This is the RAxML MPI Worker Process Number: 7 This is the RAxML MPI Worker Process Number: 4 This is the RAxML MPI Worker Process Number: 5 This is the RAxML MPI Worker Process Number: 2 This is the RAxML MPI Worker Process Number: 6 IMPORTANT WARNING: Alignment column 1695 contains only undetermined values which will be treated as missing data IMPORTANT WARNING: Sequences A4_H10 and A3ii_E11 are exactly identical IMPORTANT WARNING: Sequences A2_A08 and A9_C10 are exactly identical IMPORTANT WARNING: Sequences A3ii_B03 and A3ii_C06 are exactly identical IMPORTANT WARNING: Sequences A9_D08 and A9_F10 are exactly identical IMPORTANT WARNING: Sequences A3ii_F07 and A9_C08 are exactly identical IMPORTANT WARNING: Sequences A6_F05 and A6_F11 are exactly identical IMPORTANT WARNING Found 6 sequences that are exactly identical to other sequences in the alignment. Normally they should be excluded from the analysis. IMPORTANT WARNING Found 1 column that contains only undetermined values which will be treated as missing data. Normally these columns should be excluded from the analysis.
An alignment file with undetermined columns and sequence duplicates removed has already been printed to file /users/holwani1/jay/ornodko-1582.reduced You are using RAxML version 7.0.4 released by Alexandros Stamatakis in April 2008 Alignment has 1280 distinct alignment patterns Proportion of gaps and completely undetermined characters in this alignment: 0.124198 RAxML rapid bootstrapping and subsequent ML search Executing 10 rapid bootstrap inferences and thereafter a thorough ML search All free model parameters will be estimated by RAxML GAMMA model of rate heteorgeneity, ML estimate of alpha-parameter GAMMA Model parameters will be estimated up to an accuracy of 0.10 Log Likelihood units Partition: 0 Name: No Name Provided DataType: DNA Substitution Matrix: GTR Empirical Base Frequencies: pi(A): 0.261129 pi(C): 0.228570 pi(G): 0.315946 pi(T): 0.194354 Switching from GAMMA to CAT for rapid Bootstrap, final ML search will be conducted under the GAMMA model you specified Bootstrap[10]: Time 44.442728 bootstrap likelihood -inf, best rearrangement setting 5 Bootstrap[0]: Time 44.814948 bootstrap likelihood -inf, best rearrangement setting 5 Bootstrap[6]: Time 46.470371 bootstrap likelihood -inf, best rearrangement setting 6 [compute-0-11:08698] *** Process received signal *** [compute-0-11:08698] Signal: Segmentation fault (11) [compute-0-11:08698] Signal code: Address not mapped (1) [compute-0-11:08698] Failing at address: 0x408 [compute-0-11:08698] [ 0] /lib64/libpthread.so.0 [0x3fb580de80] [compute-0-11:08698] [ 1] /usr/prog/bioinformatics/RAxML/7.0.4/x86_64/RAxML-7.0.4/raxmlHPC-MPI(hookup+0) [0x413ca0] [compute-0-11:08698] [ 2] /usr/prog/bioinformatics/RAxML/7.0.4/x86_64/RAxML-7.0.4/raxmlHPC-MPI(restoreTL+0xd9) [0x442c09] [compute-0-11:08698] [ 3] /usr/prog/bioinformatics/RAxML/7.0.4/x86_64/RAxML-7.0.4/raxmlHPC-MPI [0x42c968] [compute-0-11:08698] [ 4] /usr/prog/bioinformatics/RAxML/7.0.4/x86_64/RAxML-7.0.4/raxmlHPC-MPI(doAllInOne+0x91a) [0x42b21a] [compute-0-11:08698] [ 5] /usr/prog/bioinformatics/RAxML/7.0.4/x86_64/RAxML-7.0.4/raxmlHPC-MPI(main+0xc25) [0x4063f5] [compute-0-11:08698] [ 6] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3fb501d8b4] [compute-0-11:08698] [ 7] /usr/prog/bioinformatics/RAxML/7.0.4/x86_64/RAxML-7.0.4/raxmlHPC-MPI [0x405719] [compute-0-11:08698] *** End of error message *** Bootstrap[1]: Time 8.400332 bootstrap likelihood -inf, best rearrangement setting 5 -- mpirun noticed that process rank 1 with PID 8698 on node compute-0-11.local exited on signal 11 (Segmentation fault). -- My $PATH is
[OMPI users] Help: Firewall problems
Hi, I recall that MPI does not depend on TCP/IP, but why do the default iptables rules prevent MPI programs from running? After I stop iptables, the programs run well. I use Ethernet as the connection. Could anyone give me tips on fixing this problem? Thank you very much. Amy
Re: [OMPI users] Mac OSX 10.6 (SL) + openMPI 1.3.3 + Intel Compilers 11.1.076
On Nov 5, 2009, at 9:00 AM, Christophe Peyret wrote: How can I deactivate Xgrid launching in order to be able to use Open MPI under Snow Leopard? Easiest way is to just remove the xgrid plugin: rm where_you_installed_ompi/lib/openmpi/mca_*xgrid* -- Jeff Squyres jsquy...@cisco.com
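If you would rather leave the install untouched, another route is to steer Open MPI's launcher selection away from xgrid at run time via the standard MCA mechanism; a hedged sketch (whether this avoids the crash during component query may vary by version, and ./a.out is an illustrative binary):

# Ask ORTE to use the rsh/ssh launcher instead of xgrid for this run:
mpirun --mca plm rsh -np 2 ./a.out
# Or set it persistently for your account:
echo "plm = rsh" >> ~/.openmpi/mca-params.conf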
Re: [OMPI users] Mac OSX 10.6 (SL) + openMPI 1.3.3 + Intel Compilers 11.1.076
How can I deactivate Xgrid launching in order to be able to use Open MPI under Snow Leopard? On 5 Nov 2009, at 13:18, Christophe Peyret wrote: Hello, I'm trying to launch a job with mpirun on my Mac Pro and I have a strange error message, any idea? Christophe [santafe.onera:00235] orte:plm:xgrid: Connection to XGrid controller unexpectedly closed: (600) The operation couldn’t be completed. (BEEP error 600.) 2009-11-05 13:13:53.973 orted[235:903] *** Terminating app due to uncaught exception 'NSInvalidArgumentException', reason: '*** -[XGConnection<0x100224df0> finalize]: called when collecting not enabled' *** Call stack at first throw: ( 0 CoreFoundation 0x7fff8712c5a4 __exceptionPreprocess + 180 1 libobjc.A.dylib 0x7fff87b8d313 objc_exception_throw + 45 2 CoreFoundation 0x7fff87147251 -[NSObject(NSObject) finalize] + 129 3 mca_plm_xgrid.so 0x000100149720 -[PlmXGridClient dealloc] + 64 4 mca_plm_xgrid.so 0x0001001480e0 orte_plm_xgrid_finalize + 64 5 mca_plm_xgrid.so 0x000100147fa1 orte_plm_xgrid_component_query + 529 6 libopen-pal.0.dylib 0x0001000811ea mca_base_select + 186 ) terminate called after throwing an instance of 'NSException' [santafe:00235] *** Process received signal *** [santafe:00235] Signal: Abort trap (6) [santafe:00235] Signal code: (0) [santafe:00235] *** End of error message *** [santafe.onera:00233] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ess_singleton_module.c at line 381 [santafe.onera:00233] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ess_singleton_module.c at line 143 [santafe.onera:00233] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file runtime/orte_init.c at line 132 -- It looks like orte_init failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during orte_init; some of which are due to configuration or environment problems. This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer): orte_ess_set_name failed --> Returned value Unable to start a daemon on the local node (-128) instead of ORTE_SUCCESS -- -- It looks like MPI_INIT failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during MPI_INIT; some of which are due to configuration or environment problems. This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer): ompi_mpi_init: orte_init failed --> Returned "Unable to start a daemon on the local node" (-128) instead of "Success" (0) -- *** An error occurred in MPI_Init *** before MPI was initialized *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort) [santafe.onera:233] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed! santafe:Example peyret$
Re: [OMPI users] Mac OSX 10.6 (SL) + openMPI 1.3.3 + Intel Compilers 11.1.076
I'm afraid that Open MPI v1.3.x's xgrid support is currently broken -- we haven't had anyone with the knowledge or experience available to fix it. :-( Patches would be welcome... Note that Open MPI itself works fine on Snow Leopard -- it's just the xgrid launching support that is broken. On Nov 5, 2009, at 7:18 AM, Christophe Peyret wrote: Hello, I'm trying to launch a job with mpirun on my Mac Pro and I have a strange error message, any idea? Christophe [santafe.onera:00235] orte:plm:xgrid: Connection to XGrid controller unexpectedly closed: (600) The operation couldn’t be completed. (BEEP error 600.) 2009-11-05 13:13:53.973 orted[235:903] *** Terminating app due to uncaught exception 'NSInvalidArgumentException', reason: '*** -[XGConnection<0x100224df0> finalize]: called when collecting not enabled' *** Call stack at first throw: ( 0 CoreFoundation 0x7fff8712c5a4 __exceptionPreprocess + 180 1 libobjc.A.dylib 0x7fff87b8d313 objc_exception_throw + 45 2 CoreFoundation 0x7fff87147251 -[NSObject(NSObject) finalize] + 129 3 mca_plm_xgrid.so 0x000100149720 -[PlmXGridClient dealloc] + 64 4 mca_plm_xgrid.so 0x0001001480e0 orte_plm_xgrid_finalize + 64 5 mca_plm_xgrid.so 0x000100147fa1 orte_plm_xgrid_component_query + 529 6 libopen-pal.0.dylib 0x0001000811ea mca_base_select + 186 ) terminate called after throwing an instance of 'NSException' [santafe:00235] *** Process received signal *** [santafe:00235] Signal: Abort trap (6) [santafe:00235] Signal code: (0) [santafe:00235] *** End of error message *** [santafe.onera:00233] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ess_singleton_module.c at line 381 [santafe.onera:00233] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ess_singleton_module.c at line 143 [santafe.onera:00233] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file runtime/orte_init.c at line 132 -- It looks like orte_init failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during orte_init; some of which are due to configuration or environment problems. This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer): orte_ess_set_name failed --> Returned value Unable to start a daemon on the local node (-128) instead of ORTE_SUCCESS -- -- It looks like MPI_INIT failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during MPI_INIT; some of which are due to configuration or environment problems. This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer): ompi_mpi_init: orte_init failed --> Returned "Unable to start a daemon on the local node" (-128) instead of "Success" (0) -- *** An error occurred in MPI_Init *** before MPI was initialized *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort) [santafe.onera:233] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed! santafe:Example peyret$ -- Jeff Squyres jsquy...@cisco.com
[OMPI users] Mac OSX 10.6 (SL) + openMPI 1.3.3 + Intel Compilers 11.1.076
Hello, I'm trying to launch a job with mpirun on my Mac Pro and I have a strange error message, any idea? Christophe [santafe.onera:00235] orte:plm:xgrid: Connection to XGrid controller unexpectedly closed: (600) The operation couldn’t be completed. (BEEP error 600.) 2009-11-05 13:13:53.973 orted[235:903] *** Terminating app due to uncaught exception 'NSInvalidArgumentException', reason: '*** -[XGConnection<0x100224df0> finalize]: called when collecting not enabled' *** Call stack at first throw: ( 0 CoreFoundation 0x7fff8712c5a4 __exceptionPreprocess + 180 1 libobjc.A.dylib 0x7fff87b8d313 objc_exception_throw + 45 2 CoreFoundation 0x7fff87147251 -[NSObject(NSObject) finalize] + 129 3 mca_plm_xgrid.so 0x000100149720 -[PlmXGridClient dealloc] + 64 4 mca_plm_xgrid.so 0x0001001480e0 orte_plm_xgrid_finalize + 64 5 mca_plm_xgrid.so 0x000100147fa1 orte_plm_xgrid_component_query + 529 6 libopen-pal.0.dylib 0x0001000811ea mca_base_select + 186 ) terminate called after throwing an instance of 'NSException' [santafe:00235] *** Process received signal *** [santafe:00235] Signal: Abort trap (6) [santafe:00235] Signal code: (0) [santafe:00235] *** End of error message *** [santafe.onera:00233] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ess_singleton_module.c at line 381 [santafe.onera:00233] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ess_singleton_module.c at line 143 [santafe.onera:00233] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file runtime/orte_init.c at line 132 -- It looks like orte_init failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during orte_init; some of which are due to configuration or environment problems. This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer): orte_ess_set_name failed --> Returned value Unable to start a daemon on the local node (-128) instead of ORTE_SUCCESS -- -- It looks like MPI_INIT failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during MPI_INIT; some of which are due to configuration or environment problems. This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer): ompi_mpi_init: orte_init failed --> Returned "Unable to start a daemon on the local node" (-128) instead of "Success" (0) -- *** An error occurred in MPI_Init *** before MPI was initialized *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort) [santafe.onera:233] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed! santafe:Example peyret$
Re: [OMPI users] Question about checkpoint/restart protocol
Dear Sergio, Thank you for your reply. I've inserted the modules into the kernel and it all worked fine. But there is still a weird issue. I use the command "mpirun -n 2 -am ft-enable-cr -H comp001 checkpoint-restart-test" to start an MPI job. I then use "ompi-checkpoint PID" to checkpoint the job, but ompi-checkpoint didn't respond and mpirun produces the following. -- An MPI process has executed an operation involving a call to the "fork()" system call to create a child process. Open MPI is currently operating in a condition that could result in memory corruption or other system errors; your MPI job may hang, crash, or produce silent data corruption. The use of fork() (or system() or other calls that create child processes) is strongly discouraged. The process that invoked fork was: Local host: comp001.local (PID 23514) MPI_COMM_WORLD rank: 0 If you are *absolutely sure* that your application will successfully and correctly survive a call to fork(), you may disable this warning by setting the mpi_warn_on_fork MCA parameter to 0. -- [login01.local:21425] 1 more process has sent help message help-mpi-runtime.txt / mpi_init:warn-fork [login01.local:21425] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages Note: if the -n option has a value of more than 1, this error occurs; but if the -n option has the value 1, then the ompi-checkpoint succeeds, mpirun produces the same message, and ompi-restart fails with the message [login01:21417] *** Process received signal *** [login01:21417] Signal: Segmentation fault (11) [login01:21417] Signal code: Address not mapped (1) [login01:21417] Failing at address: (nil) [login01:21417] [ 0] /lib64/libpthread.so.0 [0x32df20de70] [login01:21417] [ 1] /home/mab/openmpi-1.3.3/lib/openmpi/mca_crs_blcr.so [0x2b093509dfee] [login01:21417] [ 2] /home/mab/openmpi-1.3.3/lib/openmpi/mca_crs_blcr.so(opal_crs_blcr_restart+0xd9) [0x2b093509d251] [login01:21417] [ 3] opal-restart [0x401c3e] [login01:21417] [ 4] /lib64/libc.so.6(__libc_start_main+0xf4) [0x32dea1d8b4] [login01:21417] [ 5] opal-restart [0x401399] [login01:21417] *** End of error message *** -- mpirun noticed that process rank 0 with PID 21417 on node login01.local exited on signal 11 (Segmentation fault). -- Any help with that will be appreciated. Thanks in advance, Mohamed Adel From: users-boun...@open-mpi.org [users-boun...@open-mpi.org] On Behalf Of Sergio Díaz [sd...@cesga.es] Sent: Thursday, November 05, 2009 11:38 AM To: Open MPI Users Subject: Re: [OMPI users] Question about checkpoint/restart protocol Hi, Did you load the BLCR modules before compiling OpenMPI? Regards, Sergio Mohamed Adel wrote: > Dear OMPI users, > > I'm a new OpenMPI user. I've configured openmpi-1.3.3 with the following options "./configure --prefix=/home/mab/openmpi-1.3.3 --with-sge --enable-ft-thread > --with-ft=cr --enable-mpi-threads --enable-static --disable-shared > --with-blcr=/home/mab/blcr-0.8.2/" then compiled and installed it > successfully. > Now I'm trying to use the checkpoint/restart protocol. I run a program with > the options "mpirun -n 2 -am ft-enable-cr -H localhost > prime/checkpoint-restart-test" but I receive the following error: > > *** An error occurred in MPI_Init > *** before MPI was initialized > *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort) > [madel:28896] Abort before MPI_INIT completed successfully; not able to > guarantee that all other processes were killed! > -- > It looks like opal_init failed for some reason; your parallel process is > likely to abort. There are many reasons that a parallel process can > fail during opal_init; some of which are due to configuration or > environment problems. This failure appears to be an internal failure; > here's some additional information (which may only be relevant to an > Open MPI developer): > > opal_cr_init() failed failed > --> Returned value -1 instead of OPAL_SUCCESS > -- > [madel:28896] [[INVALID],INVALID] ORTE_ERROR_LOG: Error in file > runtime/orte_init.c at line 77 > -- > It looks like MPI_INIT failed for some reason; your parallel process is > likely to abort. There are many reasons that a parallel process can > fail during MPI_INIT; some of which are due to configuration or environment > problems. This failure appears to be an internal failure; here's some > additional information (which may only be relevant to an Open MPI developer): > > ompi_mpi_init: orte_init failed > --> Returned "Error" (-1) instead of "Success" (0) > -- > > I can't find the files mentioned in this post "http://www.open-mpi.org/community/lists/users/2009/09/10641.php" (mca_crs_blcr.so, mca_crs_blcr.la). Could you please help me with that error? > Thanks in advance > Mohamed Adel
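For reference, the basic checkpoint/restart cycle with this setup looks like the following; the fork() warning can be silenced with the mpi_warn_on_fork parameter named in the message above, and the snapshot name follows Open MPI's default ompi_global_snapshot_<PID>.ckpt pattern:

# Launch with the C/R profile, silencing the fork() warning:
mpirun -n 2 -am ft-enable-cr --mca mpi_warn_on_fork 0 -H comp001 checkpoint-restart-test
# From another shell, checkpoint (and optionally terminate) the job,
# passing the PID of mpirun itself:
ompi-checkpoint --term <PID_of_mpirun>
# Restart from the global snapshot reference it prints:
ompi-restart ompi_global_snapshot_<PID>.ckpt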
Re: [OMPI users] Question about checkpoint/restart protocol
Hi, Did you load the BLCR modules before compiling OpenMPI? Regards, Sergio Mohamed Adel wrote: Dear OMPI users, I'm a new OpenMPI user. I've configured openmpi-1.3.3 with the following options "./configure --prefix=/home/mab/openmpi-1.3.3 --with-sge --enable-ft-thread --with-ft=cr --enable-mpi-threads --enable-static --disable-shared --with-blcr=/home/mab/blcr-0.8.2/" then compiled and installed it successfully. Now I'm trying to use the checkpoint/restart protocol. I run a program with the options "mpirun -n 2 -am ft-enable-cr -H localhost prime/checkpoint-restart-test" but I receive the following error: *** An error occurred in MPI_Init *** before MPI was initialized *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort) [madel:28896] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed! -- It looks like opal_init failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during opal_init; some of which are due to configuration or environment problems. This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer): opal_cr_init() failed failed --> Returned value -1 instead of OPAL_SUCCESS -- [madel:28896] [[INVALID],INVALID] ORTE_ERROR_LOG: Error in file runtime/orte_init.c at line 77 -- It looks like MPI_INIT failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during MPI_INIT; some of which are due to configuration or environment problems. This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer): ompi_mpi_init: orte_init failed --> Returned "Error" (-1) instead of "Success" (0) -- I can't find the files mentioned in this post "http://www.open-mpi.org/community/lists/users/2009/09/10641.php" (mca_crs_blcr.so, mca_crs_blcr.la). Could you please help me with that error? Thanks in advance Mohamed Adel -- Sergio Díaz Montes Centro de Supercomputacion de Galicia Avda. de Vigo. s/n (Campus Sur) 15706 Santiago de Compostela (Spain) Tel: +34 981 56 98 10 ; Fax: +34 981 59 46 16 email: sd...@cesga.es ; http://www.cesga.es/
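To check the point Sergio raises, the BLCR kernel modules can be loaded and verified before rebuilding Open MPI; the module and utility names below are the standard ones from a stock BLCR install:

# Load the BLCR kernel modules (as root), then verify they are present:
modprobe blcr
lsmod | grep blcr
# BLCR's own utilities should work before trying Open MPI C/R:
cr_checkpoint --help
cr_restart --help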