Re: [OMPI users] OpenMPI Giving problems when using -mca btl mx, sm, self
Hi,

On Monday 01 October 2007 03:08:04 am Hammad Siddiqi wrote:
> One more thing to add: -mca mtl mx uses ethernet and IP emulation of
> Myrinet to my knowledge. I want to use Myrinet (not its IP emulation)
> and shared memory simultaneously.

This is not true (as far as I know). Open MPI has two different network stacks, and either one can use MX. See:
http://www.open-mpi.org/faq/?category=myrinet#myri-btl-mx

The mx mtl relies on the MX library for all communications, and the MX library itself does shared-memory message passing. In my experience the mx mtl performs better than the mx,sm,self btl combination. However, I would encourage you to try both with your application; I would be interested in hearing your results.

> *1. /opt/SUNWhpc/HPC7.0/bin/mpirun -np 2 -mca btl mx,sm,self -host
> "indus1,indus2" -mca btl_base_debug 1000 ./hello*
>
> /opt/SUNWhpc/HPC7.0/bin/mpirun -np 4 -mca btl mx,sm,self -host
> "indus1,indus2,indus3,indus4" -mca btl_base_debug 1000 ./hello
> [indus1:29331] select: initializing btl component mx
> [indus1:29331] select: init returned failure
> [indus1:29331] select: module mx unloaded

So it looks like we are trying to load the MX library but failing for some reason. Are you sure MX is working correctly? Can you run mx_pingpong between indus1 and indus2 as described here:
http://www.myri.com/cgi-bin/fom.pl?file=455=file%253D91

> *2.1 /opt/SUNWhpc/HPC7.0/bin/mpirun -np 4 -mca mtl mx -host
> "indus1,indus2,indus3,indus4" ./hello*
>
> This command works fine.

Since you did not specify the cm pml (which MUST be done to use the mx mtl; see http://www.open-mpi.org/faq/?category=myrinet#myri-btl-mx), you were probably actually using tcp for this run, since we automatically fall back after the mx btl fails to load.

> *2.2 /opt/SUNWhpc/HPC7.0/bin/mpirun -np 4 -mca mtl mx -host
> "indus1,indus2,indus3,indus4" -mca pml cm ./hello*
>
> This command works fine.

Good.
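To summarize the two MX selection paths discussed above, here is a sketch using the commands from this thread (the install path, hostnames, and ./hello binary are the original poster's; these obviously only run on a machine with Open MPI and MX installed):

```shell
# Path 1: the BTL stack -- MX byte-transfer layer plus shared-memory
# and self components, selected explicitly:
/opt/SUNWhpc/HPC7.0/bin/mpirun -np 4 -mca btl mx,sm,self \
    -host "indus1,indus2,indus3,indus4" ./hello

# Path 2: the MTL stack -- the mx mtl requires the cm pml to be
# selected explicitly; without "-mca pml cm" Open MPI silently falls
# back to another transport (tcp, in this thread's case):
/opt/SUNWhpc/HPC7.0/bin/mpirun -np 4 -mca pml cm -mca mtl mx \
    -host "indus1,indus2,indus3,indus4" ./hello
```

With the mtl path, shared-memory communication between ranks on the same node is handled inside the MX library itself, which is why no sm component is listed.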
So maybe there isn't anything wrong with your MX setup.

> Also *"/opt/SUNWhpc/HPC7.0/bin/mpirun -np 4 -mca pml cm -host
> "indus1,indus2,indus3,indus4" -mca mtl_base_debug 1000 ./hello"*,
> this command works fine.

Since you selected the cm pml, we should automatically be using the mx mtl here.

> but *"/opt/SUNWhpc/HPC7.0/bin/mpirun -np 8 -mca pml cm -host
> "indus1,indus2,indus3,indus4" -mca mtl_base_debug 1000 ./hello"*
> hangs for an indefinite time.

Strange. I do not know why this would hang.

> Also *"/opt/SUNWhpc/HPC7.0/bin/mpirun -np 8 -mca mtl mx,sm,self -host
> "indus1,indus2,indus3,indus4" -mca mtl_base_debug 1000 ./hello"*
> works fine.

Again, you are falling back to the tcp btl here. By the way, the mtl string 'mx,sm,self' is bogus: there are no sm or self mtls.

> *2.3 /opt/SUNWhpc/HPC7.0/bin/mpirun -np 8 -mca mtl mx -host
> "indus1,indus2,indus3,indus4" -mca pml cm ./hello*
>
> This command hangs the machines for an indefinite time.
>
> Also *"/opt/SUNWhpc/HPC7.0/bin/mpirun -np 8 -mca mtl mx -host
> "indus1,indus2,indus3,indus4" -mca pml cm -mca mtl_base_debug 1000
> ./hello"* hangs the systems for an indefinite time.

These two commands should have exactly the same effect as the hang above.

> *2.4 /opt/SUNWhpc/HPC7.0/bin/mpirun -np 8 -mca mtl mx,sm,self -host
> "indus1,indus2,indus3,indus4" -mca pml cm -mca mtl_base_debug 1000
> ./hello*
>
> This command hangs the machines for an indefinite time.

Again, the mtl list here is bogus.

> Please notice that running more than four MPI processes hangs the
> machines. Any suggestion please.

The first thing I would try is to see whether a non-MPI application works. So try:

/opt/SUNWhpc/HPC7.0/bin/mpirun -np 8 -host "indus1,indus2,indus3,indus4" hostname

If that works, then try a simple MPI hello application that does no communication.
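A minimal "hello" along the lines suggested above might look like this (a sketch using only standard MPI C calls; it does no point-to-point communication, so it exercises only startup and wire-up, and must be built with mpicc and launched with mpirun):

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    /* Initialize the MPI runtime; this is where the BTL/MTL
     * components are selected, so a transport misconfiguration
     * surfaces here rather than in any later communication call. */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &len);

    printf("Hello from rank %d of %d on %s\n", rank, size, name);

    MPI_Finalize();
    return 0;
}
```

If this prints one line per rank and exits cleanly across all four nodes, the launch path is fine and the hang is specific to MX communication.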
Tim

> The output of *mx_info* on each node is given below.
>
> == indus1 ==
>
> MX Version: 1.1.7rc3cvs1_1_fixes
> MX Build: @indus4:/opt/mx2g-1.1.7rc3 Thu May 31 11:36:59 PKT 2007
> 2 Myrinet boards installed.
> The MX driver is configured to support up to 4 instances and 1024 nodes.
> ===
> Instance #0: 333.2 MHz LANai, 66.7 MHz PCI bus, 2 MB SRAM
> Status: Running, P0: Link up
> MAC Address: 00:60:dd:47:ad:7c
> Product code: M3F-PCIXF-2
> Part number: 09-03392
> Serial number: 297218
> Mapper: 00:60:dd:47:b3:e8, version = 0x7677b8ba, configured
> Mapped hosts: 10
>
>                                        ROUTE COUNT
> INDEX  MAC ADDRESS        HOST NAME    P0
> -----  -----------        ---------    --
>  0) 00:60:dd:47:ad:7c  indus1:0  1,1
>  2) 00:60:dd:47:ad:68  indus4:0  8,3
>  3) 00:60:dd:47:b3:e8  indus4:1  7,3
>  4)
Re: [OMPI users] OpenMPI Giving problems when using -mca btl mx, sm, self
Dear Tim,

Your and Tim Mattox's suggestions yielded the following results:

*1. /opt/SUNWhpc/HPC7.0/bin/mpirun -np 2 -mca btl mx,sm,self -host "indus1,indus2" -mca btl_base_debug 1000 ./hello*

/opt/SUNWhpc/HPC7.0/bin/mpirun -np 4 -mca btl mx,sm,self -host "indus1,indus2,indus3,indus4" -mca btl_base_debug 1000 ./hello
[indus1:29331] select: initializing btl component mx
[indus1:29331] select: init returned failure
[indus1:29331] select: module mx unloaded
[indus1:29331] select: initializing btl component sm
[indus1:29331] select: init returned success
[indus1:29331] select: initializing btl component self
[indus1:29331] select: init returned success
[indus3:13520] select: initializing btl component mx
[indus3:13520] select: init returned failure
[indus3:13520] select: module mx unloaded
[indus3:13520] select: initializing btl component sm
[indus3:13520] select: init returned success
[indus3:13520] select: initializing btl component self
[indus3:13520] select: init returned success
[indus4:15486] select: initializing btl component mx
[indus4:15486] select: init returned failure
[indus4:15486] select: module mx unloaded
[indus4:15486] select: initializing btl component sm
[indus4:15486] select: init returned success
[indus4:15486] select: initializing btl component self
[indus4:15486] select: init returned success
[indus2:11351] select: initializing btl component mx
[indus2:11351] select: init returned failure
[indus2:11351] select: module mx unloaded
[indus2:11351] select: initializing btl component sm
[indus2:11351] select: init returned success
[indus2:11351] select: initializing btl component self
[indus2:11351] select: init returned success
--
Process 0.1.0 is unable to reach 0.1.1 for MPI communication. If you specified the use of a BTL component, you may have forgotten a component (such as "self") in the list of usable components.
--
--
It looks like MPI_INIT failed for some reason; your parallel process is likely to abort.
There are many reasons that a parallel process can fail during MPI_INIT; some of which are due to configuration or environment problems. This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer):

PML add procs failed
--> Returned "Unreachable" (-12) instead of "Success" (0)
--
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (goodbye)
--
Process 0.1.1 is unable to reach 0.1.0 for MPI communication. If you specified the use of a BTL component, you may have forgotten a component (such as "self") in the list of usable components.
--
--
Process 0.1.2 is unable to reach 0.1.0 for MPI communication. If you specified the use of a BTL component, you may have forgotten a component (such as "self") in the list of usable components.
--
--
It looks like MPI_INIT failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during MPI_INIT
--
Process 0.1.3 is unable to reach 0.1.0 for MPI communication. If you specified the use of a BTL component, you may have forgotten a component (such as "self") in the list of usable components.
--
--
It looks like MPI_INIT failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during MPI_INIT; some of which are due to configuration or environment problems. This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer):
; some of which are due to configuration or environment problems. This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer):

PML add procs failed
--> Returned "Unreachable" (-12) instead of "Success" (0)
--
--
It looks like MPI_INIT failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel
Re: [OMPI users] OpenMPI Giving problems when using -mca btl mx, sm, self
Hi Terry,

Thanks for replying. The following command is working fine:

/opt/SUNWhpc/HPC7.0/bin/mpirun -np 4 -mca btl tcp,sm,self -machinefile machines ./hello

The contents of machines are:
indus1
indus2
indus3
indus4

I have tried using np=2 over pairs of machines, but the problem is the same. The errors that occur are given below, with the command that I am trying.

**Test 1**
/opt/SUNWhpc/HPC7.0/bin/mpirun -np 2 -mca btl mx,sm,self -host "indus1,indus2" ./hello
--
Process 0.1.1 is unable to reach 0.1.0 for MPI communication. If you specified the use of a BTL component, you may have forgotten a component (such as "self") in the list of usable components.
--
--
It looks like MPI_INIT failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during MPI_INIT; some of which are due to configuration or environment problems. This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer):

PML add procs failed
--> Returned "Unreachable" (-12) instead of "Success" (0)
--
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (goodbye)
--
Process 0.1.0 is unable to reach 0.1.1 for MPI communication. If you specified the use of a BTL component, you may have forgotten a component (such as "self") in the list of usable components.
--
--
It looks like MPI_INIT failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during MPI_INIT; some of which are due to configuration or environment problems.
This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer):

PML add procs failed
--> Returned "Unreachable" (-12) instead of "Success" (0)
--
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (goodbye)

**Test 2**
/opt/SUNWhpc/HPC7.0/bin/mpirun -np 2 -mca btl mx,sm,self -host "indus1,indus3" ./hello
--
Process 0.1.0 is unable to reach 0.1.1 for MPI communication. If you specified the use of a BTL component, you may have forgotten a component (such as "self") in the list of usable components.
--
--
It looks like MPI_INIT failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during MPI_INIT; some of which are due to configuration or environment problems. This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer):

PML add procs failed
--> Returned "Unreachable" (-12) instead of "Success" (0)
--
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (goodbye)
--
Process 0.1.1 is unable to reach 0.1.0 for MPI communication. If you specified the use of a BTL component, you may have forgotten a component (such as "self") in the list of usable components.
--
--
It looks like MPI_INIT failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during MPI_INIT; some of which are due to configuration or environment problems.
This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer):

PML add procs failed
--> Returned "Unreachable" (-12) instead of "Success" (0)
--
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (goodbye)

**Test 3**
/opt/SUNWhpc/HPC7.0/bin/mpirun -np 2 -mca btl mx,sm,self -host "indus1,indus4" ./hello
--
Process 0.1.0 is unable to reach 0.1.1 for MPI communication. If you specified
[OMPI users] OpenMPI Giving problems when using -mca btl mx, sm, self
Hello,

I am using Sun HPC Toolkit 7.0 to compile and run my C MPI programs. I have tested the Myrinet installation using Myricom's own test programs. The Myricom software stack I am using is MX, version mx2g-1.1.7; mx_mapper is also used. We have 4 nodes with 8 dual-core processors each (Sun Fire V890), and the operating system is Solaris 10 (SunOS indus1 5.10 Generic_125100-10 sun4u sparc SUNW,Sun-Fire-V890).

The contents of the machine file are:
indus1
indus2
indus3
indus4

The output of *mx_info* on each node is given below.

== indus1 ==

MX Version: 1.1.7rc3cvs1_1_fixes
MX Build: @indus4:/opt/mx2g-1.1.7rc3 Thu May 31 11:36:59 PKT 2007
2 Myrinet boards installed.
The MX driver is configured to support up to 4 instances and 1024 nodes.
===
Instance #0: 333.2 MHz LANai, 66.7 MHz PCI bus, 2 MB SRAM
Status: Running, P0: Link up
MAC Address: 00:60:dd:47:ad:7c
Product code: M3F-PCIXF-2
Part number: 09-03392
Serial number: 297218
Mapper: 00:60:dd:47:b3:e8, version = 0x7677b8ba, configured
Mapped hosts: 10

                                       ROUTE COUNT
INDEX  MAC ADDRESS        HOST NAME    P0
-----  -----------        ---------    --
 0) 00:60:dd:47:ad:7c  indus1:0   1,1
 2) 00:60:dd:47:ad:68  indus4:0   8,3
 3) 00:60:dd:47:b3:e8  indus4:1   7,3
 4) 00:60:dd:47:b3:ab  indus2:0   7,3
 5) 00:60:dd:47:ad:66  indus3:0   8,3
 6) 00:60:dd:47:ad:76  indus3:1   8,3
 7) 00:60:dd:47:ad:77  jhelum1:0  8,3
 8) 00:60:dd:47:b3:5a  ravi2:0    8,3
 9) 00:60:dd:47:ad:5f  ravi2:1    1,1
10) 00:60:dd:47:b3:bf  ravi1:0    8,3
===

== indus2 ==

MX Version: 1.1.7rc3cvs1_1_fixes
MX Build: @indus2:/opt/mx2g-1.1.7rc3 Thu May 31 11:24:03 PKT 2007
2 Myrinet boards installed.
The MX driver is configured to support up to 4 instances and 1024 nodes.
===
Instance #0: 333.2 MHz LANai, 66.7 MHz PCI bus, 2 MB SRAM
Status: Running, P0: Link up
MAC Address: 00:60:dd:47:b3:ab
Product code: M3F-PCIXF-2
Part number: 09-03392
Serial number: 296636
Mapper: 00:60:dd:47:b3:e8, version = 0x7677b8ba, configured
Mapped hosts: 10

                                       ROUTE COUNT
INDEX  MAC ADDRESS        HOST NAME    P0
-----  -----------        ---------    --
 0) 00:60:dd:47:b3:ab  indus2:0   1,1
 2) 00:60:dd:47:ad:68  indus4:0   1,1
 3) 00:60:dd:47:b3:e8  indus4:1   8,3
 4) 00:60:dd:47:ad:66  indus3:0   1,1
 5) 00:60:dd:47:ad:76  indus3:1   7,3
 6) 00:60:dd:47:ad:77  jhelum1:0  7,3
 8) 00:60:dd:47:ad:7c  indus1:0   8,3
 9) 00:60:dd:47:b3:5a  ravi2:0    8,3
10) 00:60:dd:47:ad:5f  ravi2:1    8,3
11) 00:60:dd:47:b3:bf  ravi1:0    7,3
===
Instance #1: 333.2 MHz LANai, 66.7 MHz PCI bus, 2 MB SRAM
Status: Running, P0: Link down
MAC Address: 00:60:dd:47:b3:c3
Product code: M3F-PCIXF-2
Part number: 09-03392
Serial number: 296612
Mapper: 00:60:dd:47:b3:e8, version = 0x7677b8ba, configured
Mapped hosts: 10

== indus3 ==

MX Version: 1.1.7rc3cvs1_1_fixes
MX Build: @indus3:/opt/mx2g-1.1.7rc3 Thu May 31 11:29:03 PKT 2007
2 Myrinet boards installed.
The MX driver is configured to support up to 4 instances and 1024 nodes.
===
Instance #0: 333.2 MHz LANai, 66.7 MHz PCI bus, 2 MB SRAM
Status: Running, P0: Link up
MAC Address: 00:60:dd:47:ad:66
Product code: M3F-PCIXF-2
Part number: 09-03392
Serial number: 297240
Mapper: 00:60:dd:47:b3:e8, version = 0x7677b8ba, configured
Mapped hosts: 10

                                       ROUTE COUNT
INDEX  MAC ADDRESS        HOST NAME    P0
-----  -----------        ---------    --
 0) 00:60:dd:47:ad:66  indus3:0   1,1
 1)