Re: [OMPI users] OpenMPI Giving problems when using -mca btl mx, sm, self

2007-10-02 Thread Tim Prins
Hi,

On Monday 01 October 2007 03:08:04 am Hammad Siddiqi wrote:
> One more thing to add -mca mtl mx uses ethernet and IP emulation of
> Myrinet to my knowledge. I want to use Myrinet(not its IP Emulation)
> and shared memory simultaneously.
This is not true (as far as I know...). Open MPI has two different network 
stacks, and MX can be used with either. See:
http://www.open-mpi.org/faq/?category=myrinet#myri-btl-mx

The mx mtl relies on the MX library for all communications, and the MX library 
itself does shared memory message passing. In my experience the mx mtl 
performs better than the mx,sm,self btl combination. However, I would 
encourage you to try both with your application and would be interested in 
hearing your opinion.
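To make that comparison concrete, here is a sketch of the two invocations side by side (paths and hostnames taken from this thread; adjust for your site):

```shell
# BTL path: Open MPI's own shared-memory (sm) and self components,
# plus the mx BTL for inter-node traffic.
/opt/SUNWhpc/HPC7.0/bin/mpirun -np 4 -host indus1,indus2,indus3,indus4 \
    -mca btl mx,sm,self ./hello

# MTL path: the MX library handles all traffic, including on-node
# shared memory; the cm PML must be selected explicitly.
/opt/SUNWhpc/HPC7.0/bin/mpirun -np 4 -host indus1,indus2,indus3,indus4 \
    -mca pml cm -mca mtl mx ./hello
```

Run the same binary both ways and compare timings for your own application; which is faster depends on the message sizes and communication pattern.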


> > *1.  /opt/SUNWhpc/HPC7.0/bin/mpirun -np 2 -mca btl mx,sm,self  -host
> > "indus1,indus2" -mca btl_base_debug 1000 ./hello*
> >
> > /opt/SUNWhpc/HPC7.0/bin/mpirun -np 4 -mca btl mx,sm,self  -host
> > "indus1,indus2,indus3,indus4" -mca btl_base_debug 1000 ./hello
> > [indus1:29331] select: initializing btl component mx
> > [indus1:29331] select: init returned failure
> > [indus1:29331] select: module mx unloaded


So it looks like we are trying to load the mx library, but fail for some 
reason. Are you sure MX is working correctly? Can you run mx_pingpong between 
indus1 and indus2 as described here:
http://www.myri.com/cgi-bin/fom.pl?file=455=file%253D91

> > *2.1  /opt/SUNWhpc/HPC7.0/bin/mpirun -np 4 -mca mtl mx -host
> > "indus1,indus2,indus3,indus4" ./hello*
> >
> > This command works fine
Since you did not specify the cm pml (which MUST be selected to use the mx 
mtl; see: http://www.open-mpi.org/faq/?category=myrinet#myri-btl-mx), you 
were probably actually using tcp for this run, since Open MPI automatically 
falls back to it after the mx btl fails to load.
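One way to rule out that silent fallback is to force the cm PML, so the job fails loudly instead of quietly running over TCP (command sketch using the paths from this thread):

```shell
# Force the cm PML: if the mx MTL cannot initialize, the job now aborts
# at startup instead of silently falling back to the ob1/tcp path.
/opt/SUNWhpc/HPC7.0/bin/mpirun -np 4 -host indus1,indus2,indus3,indus4 \
    -mca pml cm -mca mtl mx -mca mtl_base_debug 1000 ./hello
```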

> > *2.2 /opt/SUNWhpc/HPC7.0/bin/mpirun -np 4 -mca mtl mx -host
> > "indus1,indus2,indus3,indus4" -mca pml cm ./hello*
> >
> > This command works fine.
Good. So maybe there isn't anything wrong with your mx setup.

> > Also *"/opt/SUNWhpc/HPC7.0/bin/mpirun -np 4 -mca pml cm  -host
> > "indus1,indus2,indus3,indus4"  -mca mtl_base_debug 1000 ./hello"*,
> > this command works fine.
Since you selected the cm pml, we should be automatically using the mx mtl 
here.

> > but *"/opt/SUNWhpc/HPC7.0/bin/mpirun -np 8 -mca pml cm  -host
> > "indus1,indus2,indus3,indus4"  -mca mtl_base_debug 1000 ./hello"*
> > hangs for indefinite time.
Strange. I do not know why this would hang.

> > Also *"/opt/SUNWhpc/HPC7.0/bin/mpirun -np 8 -mca mtl mx,sm,self -host
> > "indus1,indus2,indus3,indus4"  -mca mtl_base_debug 1000 ./hello"*
> > works fine
Again, you are falling back to using the tcp btl here. BTW, the mtl 
string 'mx,sm,self' is bogus: there are no sm or self MTLs.
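You can see which MTL components your installation actually has with ompi_info (assuming it is installed alongside mpirun in the same toolkit directory):

```shell
# List the MTL components built into this installation; only names shown
# here are valid values for "-mca mtl ...". There should be no sm or self
# entry, because on-node traffic is handled inside the MX library itself.
/opt/SUNWhpc/HPC7.0/bin/ompi_info | grep "MCA mtl"
```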

> >
> > *2.3 /opt/SUNWhpc/HPC7.0/bin/mpirun -np 8 -mca mtl mx -host
> > "indus1,indus2,indus3,indus4" -mca pml cm ./hello*
> >
> > This command hangs the machines for indefinite time.
> > Also *"/opt/SUNWhpc/HPC7.0/bin/mpirun -np 8 -mca mtl mx -host
> > "indus1,indus2,indus3,indus4" -mca pml cm  -mca mtl_base_debug 1000
> > ./hello"* hangs the
> > systems for indefinite time.
These two commands should have the exact same effect as the hang above.

> >
> > *2.4  /opt/SUNWhpc/HPC7.0/bin/mpirun -np 8 -mca mtl mx,sm,self -host
> > "indus1,indus2,indus3,indus4" -mca pml cm  -mca mtl_base_debug 1000
> > ./hello*
> >
> > This command hangs the machines for indefinite time.
Again, the mtl line here is bogus.

> >
> > Please notice that running more than four mpi processes hangs the
> > machines. Any suggestion please.
The first thing I would try is to see if a non-mpi application works. So try:
/opt/SUNWhpc/HPC7.0/bin/mpirun -np 8 -host "indus1,indus2,indus3,indus4" 
hostname

If that works, then try a simple MPI hello application that does no 
communication.
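If the plain `hostname` run works, a minimal MPI program like the following sketch (no point-to-point communication at all) is the next step; if this also hangs, the problem is in startup/initialization rather than in the MX message path. The filename and mpicc path are assumptions based on this thread's install layout:

```c
/* hello_norank.c: minimal MPI hello with no communication.
 * Compile with the toolkit's wrapper compiler, e.g.
 *   /opt/SUNWhpc/HPC7.0/bin/mpicc hello_norank.c -o hello_norank
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each process prints only its own identity; no messages are sent. */
    printf("hello from rank %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}
```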

Tim

Re: [OMPI users] OpenMPI Giving problems when using -mca btl mx, sm, self

2007-10-01 Thread Hammad Siddiqi

Dear Tim,

Your and Tim Mattox's suggestions yielded the following results:

*1.  /opt/SUNWhpc/HPC7.0/bin/mpirun -np 2 -mca btl mx,sm,self  -host 
"indus1,indus2" -mca btl_base_debug 1000 ./hello*


/opt/SUNWhpc/HPC7.0/bin/mpirun -np 4 -mca btl mx,sm,self  -host 
"indus1,indus2,indus3,indus4" -mca btl_base_debug 1000 ./hello

[indus1:29331] select: initializing btl component mx
[indus1:29331] select: init returned failure
[indus1:29331] select: module mx unloaded
[indus1:29331] select: initializing btl component sm
[indus1:29331] select: init returned success
[indus1:29331] select: initializing btl component self
[indus1:29331] select: init returned success
[indus3:13520] select: initializing btl component mx
[indus3:13520] select: init returned failure
[indus3:13520] select: module mx unloaded
[indus3:13520] select: initializing btl component sm
[indus3:13520] select: init returned success
[indus3:13520] select: initializing btl component self
[indus3:13520] select: init returned success
[indus4:15486] select: initializing btl component mx
[indus4:15486] select: init returned failure
[indus4:15486] select: module mx unloaded
[indus4:15486] select: initializing btl component sm
[indus4:15486] select: init returned success
[indus4:15486] select: initializing btl component self
[indus4:15486] select: init returned success
[indus2:11351] select: initializing btl component mx
[indus2:11351] select: init returned failure
[indus2:11351] select: module mx unloaded
[indus2:11351] select: initializing btl component sm
[indus2:11351] select: init returned success
[indus2:11351] select: initializing btl component self
[indus2:11351] select: init returned success
--
Process 0.1.0 is unable to reach 0.1.1 for MPI communication.
If you specified the use of a BTL component, you may have
forgotten a component (such as "self") in the list of
usable components.
--
--
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

 PML add procs failed
 --> Returned "Unreachable" (-12) instead of "Success" (0)
--
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (goodbye)
--
Process 0.1.1 is unable to reach 0.1.0 for MPI communication.
If you specified the use of a BTL component, you may have
forgotten a component (such as "self") in the list of
usable components.
--
--
Process 0.1.2 is unable to reach 0.1.0 for MPI communication.
If you specified the use of a BTL component, you may have
forgotten a component (such as "self") in the list of
usable components.
--
--
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during 
MPI_INIT--

Process 0.1.3 is unable to reach 0.1.0 for MPI communication.
If you specified the use of a BTL component, you may have
forgotten a component (such as "self") in the list of
usable components.
--
--
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

 ; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

 PML add procs failed
 --> Returned "Unreachable" (-12) instead of "Success" (0)
--
--
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel 

Re: [OMPI users] OpenMPI Giving problems when using -mca btl mx, sm, self

2007-09-29 Thread Hammad Siddiqi

Hi Terry,

Thanks for replying. The following command is working fine:

/opt/SUNWhpc/HPC7.0/bin/mpirun -np 4 -mca btl tcp,sm,self  -machinefile 
machines ./hello


The contents of machines are:
indus1
indus2
indus3
indus4

I have tried using np=2 over pairs of machines, but the problem is same.
The errors that occur are given below with the command that I am trying.

**Test 1**

/opt/SUNWhpc/HPC7.0/bin/mpirun -np 2 -mca btl mx,sm,self  -host 
"indus1,indus2" ./hello

--
Process 0.1.1 is unable to reach 0.1.0 for MPI communication.
If you specified the use of a BTL component, you may have
forgotten a component (such as "self") in the list of
usable components.
--
--
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

 PML add procs failed
 --> Returned "Unreachable" (-12) instead of "Success" (0)
--
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (goodbye)
--
Process 0.1.0 is unable to reach 0.1.1 for MPI communication.
If you specified the use of a BTL component, you may have
forgotten a component (such as "self") in the list of
usable components.
--
--
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

 PML add procs failed
 --> Returned "Unreachable" (-12) instead of "Success" (0)
--
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (goodbye)

**Test 2**

/opt/SUNWhpc/HPC7.0/bin/mpirun -np 2 -mca btl mx,sm,self  -host 
"indus1,indus3" ./hello

--
Process 0.1.0 is unable to reach 0.1.1 for MPI communication.
If you specified the use of a BTL component, you may have
forgotten a component (such as "self") in the list of
usable components.
--
--
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

 PML add procs failed
 --> Returned "Unreachable" (-12) instead of "Success" (0)
--
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (goodbye)
--
Process 0.1.1 is unable to reach 0.1.0 for MPI communication.
If you specified the use of a BTL component, you may have
forgotten a component (such as "self") in the list of
usable components.
--
--
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

 PML add procs failed
 --> Returned "Unreachable" (-12) instead of "Success" (0)
--
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (goodbye)
**Test 3**

/opt/SUNWhpc/HPC7.0/bin/mpirun -np 2 -mca btl mx,sm,self  -host 
"indus1,indus4" ./hello

--
Process 0.1.0 is unable to reach 0.1.1 for MPI communication.
If you specified 

[OMPI users] OpenMPI Giving problems when using -mca btl mx, sm, self

2007-09-28 Thread Hammad Siddiqi


Hello,

I am using Sun HPC Toolkit 7.0 to compile and run my C MPI programs.

I have tested the myrinet installations using myricoms own test programs.
The Myricom software stack I am using is MX, version mx2g-1.1.7; 
mx_mapper is also used.
We have 4 nodes (Sun Fire V890), each with 8 dual-core processors, and 
the operating system is Solaris 10 (SunOS indus1 5.10 Generic_125100-10 
sun4u sparc SUNW,Sun-Fire-V890).


The contents of machine file are:
indus1
indus2
indus3
indus4

The output of *mx_info* on each node is given below

==
*indus1*
==

MX Version: 1.1.7rc3cvs1_1_fixes
MX Build: @indus4:/opt/mx2g-1.1.7rc3 Thu May 31 11:36:59 PKT 2007
2 Myrinet boards installed.
The MX driver is configured to support up to 4 instances and 1024 nodes.
===
Instance #0:  333.2 MHz LANai, 66.7 MHz PCI bus, 2 MB SRAM
   Status: Running, P0: Link up
   MAC Address:00:60:dd:47:ad:7c
   Product code:   M3F-PCIXF-2
   Part number:09-03392
   Serial number:  297218
   Mapper: 00:60:dd:47:b3:e8, version = 0x7677b8ba, configured
   Mapped hosts:   10


   ROUTE COUNT
INDEXMAC ADDRESS HOST NAMEP0
---- 
----

  0) 00:60:dd:47:ad:7c indus1:0  1,1
  2) 00:60:dd:47:ad:68 indus4:0  8,3
  3) 00:60:dd:47:b3:e8 indus4:1  7,3
  4) 00:60:dd:47:b3:ab indus2:0  7,3
  5) 00:60:dd:47:ad:66 indus3:0  8,3
  6) 00:60:dd:47:ad:76 indus3:1  8,3
  7) 00:60:dd:47:ad:77 jhelum1:0 8,3
  8) 00:60:dd:47:b3:5a ravi2:0   8,3
  9) 00:60:dd:47:ad:5f ravi2:1   1,1
 10) 00:60:dd:47:b3:bf ravi1:0   8,3
===

==
*indus2*
==

MX Version: 1.1.7rc3cvs1_1_fixes
MX Build: @indus2:/opt/mx2g-1.1.7rc3 Thu May 31 11:24:03 PKT 2007
2 Myrinet boards installed.
The MX driver is configured to support up to 4 instances and 1024 nodes.
===
Instance #0:  333.2 MHz LANai, 66.7 MHz PCI bus, 2 MB SRAM
   Status: Running, P0: Link up
   MAC Address:00:60:dd:47:b3:ab
   Product code:   M3F-PCIXF-2
   Part number:09-03392
   Serial number:  296636
   Mapper: 00:60:dd:47:b3:e8, version = 0x7677b8ba, configured
   Mapped hosts:   10

   ROUTE COUNT
INDEXMAC ADDRESS HOST NAMEP0
---- ----
  0) 00:60:dd:47:b3:ab indus2:0  1,1
  2) 00:60:dd:47:ad:68 indus4:0  1,1
  3) 00:60:dd:47:b3:e8 indus4:1  8,3
  4) 00:60:dd:47:ad:66 indus3:0  1,1
  5) 00:60:dd:47:ad:76 indus3:1  7,3
  6) 00:60:dd:47:ad:77 jhelum1:0 7,3
  8) 00:60:dd:47:ad:7c indus1:0  8,3
  9) 00:60:dd:47:b3:5a ravi2:0   8,3
 10) 00:60:dd:47:ad:5f ravi2:1   8,3
 11) 00:60:dd:47:b3:bf ravi1:0   7,3
===
Instance #1:  333.2 MHz LANai, 66.7 MHz PCI bus, 2 MB SRAM
   Status: Running, P0: Link down
   MAC Address:00:60:dd:47:b3:c3
   Product code:   M3F-PCIXF-2
   Part number:09-03392
   Serial number:  296612
   Mapper: 00:60:dd:47:b3:e8, version = 0x7677b8ba, configured
   Mapped hosts:   10

==
*indus3*
==
MX Version: 1.1.7rc3cvs1_1_fixes
MX Build: @indus3:/opt/mx2g-1.1.7rc3 Thu May 31 11:29:03 PKT 2007
2 Myrinet boards installed.
The MX driver is configured to support up to 4 instances and 1024 nodes.
===
Instance #0:  333.2 MHz LANai, 66.7 MHz PCI bus, 2 MB SRAM
   Status: Running, P0: Link up
   MAC Address:00:60:dd:47:ad:66
   Product code:   M3F-PCIXF-2
   Part number:09-03392
   Serial number:  297240
   Mapper: 00:60:dd:47:b3:e8, version = 0x7677b8ba, configured
   Mapped hosts:   10

   ROUTE COUNT
INDEXMAC ADDRESS HOST NAMEP0
---- ----
  0) 00:60:dd:47:ad:66 indus3:0  1,1
  1)