[OMPI users] RE: Unable to connect to a server using MX MTL with TCP

2010-06-04 Thread Audet, Martin
Sorry,

I forgot the attachments...

Martin


From: users-boun...@open-mpi.org [users-boun...@open-mpi.org] on behalf of
Audet, Martin [martin.au...@imi.cnrc-nrc.gc.ca]
Sent: June 4, 2010 19:18
To: us...@open-mpi.org
Subject: [OMPI users] Unable to connect to a server using MX MTL with TCP

[OMPI users] Unable to connect to a server using MX MTL with TCP

2010-06-04 Thread Audet, Martin
Hi OpenMPI_Users and OpenMPI_Developers,

I'm unable to connect a client application using MPI_Comm_connect() to a server
job (the server job calls MPI_Open_port() before calling MPI_Comm_accept())
when the server job uses the MX MTL (although it works without problems when
the server uses the MX BTL). The server job runs on a cluster connected to a
Myrinet 10G network (MX 1.2.11) in addition to an ordinary Ethernet network.
The client runs on a different machine, not connected to the Myrinet network
but accessible via the Ethernet network.

Attached to this message are the simple server and client programs (87 lines
total), called simpleserver.c and simpleclient.c.
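
[The attachments are not reproduced in this archive. A minimal sketch of such
a server/client pair, assuming the usual MPI_Open_port()/MPI_Comm_accept()/
MPI_Comm_connect() pattern -- variable names and the payload exchange are
illustrative, not necessarily the actual attached code:]

   /* simpleserver.c -- illustrative sketch, not the actual attachment */
   #include <stdio.h>
   #include <mpi.h>

   int main(int argc, char **argv)
   {
       char port[MPI_MAX_PORT_NAME];
       MPI_Comm client;
       int value = 0;

       MPI_Init(&argc, &argv);
       MPI_Open_port(MPI_INFO_NULL, port);   /* obtain a connectable port */
       printf("Server port = '%s'\n", port);
       fflush(stdout);                       /* publish the port right away */
       MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &client);
       MPI_Recv(&value, 1, MPI_INT, 0, 0, client, MPI_STATUS_IGNORE);
       MPI_Comm_disconnect(&client);         /* the backtrace below points here */
       MPI_Close_port(port);
       MPI_Finalize();
       return 0;
   }

   /* simpleclient.c -- illustrative sketch; expects the port string as argv[1] */
   #include <stdio.h>
   #include <mpi.h>

   int main(int argc, char **argv)
   {
       MPI_Comm server;
       int value = 42;

       MPI_Init(&argc, &argv);
       MPI_Comm_connect(argv[1], MPI_INFO_NULL, 0, MPI_COMM_SELF, &server);
       MPI_Send(&value, 1, MPI_INT, 0, 0, server);
       MPI_Comm_disconnect(&server);
       MPI_Finalize();
       return 0;
   }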

Note that we are using Open MPI 1.4.2 on x86_64 Linux (server: Fedora 7,
client: Fedora 12).

Compiling these programs with mpicc on the server front node (fn1) and client 
workstation (linux15) works well:

   [audet@fn1 bench]$ mpicc simpleserver.c -o simpleserver

   [audet@linux15 mpi]$ mpicc simpleclient.c -o simpleclient

Then we start the server on the cluster (the job is started on cluster node
cn18), asking it to use the MX MTL:

   [audet@fn1 bench]$ mpiexec -x MX_RCACHE=2 -machinefile machinefile_cn18 
--mca mtl mx --mca pml cm -n 1 ./simpleserver

It prints the server port (note that we use MX_RCACHE=2 to avoid a warning;
it doesn't affect the current issue):

   Server port = 
'3548905472.0;tcp://172.17.15.20:39517+3548905473.0;tcp://172.17.10.18:47427:300'

Then starting the client on the workstation with this port number:

   [audet@linux15 mpi]$ mpiexec -n 1 ./simpleclient 
'3548905472.0;tcp://172.17.15.20:39517+3548905473.0;tcp://172.17.10.18:47427:300'

The server process core dumps as follows:

   MPI_Comm_accept() sucessful...
   [cn18:24582] *** Process received signal ***
   [cn18:24582] Signal: Segmentation fault (11)
   [cn18:24582] Signal code: Address not mapped (1)
   [cn18:24582] Failing at address: 0x38
   [cn18:24582] [ 0] /lib64/libpthread.so.0 [0x305de0dd20]
   [cn18:24582] [ 1] /usr/local/openmpi-1.4.2/lib/openmpi/mca_mtl_mx.so 
[0x2d6a7e6d]
   [cn18:24582] [ 2] /usr/local/openmpi-1.4.2/lib/openmpi/mca_pml_cm.so 
[0x2d4a319d]
   [cn18:24582] [ 3] 
/usr/local/openmpi/lib/libmpi.so.0(ompi_dpm_base_disconnect_init+0xbf) 
[0x2ab1403f]
   [cn18:24582] [ 4] /usr/local/openmpi-1.4.2/lib/openmpi/mca_dpm_orte.so 
[0x2ed0eb19]
   [cn18:24582] [ 5] 
/usr/local/openmpi/lib/libmpi.so.0(PMPI_Comm_disconnect+0xa0) [0x2aaf4f20]
   [cn18:24582] [ 6] ./simpleserver(main+0x14c) [0x400d04]
   [cn18:24582] [ 7] /lib64/libc.so.6(__libc_start_main+0xf4) [0x305ce1daa4]
   [cn18:24582] [ 8] ./simpleserver [0x400b09]
   [cn18:24582] *** End of error message ***
   --------------------------------------------------------------------------
   mpiexec noticed that process rank 0 with PID 24582 on node cn18 exited on
   signal 11 (Segmentation fault).
   --------------------------------------------------------------------------
   [audet@fn1 bench]$

And the client stops with the following error message:

   --------------------------------------------------------------------------
   At least one pair of MPI processes are unable to reach each other for
   MPI communications.  This means that no Open MPI device has indicated
   that it can be used to communicate between these processes.  This is
   an error; Open MPI requires that all MPI processes be able to reach
   each other.  This error can sometimes be the result of forgetting to
   specify the "self" BTL.

   Process 1 ([[31386,1],0]) is on host: linux15
   Process 2 ([[54152,1],0]) is on host: cn18
   BTLs attempted: self sm tcp

   Your MPI job is now going to abort; sorry.
   --------------------------------------------------------------------------
   MPI_Comm_connect() sucessful...
   Error in comm_disconnect_waitall
   [audet@linux15 mpi]$

I really don't understand this message, because the client can connect to the
server using TCP over Ethernet.

Moreover, if I add MCA options when starting the server to include the TCP
BTL, the same problem happens (the argument list then becomes '--mca mtl mx
--mca pml cm --mca btl tcp,shared,self').

However, if I remove all MCA options when starting the server (i.e. when the
MX BTL is used), no such problem appears. Everything also works fine if I
start the server with an explicit request to use the MX and TCP BTLs (e.g.
with options '--mca btl mx,tcp,sm,self').
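
[For quick comparison, the failing and working server invocations described
above; carrying MX_RCACHE=2 and the machinefile over to the BTL-only case is
an assumption based on the earlier command:]

   # crashes in MPI_Comm_disconnect() once a client has connected (MX MTL + cm PML)
   mpiexec -x MX_RCACHE=2 -machinefile machinefile_cn18 \
           --mca mtl mx --mca pml cm -n 1 ./simpleserver

   # works: explicit MX + TCP BTLs (the default ob1 PML is then used)
   mpiexec -x MX_RCACHE=2 -machinefile machinefile_cn18 \
           --mca btl mx,tcp,sm,self -n 1 ./simpleserver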

For running our server application we really prefer the MX MTL over the MX
BTL, since our application is much faster with the MTL (although the usual
ping-pong test is only slightly faster with the MTL).

Also enclosed is the output of 'ompi_info --all' run on the cluster node
(cn18) and the workstation (linux15).

Please help me. I think my problem is just a matter of wrong MCA parameters
(which remain obscure to me).

Thanks,

Martin Audet, Research Officer
Industrial Material Institute
National Research Council of Canada
75 de Mortagne, Boucherville, QC, J4B 6Y4, Canada



Re: [OMPI users] SM failure with mixed 32/64-bit procs on the same machine

2010-06-04 Thread Jeff Squyres
On Jun 4, 2010, at 2:18 PM, Katz, Jacob wrote:

> This would be a quite serious limitation from my point of view. I'm a library 
> developer, and my library is used in heterogeneous environments. Since 32-bit 
> executables regularly work on 64-bit machines, users end up intermixing them 
> with 64-bit executables on the same machine. Switching to another BTL would 
> incur serious performance issues...

You're really the first person to ask us for combined 32/64 bit *on the same 
machine*.

Just curious -- why would people still be compiling in 32 bit mode these days?

> I noticed an SM bug report that looks similar to mine and was reportedly 
> fixed in 1.4.2. I'm going to check that version. If it still fails, what 
> would be the effort to fix this?

No, that was for a different issue (32/64 bit *across different machines*) -- 
it won't fix this sm issue.  I doubt that any of us had really even thought 
about mixing 32/64 bit in the sm BTL before (I know I hadn't).  Indeed, we 
haven't had much demand for 32 bit support over the past few years (it's 
non-zero, but not large).

We try to guide OMPI's development by customer demand for features and 
platforms to support.  Although not a definitive measure, having only one 
person ask for a (potentially difficult to implement) feature is a good 
indicator that that's a feature only wanted/needed by a small number of users.  
FWIW, the 32/64 scenarios we've generally seen before have been for running an 
MPI job across multiple different flavors of hardware or OSs -- but we haven't 
seen much of that, either. 

All that being said, I'm *not* any kind of authoritative source of HPC 
knowledge that knows what every customer is doing -- for example, you obviously 
have a different perspective and viewpoint than me.  Can you give some kind of 
quantification about how important this kind of feature is to the general HPC 
community?  How many applications / users do this?  Do you know if other MPI 
implementations support it?

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




[OMPI users] Debug info on Darwin

2010-06-04 Thread Peter Thompson
We've had a couple of reports of users trying to debug with Open MPI and 
TotalView on Darwin and not being able to use the classic


mpirun -tv -np 4 ./foo

launch.  The typical problem shows up as something like

Can't find typedef for MPIR_PROCDESC

and then TotalView can't attach to the spawned processes.  While the 
Open MPI build may correctly compile the needed files with -g, the 
problem arises because on Darwin the DWARF info is kept in the .o 
files.  If those files are kept around, we can find that info and 
debugging works.  But if they are deleted after the build, or things 
are moved around, then we are unable to locate the .o files containing 
the debug info, and no one is pleased.

It was suggested by our CTO that if these files were compiled so as to 
produce STABS debug info rather than DWARF, the debug info would be 
copied into the executables and shared libraries, and we would then be 
able to debug with Open MPI without a problem.  I'm not sure if this is 
the best place to offer that suggestion, but I imagine it's not a bad 
place to start.  ;-)
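
[A hedged illustration of the suggestion: with GCC, plain -g emits DWARF,
which on Darwin stays in the .o files, while -gstabs+ emits STABS that gets
linked into the final binary.  Exact flags depend on the compiler version,
and the configure line is only an assumption about how this could be applied
to an Open MPI build:]

   cc -g -c foo.c           # DWARF: debug info stays in foo.o on Darwin
   cc -gstabs+ -c foo.c     # STABS: debug info travels into the executable

   # a STABS-enabled Open MPI build might then be configured as:
   ./configure CFLAGS=-gstabs+ CXXFLAGS=-gstabs+ ...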


Regards,
Peter Thompson



Re: [OMPI users] SM failure with mixed 32/64-bit procs on the same machine

2010-06-04 Thread Katz, Jacob
This would be a quite serious limitation from my point of view. I'm a library 
developer, and my library is used in heterogeneous environments. Since 32-bit 
executables regularly work on 64-bit machines, users end up intermixing them 
with 64-bit executables on the same machine. Switching to another BTL would 
incur serious performance issues...

I noticed an SM bug report that looks similar to mine and was reportedly fixed 
in 1.4.2. I'm going to check that version. If it still fails, what would be the 
effort to fix this?


Jacob M. Katz | jacob.k...@intel.com | Work: +972-4-865-5726 | iNet: 
(8)-465-5726


-----Original Message-----
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf 
Of Jeff Squyres
Sent: Friday, June 04, 2010 17:26
To: Open MPI Users
Subject: Re: [OMPI users] SM failure with mixed 32/64-bit procs on the same machine

I doubt that we have tested this kind of scenario much (specifically with 
shared memory).  I guess I'm not too surprised that it doesn't work -- to my 
knowledge, you're the first person to ask for heterogeneous *on the same 
server*.  As such, I don't know if we'll do much work to support it (there 
could be some gnarly issues with address ranges inside shared memory). 

But your point is noted that we should not hang/crash in such a scenario.  I'll 
file a bug to at least detect this scenario and indicate that we do not support 
it.



On Jun 3, 2010, at 10:29 AM, Katz, Jacob wrote:

> Hi,
> I have two processes, one 32-bit and the other 64-bit, running on the same 
> 64-bit machine. With the TCP BTL everything works fine; with the SM BTL, 
> however, it does not.
> In one application the processes just got stuck - one in Send and the other 
> in Recv. In another application I even saw a segfault inside the MPI 
> libraries in one of the processes.
> 
> Is such a scenario officially supported by the SM BTL?
>  
> Open MPI: 1.3.3
> Heterogeneous support: yes
>  
> Thanks.
> 
> Jacob M. Katz | jacob.k...@intel.com | Work: +972-4-865-5726 | iNet: 
> (8)-465-5726


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/






Re: [OMPI users] SM failure with mixed 32/64-bit procs on the same machine

2010-06-04 Thread Barrett, Brian W
Jeff -

Is indicating we don't support it really the right thing to do?  Given that SM 
should already have the proc data, it seems that setting the reachable bit to 
zero for the other process of a different "architecture" is all that is required.
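
[A hedged sketch of that idea -- not the actual mca_btl_sm_add_procs() code.
It assumes the 1.4-era layout, where each ompi_proc_t carries a proc_arch
word and add_procs() fills in an opal_bitmap_t of reachable peers:]

   /* In the sm BTL's add_procs(): mark a peer reachable only when it
    * matches the local architecture; otherwise leave its bit unset so
    * the PML falls back to another BTL (e.g. tcp).  Sketch only. */
   for (size_t i = 0; i < nprocs; i++) {
       if (procs[i]->proc_arch != ompi_proc_local()->proc_arch) {
           continue;                     /* 32/64-bit mismatch: skip sm */
       }
       opal_bitmap_set_bit(reachability, (int) i);
       /* ... normal shared-memory endpoint setup for this peer ... */
   }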

Brian

On Jun 4, 2010, at 8:26 AM, Jeff Squyres wrote:

> I doubt that we have tested this kind of scenario much (specifically with 
> shared memory).  I guess I'm not too surprised that it doesn't work -- to my 
> knowledge, you're the first person to ask for heterogeneous *on the same 
> server*.  As such, I don't know if we'll do much work to support it (there 
> could be some gnarly issues with address ranges inside shared memory). 
> 
> But your point is noted that we should not hang/crash in such a scenario.  
> I'll file a bug to at least detect this scenario and indicate that we do not 
> support it.
> 
> 
> 
> On Jun 3, 2010, at 10:29 AM, Katz, Jacob wrote:
> 
>> Hi,
>> I have two processes, one 32-bit and the other 64-bit, running on the same 
>> 64-bit machine. With the TCP BTL everything works fine; with the SM BTL, 
>> however, it does not.
>> In one application the processes just got stuck - one in Send and the other 
>> in Recv. In another application I even saw a segfault inside the MPI 
>> libraries in one of the processes.
>> 
>> Is such a scenario officially supported by the SM BTL?
>> 
>> Open MPI: 1.3.3
>> Heterogeneous support: yes
>> 
>> Thanks.
>> 
>> Jacob M. Katz | jacob.k...@intel.com | Work: +972-4-865-5726 | iNet: 
>> (8)-465-5726
> 
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/





Re: [OMPI users] SM failure with mixed 32/64-bit procs on the same machine

2010-06-04 Thread Jeff Squyres
I doubt that we have tested this kind of scenario much (specifically with 
shared memory).  I guess I'm not too surprised that it doesn't work -- to my 
knowledge, you're the first person to ask for heterogeneous *on the same 
server*.  As such, I don't know if we'll do much work to support it (there 
could be some gnarly issues with address ranges inside shared memory). 

But your point is noted that we should not hang/crash in such a scenario.  I'll 
file a bug to at least detect this scenario and indicate that we do not support 
it.



On Jun 3, 2010, at 10:29 AM, Katz, Jacob wrote:

> Hi,
> I have two processes, one 32-bit and the other 64-bit, running on the same 
> 64-bit machine. With the TCP BTL everything works fine; with the SM BTL, 
> however, it does not.
> In one application the processes just got stuck - one in Send and the other 
> in Recv. In another application I even saw a segfault inside the MPI 
> libraries in one of the processes.
> 
> Is such a scenario officially supported by the SM BTL?
>  
> Open MPI: 1.3.3
> Heterogeneous support: yes
>  
> Thanks.
> 
> Jacob M. Katz | jacob.k...@intel.com | Work: +972-4-865-5726 | iNet: 
> (8)-465-5726


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/