Re: [OMPI users] Fw: system() call corrupts MPI processes

2012-01-19 Thread Randolph Pullen
Aha! I may have been stupid.

The perl program is calling another small Open MPI routine via mpirun and
system().
That is bad, isn't it?

How can MPI tell that a program two system calls away is MPI?
A better question is: how can I trick it into not knowing that it's MPI, so it
runs just as it does when started manually?
Doh!


 From: Randolph Pullen <randolph_pul...@yahoo.com.au>
To: Ralph Castain <r...@open-mpi.org>; Open MPI Users <us...@open-mpi.org> 
Sent: Friday, 20 January 2012 2:17 PM
Subject: Re: [OMPI users] Fw:  system() call corrupts MPI processes
 

Removing the redirection to the log makes no difference.

Running the script externally is fine (method #1).  The problem only occurs
when the perl is started by the Open MPI process (method #2).
Both methods open a TCP socket, and both methods have the perl do a(nother)
system call.

Apart from MPI starting the perl, the two methods are identical.

BTW - the perl is the server; the Open MPI program is the client.



 From: Ralph Castain <r...@open-mpi.org>
To: Randolph Pullen <randolph_pul...@yahoo.com.au>; Open MPI Users 
<us...@open-mpi.org> 
Sent: Friday, 20 January 2012 1:57 PM
Subject: Re: [OMPI users] Fw:  system() call corrupts MPI processes
 

That is bizarre. Afraid you have me stumped here - I can't think why an action 
in the perl script would trigger an action in OMPI. If your OMPI proc doesn't 
in any way read the info in "log" (using your example), does it still have a 
problem? In other words, if the perl script still executes a system command, 
but the OMPI proc doesn't interact with it in any way, does the problem persist?

What I'm searching for is the connection. If your OMPI proc reads the results 
of that system command, then it's possible that something in your app is 
corrupting memory during the read operation - e.g., you are reading in more 
info than you have allocated memory.




On Jan 19, 2012, at 7:33 PM, Randolph Pullen wrote:

FYI
>- Forwarded Message -
>From: Randolph Pullen <randolph_pul...@yahoo.com.au>
>To: Jeff Squyres <jsquy...@cisco.com> 
>Sent: Friday, 20 January 2012 12:45 PM
>Subject: Re: [OMPI users] system() call corrupts MPI processes
> 
>
>I'm using TCP on 1.4.1 (it's actually IPoIB).
>OpenIB is compiled in.
>Note that these nodes are containers running in OpenVZ, where IB is not
>available.  There may be some SDP running in system-level routines on the VH,
>but this is unlikely.
>OpenIB is not available to the VMs; they happily get TCP services from the VH.
>In any case, the problem still occurs if I use: --mca btl tcp,self
>
>
>I have traced the perl code and observed that OpenMPI gets confused whenever
>the perl program executes a system command itself, e.g.:
>`command 2>&1 1> log`;
>
>
>This probably narrows it down (I hope)
>
>
> From: Jeff Squyres <jsquy...@cisco.com>
>To: Randolph Pullen <randolph_pul...@yahoo.com.au>; Open MPI Users 
><us...@open-mpi.org> 
>Sent: Friday, 20 January 2012 1:52 AM
>Subject: Re: [OMPI users] system() call corrupts MPI processes
> 
>Which network transport are you using, and what version of Open MPI are you 
>using?  Do you have OpenFabrics support compiled into your Open MPI 
>installation?
>
>If you're just using TCP and/or shared memory, I can't think of a reason
>immediately as to why this wouldn't work, but there may be a subtle
>interaction in there somewhere that causes badness (e.g., memory corruption).
>
>
>On Jan 19, 2012, at 1:57 AM, Randolph Pullen wrote:
>
>> 
>> I have a section in my code running in rank 0 that must start a perl program 
>> that it then connects to via a tcp socket.
>> The initialisation section is shown here:
>> 
>>     sprintf(buf, "%s/session_server.pl -p %d &", PATH,port);
>>     int i = system(buf);
>>     printf("system returned %d\n", i);
>> 
>> 
>> Some time after I run this code, while waiting for the data from the perl 
>> program, the error below occurs:
>> 
>> qplan connection
>> DCsession_fetch: waiting for Mcode data...
>> [dc1:05387] [[40050,1],0] ORTE_ERROR_LOG: A message is attempting to be sent 
>> to a process whose contact information is unknown in file rml_oob_send.c at 
>> line 105
>> [dc1:05387] [[40050,1],0] could not get route to [[INVALID],INVALID]
>> [dc1:05387] [[40050,1],0] ORTE_ERROR_LOG: A message is attempting to be sent 
>> to a process whose contact information is unknown in file 
>> base/plm_base_proxy.c at line 86
>> 
>> 
>> It seems that the linux system() call is breaking OpenMPI internal
>> connections.  Removing the system() call and executing the perl code
>> externally fixes the problem, but I can't go into production like that as
>> it's a security issue.
>> 
>> Any ideas?
>> 
>> (environment: OpenMPI 1.4.1 on kernel Linux dc1
>> 2.6.18-274.3.1.el5.028stab094.3  using TCP and mpirun)




