Re: [OMPI users] Fw: system() call corrupts MPI processes

2012-01-19 Thread Randolph Pullen
Aha! I may have been stupid.

The perl program is calling another small openMPI routine via mpirun and 
system()
That is bad isn't it?

How can MPI tell that a program 2 system calls away is MPI ?
A better question is how can I trick it to not knowing that its MPI so it runs 
just as it does when started manually ?
Doh !


 From: Randolph Pullen 
To: Ralph Castain ; Open MPI Users  
Sent: Friday, 20 January 2012 2:17 PM
Subject: Re: [OMPI users] Fw:  system() call corrupts MPI processes
 

Removing the redirection to the log makes no difference.

Running the script externally is fine (method #1).  Only occurs when the perl 
is started by the OpenMPI process (method #2)
Both methods open a TCP socket and both methods have the perl do a(nother) 
system call

Appart from MPI starting the perl both methods are identical.

BTW - the perl is the server the openMPI is the client.



 From: Ralph Castain 
To: Randolph Pullen ; Open MPI Users 
 
Sent: Friday, 20 January 2012 1:57 PM
Subject: Re: [OMPI users] Fw:  system() call corrupts MPI processes
 

That is bizarre. Afraid you have me stumped here - I can't think why an action 
in the perl script would trigger an action in OMPI. If your OMPI proc doesn't 
in any way read the info in "log" (using your example), does it still have a 
problem? In other words, if the perl script still executes a system command, 
but the OMPI proc doesn't interact with it in any way, does the problem persist?

What I'm searching for is the connection. If your OMPI proc reads the results 
of that system command, then it's possible that something in your app is 
corrupting memory during the read operation - e.g., you are reading in more 
info than you have allocated memory.




On Jan 19, 2012, at 7:33 PM, Randolph Pullen wrote:

FYI
>- Forwarded Message -
>From: Randolph Pullen 
>To: Jeff Squyres  
>Sent: Friday, 20 January 2012 12:45 PM
>Subject: Re: [OMPI users] system() call corrupts MPI processes
> 
>
>I'm using TCP on 1.4.1 (its actually IPoIB)
>OpenIB is compiled in.
>Note that these nodes are containers running in OpenVZ where IB is not 
>available.  there may be some SDP running in system level routines on the VH 
>but this is unlikely.
>OpenIB is not available to the VMs.  they happily get TCP services from the VH
>In any case, the problem still occurs if I use: --mca btl tcp,self
>
>
>I have traced the perl code and observed that OpenMPI gets confused whenever 
>the perl program executes a system command itself
>eg:
>`command 2>&1 1> log`;
>
>
>This probably narrows it down (I hope)
>
>
> From: Jeff Squyres 
>To: Randolph Pullen ; Open MPI Users 
> 
>Sent: Friday, 20 January 2012 1:52 AM
>Subject: Re: [OMPI users] system() call corrupts MPI processes
> 
>Which network transport are you using, and what version of Open MPI are you 
>using?  Do you have OpenFabrics support compiled into your Open MPI 
>installation?
>
>If you're just using TCP and/or shared memory, I can't think of a reason 
>immediately as to why this wouldn't work, but there may be a subtle 
>interaction in there
 somewhere that causes badness (e.g., memory corruption).
>
>
>On Jan 19, 2012, at 1:57 AM, Randolph Pullen wrote:
>
>> 
>> I have a section in my code running in rank 0 that must start a perl program 
>> that it then connects to via a tcp socket.
>> The initialisation section is shown here:
>> 
>>     sprintf(buf, "%s/session_server.pl -p %d &", PATH,port);
>>     int i = system(buf);
>>     printf("system returned %d\n", i);
>> 
>> 
>> Some time after I run this code, while waiting for the data from the perl 
>> program, the error below occurs:
>> 
>> qplan connection
>> DCsession_fetch: waiting for Mcode data...
>> [dc1:05387] [[40050,1],0] ORTE_ERROR_LOG: A message is attempting to be sent 
>> to a process whose contact information is unknown in file rml_oob_send.c at 
>> line 105
>> [dc1:05387] [[40050,1],0] could not get route to
 [[INVALID],INVALID]
>> [dc1:05387] [[40050,1],0] ORTE_ERROR_LOG: A message is attempting to be sent 
>> to a process whose contact information is unknown in file 
>> base/plm_base_proxy.c at line 86
>> 
>> 
>> It seems that the linux system() call is breaking OpenMPI internal 
>> connections.  Removing the system() call and executing the perl code 
>> externaly fixes the problem but I can't go into production like that as its 
>> a security issue.
>> 
>> Any ideas ?
>> 
>> (environment: OpenMPI 1.4.1 on kernel Linux dc1 
>> 2.6.18-274.3.1.el5.028stab094.3  using TCP and mpirun)
>> ___
>> users mailing list
>> us...@open-mpi.org
>> 

Re: [OMPI users] Fw: system() call corrupts MPI processes

2012-01-19 Thread Randolph Pullen
Removing the redirection to the log makes no difference.

Running the script externally is fine (method #1).  Only occurs when the perl 
is started by the OpenMPI process (method #2)
Both methods open a TCP socket and both methods have the perl do a(nother) 
system call

Appart from MPI starting the perl both methods are identical.

BTW - the perl is the server the openMPI is the client.



 From: Ralph Castain 
To: Randolph Pullen ; Open MPI Users 
 
Sent: Friday, 20 January 2012 1:57 PM
Subject: Re: [OMPI users] Fw:  system() call corrupts MPI processes
 

That is bizarre. Afraid you have me stumped here - I can't think why an action 
in the perl script would trigger an action in OMPI. If your OMPI proc doesn't 
in any way read the info in "log" (using your example), does it still have a 
problem? In other words, if the perl script still executes a system command, 
but the OMPI proc doesn't interact with it in any way, does the problem persist?

What I'm searching for is the connection. If your OMPI proc reads the results 
of that system command, then it's possible that something in your app is 
corrupting memory during the read operation - e.g., you are reading in more 
info than you have allocated memory.




On Jan 19, 2012, at 7:33 PM, Randolph Pullen wrote:

FYI
>- Forwarded Message -
>From: Randolph Pullen 
>To: Jeff Squyres  
>Sent: Friday, 20 January 2012 12:45 PM
>Subject: Re: [OMPI users] system() call corrupts MPI processes
> 
>
>I'm using TCP on 1.4.1 (its actually IPoIB)
>OpenIB is compiled in.
>Note that these nodes are containers running in OpenVZ where IB is not 
>available.  there may be some SDP running in system level routines on the VH 
>but this is unlikely.
>OpenIB is not available to the VMs.  they happily get TCP services from the VH
>In any case, the problem still occurs if I use: --mca btl tcp,self
>
>
>I have traced the perl code and observed that OpenMPI gets confused whenever 
>the perl program executes a system command itself
>eg:
>`command 2>&1 1> log`;
>
>
>This probably narrows it down (I hope)
>
>
> From: Jeff Squyres 
>To: Randolph Pullen ; Open MPI Users 
> 
>Sent: Friday, 20 January 2012 1:52 AM
>Subject: Re: [OMPI users] system() call corrupts MPI processes
> 
>Which network transport are you using, and what version of Open MPI are you 
>using?  Do you have OpenFabrics support compiled into your Open MPI 
>installation?
>
>If you're just using TCP and/or shared memory, I can't think of a reason 
>immediately as to why this wouldn't work, but there may be a subtle 
>interaction in there
 somewhere that causes badness (e.g., memory corruption).
>
>
>On Jan 19, 2012, at 1:57 AM, Randolph Pullen wrote:
>
>> 
>> I have a section in my code running in rank 0 that must start a perl program 
>> that it then connects to via a tcp socket.
>> The initialisation section is shown here:
>> 
>>     sprintf(buf, "%s/session_server.pl -p %d &", PATH,port);
>>     int i = system(buf);
>>     printf("system returned %d\n", i);
>> 
>> 
>> Some time after I run this code, while waiting for the data from the perl 
>> program, the error below occurs:
>> 
>> qplan connection
>> DCsession_fetch: waiting for Mcode data...
>> [dc1:05387] [[40050,1],0] ORTE_ERROR_LOG: A message is attempting to be sent 
>> to a process whose contact information is unknown in file rml_oob_send.c at 
>> line 105
>> [dc1:05387] [[40050,1],0] could not get route to
 [[INVALID],INVALID]
>> [dc1:05387] [[40050,1],0] ORTE_ERROR_LOG: A message is attempting to be sent 
>> to a process whose contact information is unknown in file 
>> base/plm_base_proxy.c at line 86
>> 
>> 
>> It seems that the linux system() call is breaking OpenMPI internal 
>> connections.  Removing the system() call and executing the perl code 
>> externaly fixes the problem but I can't go into production like that as its 
>> a security issue.
>> 
>> Any ideas ?
>> 
>> (environment: OpenMPI 1.4.1 on kernel Linux dc1 
>> 2.6.18-274.3.1.el5.028stab094.3  using TCP and mpirun)
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
>-- 
>Jeff
 Squyres
>jsquy...@cisco.com
>For corporate legal information go to:
>http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
>
>
>
>___
>users mailing list
>us...@open-mpi.org
>http://www.open-mpi.org/mailman/listinfo.cgi/users

Re: [OMPI users] Fw: system() call corrupts MPI processes

2012-01-19 Thread Ralph Castain
That is bizarre. Afraid you have me stumped here - I can't think why an action 
in the perl script would trigger an action in OMPI. If your OMPI proc doesn't 
in any way read the info in "log" (using your example), does it still have a 
problem? In other words, if the perl script still executes a system command, 
but the OMPI proc doesn't interact with it in any way, does the problem persist?

What I'm searching for is the connection. If your OMPI proc reads the results 
of that system command, then it's possible that something in your app is 
corrupting memory during the read operation - e.g., you are reading in more 
info than you have allocated memory.


On Jan 19, 2012, at 7:33 PM, Randolph Pullen wrote:

> FYI
> - Forwarded Message -
> From: Randolph Pullen 
> To: Jeff Squyres  
> Sent: Friday, 20 January 2012 12:45 PM
> Subject: Re: [OMPI users] system() call corrupts MPI processes
> 
> I'm using TCP on 1.4.1 (its actually IPoIB)
> OpenIB is compiled in.
> Note that these nodes are containers running in OpenVZ where IB is not 
> available.  there may be some SDP running in system level routines on the VH 
> but this is unlikely.
> OpenIB is not available to the VMs.  they happily get TCP services from the VH
> In any case, the problem still occurs if I use: --mca btl tcp,self
> 
> I have traced the perl code and observed that OpenMPI gets confused whenever 
> the perl program executes a system command itself
> eg:
> `command 2>&1 1> log`;
> 
> This probably narrows it down (I hope)
> From: Jeff Squyres 
> To: Randolph Pullen ; Open MPI Users 
>  
> Sent: Friday, 20 January 2012 1:52 AM
> Subject: Re: [OMPI users] system() call corrupts MPI processes
> 
> Which network transport are you using, and what version of Open MPI are you 
> using?  Do you have OpenFabrics support compiled into your Open MPI 
> installation?
> 
> If you're just using TCP and/or shared memory, I can't think of a reason 
> immediately as to why this wouldn't work, but there may be a subtle 
> interaction in there somewhere that causes badness (e.g., memory corruption).
> 
> 
> On Jan 19, 2012, at 1:57 AM, Randolph Pullen wrote:
> 
> > 
> > I have a section in my code running in rank 0 that must start a perl 
> > program that it then connects to via a tcp socket.
> > The initialisation section is shown here:
> > 
> >sprintf(buf, "%s/session_server.pl -p %d &", PATH,port);
> >int i = system(buf);
> >printf("system returned %d\n", i);
> > 
> > 
> > Some time after I run this code, while waiting for the data from the perl 
> > program, the error below occurs:
> > 
> > qplan connection
> > DCsession_fetch: waiting for Mcode data...
> > [dc1:05387] [[40050,1],0] ORTE_ERROR_LOG: A message is attempting to be 
> > sent to a process whose contact information is unknown in file 
> > rml_oob_send.c at line 105
> > [dc1:05387] [[40050,1],0] could not get route to [[INVALID],INVALID]
> > [dc1:05387] [[40050,1],0] ORTE_ERROR_LOG: A message is attempting to be 
> > sent to a process whose contact information is unknown in file 
> > base/plm_base_proxy.c at line 86
> > 
> > 
> > It seems that the linux system() call is breaking OpenMPI internal 
> > connections.  Removing the system() call and executing the perl code 
> > externaly fixes the problem but I can't go into production like that as its 
> > a security issue.
> > 
> > Any ideas ?
> > 
> > (environment: OpenMPI 1.4.1 on kernel Linux dc1 
> > 2.6.18-274.3.1.el5.028stab094.3  using TCP and mpirun)
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 
> 
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users



[OMPI users] Fw: system() call corrupts MPI processes

2012-01-19 Thread Randolph Pullen


FYI
- Forwarded Message -
From: Randolph Pullen 
To: Jeff Squyres  
Sent: Friday, 20 January 2012 12:45 PM
Subject: Re: [OMPI users] system() call corrupts MPI processes
 

I'm using TCP on 1.4.1 (its actually IPoIB)
OpenIB is compiled in.
Note that these nodes are containers running in OpenVZ where IB is not 
available.  there may be some SDP running in system level routines on the VH 
but this is unlikely.
OpenIB is not available to the VMs.  they happily get TCP services from the VH
In any case, the problem still occurs if I use: --mca btl tcp,self

I have traced the perl code and observed that OpenMPI gets confused whenever 
the perl program executes a system command itself
eg:
`command 2>&1 1> log`;

This probably narrows it down (I hope)


 From: Jeff Squyres 
To: Randolph Pullen ; Open MPI Users 
 
Sent: Friday, 20 January 2012 1:52 AM
Subject: Re: [OMPI users] system() call corrupts MPI processes
 
Which network transport are you using, and what version of Open MPI are you 
using?  Do you have OpenFabrics support compiled into your Open MPI 
installation?

If you're just using TCP and/or shared memory, I can't think of a reason 
immediately as to why this wouldn't work, but there may be a subtle interaction 
in there
 somewhere that causes badness (e.g., memory corruption).


On Jan 19, 2012, at 1:57 AM, Randolph Pullen wrote:

> 
> I have a section in my code running in rank 0 that must start a perl program 
> that it then connects to via a tcp socket.
> The initialisation section is shown here:
> 
>     sprintf(buf, "%s/session_server.pl -p %d &", PATH,port);
>     int i = system(buf);
>     printf("system returned %d\n", i);
> 
> 
> Some time after I run this code, while waiting for the data from the perl 
> program, the error below occurs:
> 
> qplan connection
> DCsession_fetch: waiting for Mcode data...
> [dc1:05387] [[40050,1],0] ORTE_ERROR_LOG: A message is attempting to be sent 
> to a process whose contact information is unknown in file rml_oob_send.c at 
> line 105
> [dc1:05387] [[40050,1],0] could not get route to
 [[INVALID],INVALID]
> [dc1:05387] [[40050,1],0] ORTE_ERROR_LOG: A message is attempting to be sent 
> to a process whose contact information is unknown in file 
> base/plm_base_proxy.c at line 86
> 
> 
> It seems that the linux system() call is breaking OpenMPI internal 
> connections.  Removing the system() call and executing the perl code 
> externaly fixes the problem but I can't go into production like that as its a 
> security issue.
> 
> Any ideas ?
> 
> (environment: OpenMPI 1.4.1 on kernel Linux dc1 
> 2.6.18-274.3.1.el5.028stab094.3  using TCP and mpirun)
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff
 Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/

Re: [OMPI users] system() call corrupts MPI processes

2012-01-19 Thread Ralph Castain
Hmmm..well, unless you happened to guess poorly, I would think it unusually bad 
luck to have happened to pick the wrong port! OMPI should be selecting one via 
the OS, so the port it gets is somewhat random.

I'm trying to understand what your MPI proc is doing. The function that is 
complaining is one used only for a call to MPI_Comm_spawn - are you executing 
such a call? This message normally would go to mpirun, which your proc should 
certainly know.


On Jan 19, 2012, at 5:54 PM, Randolph Pullen wrote:

> Hi Ralph,
> 
> The port is defined in config as 5000 it is used by both versions so you 
> would think both would fail if where an issue.
> Is there any way of reserving ports for non MPI use?
> 
> From: Ralph Castain 
> To: Open MPI Users  
> Sent: Friday, 20 January 2012 10:30 AM
> Subject: Re: [OMPI users] system() call corrupts MPI processes
> 
> Hi Randolph!
> 
> Sorry for delay - was on the road. This isn't an issue of corruption. What 
> ORTE is complaining about is that your perl script wound up connecting to the 
> same port that your process is listening on via ORTE. ORTE is rather 
> particular about the message format - specifically, it requires a header that 
> includes the name of the process and a bunch of other stuff.
> 
> Where did you get the port that you are passing to your perl script?
> 
> 
> On Jan 19, 2012, at 8:22 AM, Durga Choudhury wrote:
> 
> > This is just a thought:
> > 
> > according to the system() man page, 'SIGCHLD' is blocked during the
> > execution of the program. Since you are executing your command as a
> > daemon in the background, it will be permanently blocked.
> > 
> > Does OpenMPI daemon depend on SIGCHLD in any way? That is about the
> > only difference that I can think of between running the command
> > stand-alone (which works) and running via a system() API call (that
> > does not work).
> > 
> > Best
> > Durga
> > 
> > 
> > On Thu, Jan 19, 2012 at 9:52 AM, Jeff Squyres  wrote:
> >> Which network transport are you using, and what version of Open MPI are 
> >> you using?  Do you have OpenFabrics support compiled into your Open MPI 
> >> installation?
> >> 
> >> If you're just using TCP and/or shared memory, I can't think of a reason 
> >> immediately as to why this wouldn't work, but there may be a subtle 
> >> interaction in there somewhere that causes badness (e.g., memory 
> >> corruption).
> >> 
> >> 
> >> On Jan 19, 2012, at 1:57 AM, Randolph Pullen wrote:
> >> 
> >>> 
> >>> I have a section in my code running in rank 0 that must start a perl 
> >>> program that it then connects to via a tcp socket.
> >>> The initialisation section is shown here:
> >>> 
> >>>sprintf(buf, "%s/session_server.pl -p %d &", PATH,port);
> >>>int i = system(buf);
> >>>printf("system returned %d\n", i);
> >>> 
> >>> 
> >>> Some time after I run this code, while waiting for the data from the perl 
> >>> program, the error below occurs:
> >>> 
> >>> qplan connection
> >>> DCsession_fetch: waiting for Mcode data...
> >>> [dc1:05387] [[40050,1],0] ORTE_ERROR_LOG: A message is attempting to be 
> >>> sent to a process whose contact information is unknown in file 
> >>> rml_oob_send.c at line 105
> >>> [dc1:05387] [[40050,1],0] could not get route to [[INVALID],INVALID]
> >>> [dc1:05387] [[40050,1],0] ORTE_ERROR_LOG: A message is attempting to be 
> >>> sent to a process whose contact information is unknown in file 
> >>> base/plm_base_proxy.c at line 86
> >>> 
> >>> 
> >>> It seems that the linux system() call is breaking OpenMPI internal 
> >>> connections.  Removing the system() call and executing the perl code 
> >>> externaly fixes the problem but I can't go into production like that as 
> >>> its a security issue.
> >>> 
> >>> Any ideas ?
> >>> 
> >>> (environment: OpenMPI 1.4.1 on kernel Linux dc1 
> >>> 2.6.18-274.3.1.el5.028stab094.3  using TCP and mpirun)
> >>> ___
> >>> users mailing list
> >>> us...@open-mpi.org
> >>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> >> 
> >> 
> >> --
> >> Jeff Squyres
> >> jsquy...@cisco.com
> >> For corporate legal information go to:
> >> http://www.cisco.com/web/about/doing_business/legal/cri/
> >> 
> >> 
> >> ___
> >> users mailing list
> >> us...@open-mpi.org
> >> http://www.open-mpi.org/mailman/listinfo.cgi/users
> > 
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] system() call corrupts MPI processes

2012-01-19 Thread Randolph Pullen
Hi Ralph,

The port is defined in config as 5000 it is used by both versions so you would 
think both would fail if where an issue.
Is there any way of reserving ports for non MPI use?



 From: Ralph Castain 
To: Open MPI Users  
Sent: Friday, 20 January 2012 10:30 AM
Subject: Re: [OMPI users] system() call corrupts MPI processes
 
Hi Randolph!

Sorry for delay - was on the road. This isn't an issue of corruption. What ORTE 
is complaining about is that your perl script wound up connecting to the same 
port that your process is listening on via ORTE. ORTE is rather particular 
about the message format - specifically, it requires a header that includes the 
name of the process and a bunch of other stuff.

Where did you get the port that you are passing to your perl script?


On Jan 19, 2012, at 8:22 AM, Durga Choudhury wrote:

> This is just a thought:
> 
> according to the system() man page, 'SIGCHLD' is blocked during the
> execution of the program. Since you are executing your command as a
> daemon in the background, it will be permanently blocked.
> 
> Does OpenMPI daemon depend on SIGCHLD in any way? That is about the
> only difference that I can think of between running the command
> stand-alone (which works) and running via a system() API call (that
> does not work).
> 
> Best
> Durga
> 
> 
> On Thu, Jan 19, 2012 at 9:52 AM, Jeff Squyres  wrote:
>> Which network transport are you using, and what version of Open MPI are you 
>> using?  Do you have OpenFabrics support compiled into your Open MPI 
>> installation?
>> 
>> If you're just using TCP and/or shared memory, I can't think of a reason 
>> immediately as to why this wouldn't work, but there may be a subtle 
>> interaction in there somewhere that causes badness (e.g., memory corruption).
>> 
>> 
>> On Jan 19, 2012, at 1:57 AM, Randolph Pullen wrote:
>> 
>>> 
>>> I have a section in my code running in rank 0 that must start a perl 
>>> program that it then connects to via a tcp socket.
>>> The initialisation section is shown here:
>>> 
>>>     sprintf(buf, "%s/session_server.pl -p %d &", PATH,port);
>>>     int i = system(buf);
>>>     printf("system returned %d\n", i);
>>> 
>>> 
>>> Some time after I run this code, while waiting for the data from the perl 
>>> program, the error below occurs:
>>> 
>>> qplan connection
>>> DCsession_fetch: waiting for Mcode data...
>>> [dc1:05387] [[40050,1],0] ORTE_ERROR_LOG: A message is attempting to be 
>>> sent to a process whose contact information is unknown in file 
>>> rml_oob_send.c at line 105
>>> [dc1:05387] [[40050,1],0] could not get route to [[INVALID],INVALID]
>>> [dc1:05387] [[40050,1],0] ORTE_ERROR_LOG: A message is attempting to be 
>>> sent to a process whose contact information is unknown in file 
>>> base/plm_base_proxy.c at line 86
>>> 
>>> 
>>> It seems that the linux system() call is breaking OpenMPI internal 
>>> connections.  Removing the system() call and executing the perl code 
>>> externaly fixes the problem but I can't go into production like that as its 
>>> a security issue.
>>> 
>>> Any ideas ?
>>> 
>>> (environment: OpenMPI 1.4.1 on kernel Linux dc1 
>>> 2.6.18-274.3.1.el5.028stab094.3  using TCP and mpirun)
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
>> 
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>> 
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

Re: [OMPI users] system() call corrupts MPI processes

2012-01-19 Thread Randolph Pullen
I assume that the SIGCHLD was released after starting the daemon ie on return 
of the system() call



 From: Durga Choudhury 
To: Open MPI Users  
Sent: Friday, 20 January 2012 2:22 AM
Subject: Re: [OMPI users] system() call corrupts MPI processes
 
This is just a thought:

according to the system() man page, 'SIGCHLD' is blocked during the
execution of the program. Since you are executing your command as a
daemon in the background, it will be permanently blocked.

Does OpenMPI daemon depend on SIGCHLD in any way? That is about the
only difference that I can think of between running the command
stand-alone (which works) and running via a system() API call (that
does not work).

Best
Durga


On Thu, Jan 19, 2012 at 9:52 AM, Jeff Squyres  wrote:
> Which network transport are you using, and what version of Open MPI are you 
> using?  Do you have OpenFabrics support compiled into your Open MPI 
> installation?
>
> If you're just using TCP and/or shared memory, I can't think of a reason 
> immediately as to why this wouldn't work, but there may be a subtle 
> interaction in there somewhere that causes badness (e.g., memory corruption).
>
>
> On Jan 19, 2012, at 1:57 AM, Randolph Pullen wrote:
>
>>
>> I have a section in my code running in rank 0 that must start a perl program 
>> that it then connects to via a tcp socket.
>> The initialisation section is shown here:
>>
>>     sprintf(buf, "%s/session_server.pl -p %d &", PATH,port);
>>     int i = system(buf);
>>     printf("system returned %d\n", i);
>>
>>
>> Some time after I run this code, while waiting for the data from the perl 
>> program, the error below occurs:
>>
>> qplan connection
>> DCsession_fetch: waiting for Mcode data...
>> [dc1:05387] [[40050,1],0] ORTE_ERROR_LOG: A message is attempting to be sent 
>> to a process whose contact information is unknown in file rml_oob_send.c at 
>> line 105
>> [dc1:05387] [[40050,1],0] could not get route to [[INVALID],INVALID]
>> [dc1:05387] [[40050,1],0] ORTE_ERROR_LOG: A message is attempting to be sent 
>> to a process whose contact information is unknown in file 
>> base/plm_base_proxy.c at line 86
>>
>>
>> It seems that the linux system() call is breaking OpenMPI internal 
>> connections.  Removing the system() call and executing the perl code 
>> externaly fixes the problem but I can't go into production like that as its 
>> a security issue.
>>
>> Any ideas ?
>>
>> (environment: OpenMPI 1.4.1 on kernel Linux dc1 
>> 2.6.18-274.3.1.el5.028stab094.3  using TCP and mpirun)
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

Re: [hwloc-users] Bogus files in 64bit Windows binary distribution (1.4rc1)

2012-01-19 Thread Samuel Thibault
Hartmut Kaiser, le Fri 20 Jan 2012 00:43:32 +0100, a écrit :
> > Hartmut Kaiser, le Thu 19 Jan 2012 22:48:50 +0100, a écrit :
> > > We are using hwloc with VS2010 and were happy to realize that after
> > > the (for
> > > us) totally broken Windows binary distribution in V1.3
> > 
> > Broken?  How so?  It worked for me.
> 
> Try it, the autoconf/config.h has settings not compatible with VC++, for
> instance:
> 
> /* Maybe before gcc 2.95 too */
> #if !defined(HWLOC_HAVE_ATTRIBUTE_UNUSED) && defined(__GNUC__)
> # define HWLOC_HAVE_ATTRIBUTE_UNUSED 1
> #else
> # define HWLOC_HAVE_ATTRIBUTE_UNUSED 1
> #endif
> #if HWLOC_HAVE_ATTRIBUTE_UNUSED
> # define __hwloc_attribute_unused __attribute__((__unused__))
> #else
> # define __hwloc_attribute_unused
> #endif
> 
> etc. This essentially always defines __hwloc_attribute_unused to expand to
> the __attribute__() (from hwloc-win64-build-1.3.1.zip).

Ok, so the problem is not actually in the binaries, but the headers :)

This was also reported in another case and already fixed for the next
1.3 release.

Samuel


Re: [hwloc-users] Bogus files in 64bit Windows binary distribution (1.4rc1)

2012-01-19 Thread Hartmut Kaiser
> Hartmut Kaiser, le Thu 19 Jan 2012 22:48:50 +0100, a écrit :
> > We are using hwloc with VS2010 and were happy to realize that after
> > the (for
> > us) totally broken Windows binary distribution in V1.3
> 
> Broken?  How so?  It worked for me.

Try it, the autoconf/config.h has settings not compatible with VC++, for
instance:

/* Maybe before gcc 2.95 too */
#if !defined(HWLOC_HAVE_ATTRIBUTE_UNUSED) && defined(__GNUC__)
# define HWLOC_HAVE_ATTRIBUTE_UNUSED 1
#else
# define HWLOC_HAVE_ATTRIBUTE_UNUSED 1
#endif
#if HWLOC_HAVE_ATTRIBUTE_UNUSED
# define __hwloc_attribute_unused __attribute__((__unused__))
#else
# define __hwloc_attribute_unused
#endif

etc. This essentially always defines __hwloc_attribute_unused to expand to
the __attribute__() (from hwloc-win64-build-1.3.1.zip).

> Not-reported bugs are usually not fixed.

Sure, I was about to report it when I found the V1.4rc1 to be usable.

> > Some investigation showed that the file libhwloc.lib was compiled for
> > 32bit and therefore causes trouble in 64bit builds.
> 
> Oh, that's possible indeed, I need to fix the build script to pass
> whatever flag is needed to make a 64bit .lib.  You should be able to do it
> from the provided .def file, using the lib.exe tool from VS.

Yep, that's what I did.

> > While it is trivial to regenerate the corresponding 64bit import
> > library from the supplied definition file, it would be nice to be able
> > to directly use the distributive from your site.
> 
> Sure, thanks for the report!

Regards Hartmut
---
http://boost-spirit.com
http://stellar.cct.lsu.edu





Re: [OMPI users] system() call corrupts MPI processes

2012-01-19 Thread Ralph Castain
Hi Randolph!

Sorry for delay - was on the road. This isn't an issue of corruption. What ORTE 
is complaining about is that your perl script wound up connecting to the same 
port that your process is listening on via ORTE. ORTE is rather particular 
about the message format - specifically, it requires a header that includes the 
name of the process and a bunch of other stuff.

Where did you get the port that you are passing to your perl script?


On Jan 19, 2012, at 8:22 AM, Durga Choudhury wrote:

> This is just a thought:
> 
> according to the system() man page, 'SIGCHLD' is blocked during the
> execution of the program. Since you are executing your command as a
> daemon in the background, it will be permanently blocked.
> 
> Does OpenMPI daemon depend on SIGCHLD in any way? That is about the
> only difference that I can think of between running the command
> stand-alone (which works) and running via a system() API call (that
> does not work).
> 
> Best
> Durga
> 
> 
> On Thu, Jan 19, 2012 at 9:52 AM, Jeff Squyres  wrote:
>> Which network transport are you using, and what version of Open MPI are you 
>> using?  Do you have OpenFabrics support compiled into your Open MPI 
>> installation?
>> 
>> If you're just using TCP and/or shared memory, I can't think of a reason 
>> immediately as to why this wouldn't work, but there may be a subtle 
>> interaction in there somewhere that causes badness (e.g., memory corruption).
>> 
>> 
>> On Jan 19, 2012, at 1:57 AM, Randolph Pullen wrote:
>> 
>>> 
>>> I have a section in my code running in rank 0 that must start a perl 
>>> program that it then connects to via a tcp socket.
>>> The initialisation section is shown here:
>>> 
>>> sprintf(buf, "%s/session_server.pl -p %d &", PATH,port);
>>> int i = system(buf);
>>> printf("system returned %d\n", i);
>>> 
>>> 
>>> Some time after I run this code, while waiting for the data from the perl 
>>> program, the error below occurs:
>>> 
>>> qplan connection
>>> DCsession_fetch: waiting for Mcode data...
>>> [dc1:05387] [[40050,1],0] ORTE_ERROR_LOG: A message is attempting to be 
>>> sent to a process whose contact information is unknown in file 
>>> rml_oob_send.c at line 105
>>> [dc1:05387] [[40050,1],0] could not get route to [[INVALID],INVALID]
>>> [dc1:05387] [[40050,1],0] ORTE_ERROR_LOG: A message is attempting to be 
>>> sent to a process whose contact information is unknown in file 
>>> base/plm_base_proxy.c at line 86
>>> 
>>> 
>>> It seems that the linux system() call is breaking OpenMPI internal 
>>> connections.  Removing the system() call and executing the perl code 
>>> externaly fixes the problem but I can't go into production like that as its 
>>> a security issue.
>>> 
>>> Any ideas ?
>>> 
>>> (environment: OpenMPI 1.4.1 on kernel Linux dc1 
>>> 2.6.18-274.3.1.el5.028stab094.3  using TCP and mpirun)
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
>> 
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>> 
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [hwloc-users] removing old cpuset API?

2012-01-19 Thread Αλέξανδρος Παπαδογιαννάκης

Ermm localhost?

--
Alexandros Papadogiannakis


> Date: Thu, 19 Jan 2012 22:28:14 +0100
> From: brice.gog...@inria.fr
> To: hwloc-us...@open-mpi.org
> Subject: [hwloc-users] removing old cpuset API?
> 
> Dear hwloc users,
> 
> The cpuset API (hwloc_cpuset_*) was replaced by the bitmap API
> (hwloc_bitmap_*) in v1.1.0, back in december 2010. We kept backward
> compatibility by #defin'ing the old API on top of the new one. So you
> may stil use the old API in your application (but you would get
> "deprecated" warning).
> 
> Now, we're thinking of removing this compatiblity layer one day or
> another. You would have to upgrade your application to the new API. If
> your application must still work with old hwloc too, you may support
> both API by #defin'ing the new API on top of the old one as explained at
> the end of http://localhost/hwloc/projects/hwloc/doc/v1.3.1/a00010.php
> 
> So, is anybody against removing the hwloc/cpuset.h compatibility layer
> in the near future (not before v1.5, which may mean Spring 2012) and
> letting application deal with this if they really need it?
> 
> Brice
> 
> ___
> hwloc-users mailing list
> hwloc-us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-users
  

[hwloc-users] removing old cpuset API?

2012-01-19 Thread Brice Goglin
Dear hwloc users,

The cpuset API (hwloc_cpuset_*) was replaced by the bitmap API
(hwloc_bitmap_*) in v1.1.0, back in december 2010. We kept backward
compatibility by #defin'ing the old API on top of the new one. So you
may stil use the old API in your application (but you would get
"deprecated" warning).

Now, we're thinking of removing this compatiblity layer one day or
another. You would have to upgrade your application to the new API. If
your application must still work with old hwloc too, you may support
both API by #defin'ing the new API on top of the old one as explained at
the end of http://localhost/hwloc/projects/hwloc/doc/v1.3.1/a00010.php

So, is anybody against removing the hwloc/cpuset.h compatibility layer
in the near future (not before v1.5, which may mean Spring 2012) and
letting application deal with this if they really need it?

Brice



Re: [OMPI users] localhost only

2012-01-19 Thread Ralph Castain
No argument - we should support it. I was on travel and just got back, so I'll 
take a look and see why we aren't doing so.

On Jan 17, 2012, at 9:38 AM, Gustavo Correa wrote:

> But that would be yet another legitimate singleton-edge-case that OpenMPI 
> could
> proudly handle, wouldn't it?
> No network connection, yet OpenMPI is still operational in a standalone 
> insular machine.
> IMHO, MM made a good case for his 'commuter MPI'.  :)
> Times change, I would never have thought that people are doing 
> parallel programming and HPC in the subway.
> Gus Correa
> 
> On Jan 17, 2012, at 11:25 AM, Ralph Castain wrote:
> 
>> I think it won't help - it looks like mpirun itself aborts if it only finds 
>> a loopback available.
>> 
>> On Tue, Jan 17, 2012 at 9:24 AM, Gustavo Correa  
>> wrote:
>> MM
>> Have you tried adding '-mca btl sm,self' to your mpirun command line,
>> as suggested by Terry? [despite the low chances that it would work ...]
>> If somehow the loopback interface is up, wouldn't it work?
>> Gus Correa
>> 
>> On Jan 17, 2012, at 7:01 AM, MM wrote:
>> 
>>> Gus, unfortunately, it doesn't seem to change the error.
>>> Ralph,  with the wireless adapter disabled, netstat on winxp still shows 
>>> these ports as listening:
>>> Shouldn't the MS TCP Loopback interface allow the tcp ports to be created?
>>> 
>>> 
 netstat -an
>>> 
>>> Active Connections
>>> 
>>>  Proto  Local Address  Foreign AddressState
>>>  TCP0.0.0.0:1350.0.0.0:0  LISTENING
>>>  TCP0.0.0.0:4450.0.0.0:0  LISTENING
>>>  TCP0.0.0.0:2967   0.0.0.0:0  LISTENING
>>>  TCP0.0.0.0:3389   0.0.0.0:0  LISTENING
>>>  TCP0.0.0.0:4445   0.0.0.0:0  LISTENING
>>>  TCP0.0.0.0:57632  0.0.0.0:0  LISTENING
>>>  TCP127.0.0.1:1025 0.0.0.0:0  LISTENING
>>>  TCP127.0.0.1:625140.0.0.0:0  LISTENING
>>> 
 route print
>>> ===
>>> Interface List
>>> 0x1 ... MS TCP Loopback interface
>>> 0x2 ...00 24 d6 10 05 4e .. Intel(R) WiFi Link 5100 AGN - Packet 
>>> Scheduler Miniport
>>> ===
>>> ===
>>> Active Routes:
>>> Network DestinationNetmask  Gateway   Interface  Metric
>>>127.0.0.0255.0.0.0127.0.0.1   127.0.0.1   1
>>>  255.255.255.255  255.255.255.255  255.255.255.255   2   1
>>> ===
>>> Persistent Routes:
>>>  None
>>> 
>>> -Original Message-
>>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On 
>>> Behalf Of Gustavo Correa
>>> Sent: 16 January 2012 23:54
>>> To: Open MPI Users
>>> Subject: Re: [OMPI users] localhost only
>>> 
>>> Have you tried to specify the hosts with something like this?
>>> 
>>> mpirun -np 2 -host localhost ./my_program
>>> 
>>> See 'man mpirun' for more details.
>>> 
>>> I hope it helps,
>>> Gus Correa
>>> 
>>> On Jan 16, 2012, at 6:34 PM, MM wrote:
>>> 
 hi,
 
 when my wireless adapter is down on my laptop, only localhost is 
 configured.
 In this case, when I mpirun 2 binaries on my laptop, mpirun fails with 
 this error:
 
 
 It looks like orte_init failed for some reason; your parallel process i
 likely to abort.  There are many reasons that a parallel process can
 fail during orte_init; some of which are due to configuration or
 environment problems.  This failure appears to be an internal failure;
 here's some additional information (which may only be relevant to an
 Open MPI developer):
 
  orte_rml_base_select failed
  --> Returned value Error (-1) instead of ORTE_SUCCESS
 
 
 
 when I turn on the wireless adapter back on, the mpirun works fine
 
 Is there a way to make mpirun realize all my binaries run on the same box, 
 and therefore don't need any other interface but localhost?
 
 PS: this is ipconfig when the wireless adapter is off
 
> ipconfig /all
 
 Windows IP Configuration
 
Host Name . . . . . . . . . . . . :
Primary Dns Suffix  . . . . . . . :
Node Type . . . . . . . . . . . . : Hybrid
IP Routing Enabled. . . . . . . . : No
WINS Proxy Enabled. . . . . . . . : No
 
 Ethernet adapter Wireless Network Connection:
 
Media State . . . . . . . . . . . : Media disconnected
Description . . . . . . . . . . . : Intel(R) WiFi Link 5100 AGN
Physical Address. . . . . . . . . :
 
 rds,
 
 MM
 

Re: [OMPI users] Possible bug in finalize, OpenMPI v1.5, head revision

2012-01-19 Thread Ralph Castain
I didn't commit anything to the v1.5 branch yesterday - just the trunk.

As I told Mike off-list, I think it may have been that the binary was compiled 
against a different OMPI version by mistake. It looks very much like what I'd 
expect to have happen in that scenario.

On Jan 19, 2012, at 7:52 AM, Jeff Squyres wrote:

> Did you "svn up"?  I ask because Ralph committed some stuff yesterday that 
> may have fixed this.
> 
> 
> On Jan 18, 2012, at 5:19 PM, Andrew Senin wrote:
> 
>> No, nothing specific. Only basic settings (--mca btl openib,self
>> --npernode 1, etc).
>> 
>> Actually I'm were confused with this error because today it just
>> disapeared. I had 2 separate folders where it was reproduced in 100%
>> of test runs. Today I recompiled the source and it is gone in both
>> folders. But yesterday I tried recompiling multiple times with no
>> effect. So I believe this must be somehow related to some unknown
>> settings in the lab which have been changed. Trying to reproduce the
>> crash now...
>> 
>> Regards,
>> Andrew Senin.
>> 
>> On Thu, Jan 19, 2012 at 12:05 AM, Jeff Squyres  wrote:
>>> Jumping in pretty late in this thread here...
>>> 
>>> I see that it's failing in opal_hwloc_base_close().  That's a little 
>>> worrysome.
>>> 
>>> I do see an odd path through the hwloc initialization that *could* cause an 
>>> error during finalization -- but it would involve you setting an invalid 
>>> value for an MCA parameter.  Are you setting 
>>> hwloc_base_mem_bind_failure_action or
>>> hwloc_base_mem_alloc_policy, perchance?
>>> 
>>> 
>>> On Jan 16, 2012, at 1:56 PM, Andrew Senin wrote:
>>> 
 Hi,
 
 I think I've found a bug in the hear revision of the OpenMPI 1.5
 branch. If it is configured with --disable-debug it crashes in
 finalize on the hello_c.c example. Did I miss something out?
 
 Configure options:
 ./configure --with-pmi=/usr/ --with-slurm=/usr/ --without-psm
 --disable-debug --enable-mpirun-prefix-by-default
 --prefix=/hpc/home/USERS/senina/projects/distribs/openmpi-svn_v1.5/install
 
 Runtime command and output:
 LD_LIBRARY_PATH=$LD_LIBRARY_PATH:../lib ./mpirun --mca btl openib,self
 --npernode 1 --host mir1,mir2 ./hello
 
 Hello, world, I am 0 of 2
 Hello, world, I am 1 of 2
 [mir1:05542] *** Process received signal ***
 [mir1:05542] Signal: Segmentation fault (11)
 [mir1:05542] Signal code: Address not mapped (1)
 [mir1:05542] Failing at address: 0xe8
 [mir2:10218] *** Process received signal ***
 [mir2:10218] Signal: Segmentation fault (11)
 [mir2:10218] Signal code: Address not mapped (1)
 [mir2:10218] Failing at address: 0xe8
 [mir1:05542] [ 0] /lib64/libpthread.so.0() [0x390d20f4c0]
 [mir1:05542] [ 1]
 /hpc/home/USERS/senina/projects/distribs/openmpi-svn_v1.5/install/lib/libmpi.so.1(+0x1346a8)
 [0x7f4588cee6a8]
 [mir1:05542] [ 2]
 /hpc/home/USERS/senina/projects/distribs/openmpi-svn_v1.5/install/lib/libmpi.so.1(opal_hwloc_base_close+0x32)
 [0x7f4588cee700]
 [mir1:05542] [ 3]
 /hpc/home/USERS/senina/projects/distribs/openmpi-svn_v1.5/install/lib/libmpi.so.1(opal_finalize+0x73)
 [0x7f4588d1beb2]
 [mir1:05542] [ 4]
 /hpc/home/USERS/senina/projects/distribs/openmpi-svn_v1.5/install/lib/libmpi.so.1(orte_finalize+0xfe)
 [0x7f4588c81eb5]
 [mir1:05542] [ 5]
 /hpc/home/USERS/senina/projects/distribs/openmpi-svn_v1.5/install/lib/libmpi.so.1(ompi_mpi_finalize+0x67a)
 [0x7f4588c217c3]
 [mir1:05542] [ 6]
 /hpc/home/USERS/senina/projects/distribs/openmpi-svn_v1.5/install/lib/libmpi.so.1(PMPI_Finalize+0x59)
 [0x7f4588c39959]
 [mir1:05542] [ 7] ./hello(main+0x69) [0x4008fd]
 [mir1:05542] [ 8] /lib64/libc.so.6(__libc_start_main+0xfd) [0x390ca1ec5d]
 [mir1:05542] [ 9] ./hello() [0x4007d9]
 [mir1:05542] *** End of error message ***
 [mir2:10218] [ 0] /lib64/libpthread.so.0() [0x3a6dc0f4c0]
 [mir2:10218] [ 1]
 /hpc/home/USERS/senina/projects/distribs/openmpi-svn_v1.5/install/lib/libmpi.so.1(+0x1346a8)
 [0x7f409f31d6a8]
 [mir2:10218] [ 2]
 /hpc/home/USERS/senina/projects/distribs/openmpi-svn_v1.5/install/lib/libmpi.so.1(opal_hwloc_base_close+0x32)
 [0x7f409f31d700]
 [mir2:10218] [ 3]
 /hpc/home/USERS/senina/projects/distribs/openmpi-svn_v1.5/install/lib/libmpi.so.1(opal_finalize+0x73)
 [0x7f409f34aeb2]
 [mir2:10218] [ 4]
 /hpc/home/USERS/senina/projects/distribs/openmpi-svn_v1.5/install/lib/libmpi.so.1(orte_finalize+0xfe)
 [0x7f409f2b0eb5]
 [mir2:10218] [ 5]
 /hpc/home/USERS/senina/projects/distribs/openmpi-svn_v1.5/install/lib/libmpi.so.1(ompi_mpi_finalize+0x67a)
 [0x7f409f2507c3]
 [mir2:10218] [ 6]
 /hpc/home/USERS/senina/projects/distribs/openmpi-svn_v1.5/install/lib/libmpi.so.1(PMPI_Finalize+0x59)
 [0x7f409f268959]
 [mir2:10218] [ 7] ./hello(main+0x69) [0x4008fd]
 [mir2:10218] [ 8] 

Re: [OMPI users] MPI_Type_struct for template class with dynamic arrays and objs. instantiated from other classes

2012-01-19 Thread Jeff Squyres
Have you looked at boost.mpi?  They have some nice C++-friendly constructors 
for different types to send/receive via MPI.

If boost.mpi doesn't do what you want, you'll likely need to have a custom MPI 
datatype constructed for each instance that you want to send/receive (and have 
that same datatype at both the sender and receiver), because both the internal 
types in the instance and the sizes of the arrays may differ in each instance.


On Jan 18, 2012, at 12:36 AM, Victor Pomponiu wrote:

> Hi, 
> 
> for several days I am trying to create MPI derived datatype in order to 
> send/receive an user defined object. I'm trying to implement it using 
> MPI::Datatype::Create_struct. 
> I have consulted several threads from the archive
> 
> http://www.open-mpi.org/community/lists/users/2012/01/18093.php
> http://www.open-mpi.org/community/lists/users/2005/08/0123.php
> http://www.open-mpi.org/community/lists/users/2008/08/6302.php
> 
> but I'm still havening difficulties to solve this issue.
> There are some particular features that makes the task more difficult. Let me 
> explain: the obj. that I want to transmit is instantiated from a class called 
> MemberBlock. This class is a template class and contains: dynamic allocating 
> arrays, and objs. instantiated from other classes. Bellow is the class 
> declaration.
> 
> Therefore how can I construct a MPI dervied data type in this situation? Any 
> suggestions are highly appreciated 
> 
> Thank you,
> Victor Pomponiu
> 
> -
> /**
>  * VecData.h: Interface class for data appearing in vector format.
>  */
> # include "DistData.h" //Interface class for data having a pairwise 
> distance measure.
> 
> class VecData: public DistData {
> 
> public:
> // no properties, only public/private methods;
> .
> }
> 
> /**
>  * VecDataBlock.h:Base class for storable data having a pairwise 
>  * distance measure.
> */ 
> 
> class VecDataBlock {
> 
> public:
>   VecData** dataList;   // Array of data items for this block.
>   
>   int numItems;   // Number of data items assigned to the 
> block.
>   int blockID;  // Integer identifier for this block.
>   int sampleID;   // The sample identifier for this block 
> 
>   int globalOffset;   // Index of the first block item relative 
> to the full data set.
>   char* fileNamePrefix;   // The file name prefix used for saving data to 
> disk.
>   char commentChar; // The character denoting input comment lines.
> 
> // methods ..
> }
> 
> 
> /**
>  * MemberBlock.h:   Class storing and managing member lists for a given
>  *block of data objects.
>  */
> 
> class MemberBlock_base {
> public:
>   virtual ~MemberBlock_base () {};
> };
> 
> template 
> class MemberBlock: public MemberBlock_base {
> 
> public:
>   char* fileNamePrefix; // The file name prefix for the block save 
> file.
>   ofstream* saveFile;   // refers to an open file currently being 
> used for accumulating
>   VecDataBlock* dataBlock; // The block of data items upon which  
>  
>   int globalOffset;// The position of this block with respect 
> to the global ordering.
>   int numItems;  // The number of data items assigned to the 
> block.
>   int sampleLevel;  // The sample level from which
> 
>   ScoreType** memberScoreLList;  // the scores of members associated with
>//   each data the item.   
>   
>   int** memberIndexLList;//  for each data item a list of global indices 
> of its members.   
>   int* memberSizeList;//   the number of list members.
> 
>   int memberListBufferSize;   // buffer size for storing an individual member 
> list.
>   int saveCount;// Keeps track of the number of member lists  
> saved 
>   float* tempDistBuffer;  // A temporary buffer for storing distances, 
> used for breaking
> 
> //methods
> }
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] Bug Report for MPI_Alltoall

2012-01-19 Thread Jeff Squyres
Thanks for the bug report!

I have no idea how this slipped through the cracks, but IN_PLACE support for 
MPI_ALLTOALL seems to be missing.  :-(

I've filed a bug about it: https://svn.open-mpi.org/trac/ompi/ticket/2965.


On Jan 18, 2012, at 4:38 PM, David Race wrote:

> One of our users makes use of the MPI_IN_PLACE option, but there appears to 
> be a bug in the MPI_Alltoall.  According to the specification -
> 
> 
> The “in place” option for intracommunicators is specified by passing 
> MPI_IN_PLACE to
> the argument sendbuf at all processes. In such a case, sendcount and sendtype 
> are ignored.
> The data to be sent is taken from the recvbuf and replaced by the received 
> data. Data sent
> and received must have the same type map as specified by recvcount and 
> recvtype.
> 
> The application fails with 
> 
> 
> [prod-0002:12156] *** An error occurred in MPI_Alltoall
> [prod-0002:12156] *** on communicator MPI_COMM_WORLD
> [prod-0002:12156] *** MPI_ERR_ARG: invalid argument of some other kind
> [prod-0002:12156] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> 
> The file below shows the potential bug:
> 
> 
> 
> //
> //  test program for the potential invalid argument bug
> //
> //  David Race
> //  18 Jan 2012
> //
> 
> #include 
> #include 
> #include 
> #include 
> //
> //  mpi
> //
> #include "mpi.h"
> #define  MAX_SIZE 32
> //
> //
> //
> int main ( int argc, char *argv[] )
> {
>   //
>   //  definitions
>   //
>   int mpierror, isize, myRank; 
>   int typeSize;
>   int valA[MAX_SIZE], valB[MAX_SIZE];
>   int i, j;
>   int commRoot;
>   //
>   //  start processing
>   //
>   printf("Start of program\n");
>   printf("SIZE OF VALA %ld\n",sizeof(valA));
> 
>   mpierror = MPI_Init ( ,  );
>   mpierror = MPI_Comm_rank ( MPI_COMM_WORLD,  );
>   mpierror = MPI_Comm_size ( MPI_COMM_WORLD,  );
>   MPI_Barrier(MPI_COMM_WORLD);
>   //
>   //  test the mpi_type_size using MPI_Alltoall
>   //
>   if (myRank == 0) {
>   
> printf("=\n");
>   printf("   Alltoall : Should work
> \n");
>   
> printf("=\n");
>   }
>   fflush(stdout);
>   for(i=0;i   valA[i] = i;
>   valB[i] = -1;
>   }
>   commRoot = 0;
>   MPI_Barrier(MPI_COMM_WORLD);
>   mpierror = MPI_Alltoall(valA, 1, MPI_INT, valB, 1, MPI_INT, 
> MPI_COMM_WORLD);
>   MPI_Barrier(MPI_COMM_WORLD);
>   for (j=0;j   MPI_Barrier(MPI_COMM_WORLD);
>   if (myRank == j) {
>   for(i=0;i from node %d is %d\n",myRank, i, valA[i]);
>   }
>   fflush(stdout);
>   }
>   //
>   //  test the mpi_type_size using MPI_Alltoall
>   //
>   MPI_Barrier(MPI_COMM_WORLD);
>   if (myRank == 0) {
>   
> printf("=\n");
>   printf("Alltoall :   
> \n");
>   
> printf("=\n");
>   }
>   fflush(stdout);
>   for(i=0;i   commRoot = 0;
>   MPI_Barrier(MPI_COMM_WORLD);
>   //
>   //  The error occurs here
>   //
>   mpierror = MPI_Alltoall(MPI_IN_PLACE, 1, MPI_INT, valA, 1, MPI_INT, 
> MPI_COMM_WORLD);
>   MPI_Barrier(MPI_COMM_WORLD);
>   if (myRank == 0) {
>   for(i=0;i %d is %d\n",myRank, i,valA[i]);
>   }
>   fflush(stdout);
>   MPI_Barrier(MPI_COMM_WORLD);
>   if (myRank == 1) {
>   for(i=0;i %d is %d\n",myRank, i, valA[i]);
>   }
>   fflush(stdout);
>   MPI_Barrier(MPI_COMM_WORLD);
>   //
>   //  test the mpi_type_size using MPI_Alltoall
>   //
>   if (myRank == 0) {
>   
> printf("=\n");
>   printf("   Alltoall : Failure with some MPI  
> \n");
>   
> printf("=\n");
>   }
>   fflush(stdout);
>   for(i=0;i   commRoot = 0;
>   MPI_Barrier(MPI_COMM_WORLD);
>   //
>   //  The error occurs here
>   //
>   mpierror = MPI_Alltoall(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL, valA, 1, 
> MPI_INT, MPI_COMM_WORLD);
>   MPI_Barrier(MPI_COMM_WORLD);
>   if (myRank == 0) {
>   for(i=0;i %d 

Re: [OMPI users] system() call corrupts MPI processes

2012-01-19 Thread Durga Choudhury
This is just a thought:

according to the system() man page, 'SIGCHLD' is blocked during the
execution of the program. Since you are executing your command as a
daemon in the background, it will be permanently blocked.

Does OpenMPI daemon depend on SIGCHLD in any way? That is about the
only difference that I can think of between running the command
stand-alone (which works) and running via a system() API call (that
does not work).

Best
Durga


On Thu, Jan 19, 2012 at 9:52 AM, Jeff Squyres  wrote:
> Which network transport are you using, and what version of Open MPI are you 
> using?  Do you have OpenFabrics support compiled into your Open MPI 
> installation?
>
> If you're just using TCP and/or shared memory, I can't think of a reason 
> immediately as to why this wouldn't work, but there may be a subtle 
> interaction in there somewhere that causes badness (e.g., memory corruption).
>
>
> On Jan 19, 2012, at 1:57 AM, Randolph Pullen wrote:
>
>>
>> I have a section in my code running in rank 0 that must start a perl program 
>> that it then connects to via a tcp socket.
>> The initialisation section is shown here:
>>
>>     sprintf(buf, "%s/session_server.pl -p %d &", PATH,port);
>>     int i = system(buf);
>>     printf("system returned %d\n", i);
>>
>>
>> Some time after I run this code, while waiting for the data from the perl 
>> program, the error below occurs:
>>
>> qplan connection
>> DCsession_fetch: waiting for Mcode data...
>> [dc1:05387] [[40050,1],0] ORTE_ERROR_LOG: A message is attempting to be sent 
>> to a process whose contact information is unknown in file rml_oob_send.c at 
>> line 105
>> [dc1:05387] [[40050,1],0] could not get route to [[INVALID],INVALID]
>> [dc1:05387] [[40050,1],0] ORTE_ERROR_LOG: A message is attempting to be sent 
>> to a process whose contact information is unknown in file 
>> base/plm_base_proxy.c at line 86
>>
>>
>> It seems that the linux system() call is breaking OpenMPI internal 
>> connections.  Removing the system() call and executing the perl code 
>> externaly fixes the problem but I can't go into production like that as its 
>> a security issue.
>>
>> Any ideas ?
>>
>> (environment: OpenMPI 1.4.1 on kernel Linux dc1 
>> 2.6.18-274.3.1.el5.028stab094.3  using TCP and mpirun)
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] Checkpoint an MPI process

2012-01-19 Thread Lloyd Brown
Since you're looking for a function call, I'm going to assume that you
are writing this application, and it's not a pre-compiled, commercial
application.  Given that, it's going to be significantly better to have
an internal application checkpointing mechanism, where it serializes and
stores the data, etc., than to use an external, applicaiton-agnostic
checkpointing mechanism like BLCR or similar.  The application should be
aware of what data is important, how to most efficiently store it, etc.
 A generic library has to assume that everything is important, and store
it all.

Don't get me wrong.  Libraries like BLCR are great for applications that
don't have that visibility, and even as a tool for the
application-internal checkpointing mechanism (where the application
deliberately interacts with the library to annotate what's important to
store, and how to do so, etc.).  But if you're writing the application,
you're better off to handle it internally, than externally.

Lloyd Brown
Systems Administrator
Fulton Supercomputing Lab
Brigham Young University
http://marylou.byu.edu

On 01/19/2012 08:05 AM, Josh Hursey wrote:
> Currently Open MPI only supports the checkpointing of the whole
> application. There has been some work on uncoordinated checkpointing
> with message logging, though I do not know the state of that work with
> regards to availability. That work has been undertaken by the University
> of Tennessee Knoxville, so maybe they can provide more information.
> 
> -- Josh
> 
> On Wed, Jan 18, 2012 at 3:24 PM, Rodrigo Oliveira
> > wrote:
> 
> Hi,
> 
> I'd like to know if there is a way to checkpoint a specific process
> running under an mpirun call. In other words, is there a function
> CHECKPOINT(rank) in which I can pass the rank of the process I want
> to checkpoint? I do not want to checkpoint the entire application,
> but just one of its processes.
> 
> Thanks
> 
> ___
> users mailing list
> us...@open-mpi.org 
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 
> 
> 
> -- 
> Joshua Hursey
> Postdoctoral Research Associate
> Oak Ridge National Laboratory
> http://users.nccs.gov/~jjhursey
> 
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


Re: [OMPI users] Checkpoint an MPI process

2012-01-19 Thread Josh Hursey
Currently Open MPI only supports the checkpointing of the whole
application. There has been some work on uncoordinated checkpointing with
message logging, though I do not know the state of that work with regards
to availability. That work has been undertaken by the University of
Tennessee Knoxville, so maybe they can provide more information.

-- Josh

On Wed, Jan 18, 2012 at 3:24 PM, Rodrigo Oliveira  wrote:

> Hi,
>
> I'd like to know if there is a way to checkpoint a specific process
> running under an mpirun call. In other words, is there a function
> CHECKPOINT(rank) in which I can pass the rank of the process I want to
> checkpoint? I do not want to checkpoint the entire application, but just
> one of its processes.
>
> Thanks
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>



-- 
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey


Re: [OMPI users] Possible bug in finalize, OpenMPI v1.5, head revision

2012-01-19 Thread Jeff Squyres
Did you "svn up"?  I ask because Ralph committed some stuff yesterday that may 
have fixed this.


On Jan 18, 2012, at 5:19 PM, Andrew Senin wrote:

> No, nothing specific. Only basic settings (--mca btl openib,self
> --npernode 1, etc).
> 
> Actually I'm were confused with this error because today it just
> disapeared. I had 2 separate folders where it was reproduced in 100%
> of test runs. Today I recompiled the source and it is gone in both
> folders. But yesterday I tried recompiling multiple times with no
> effect. So I believe this must be somehow related to some unknown
> settings in the lab which have been changed. Trying to reproduce the
> crash now...
> 
> Regards,
> Andrew Senin.
> 
> On Thu, Jan 19, 2012 at 12:05 AM, Jeff Squyres  wrote:
>> Jumping in pretty late in this thread here...
>> 
>> I see that it's failing in opal_hwloc_base_close().  That's a little 
>> worrysome.
>> 
>> I do see an odd path through the hwloc initialization that *could* cause an 
>> error during finalization -- but it would involve you setting an invalid 
>> value for an MCA parameter.  Are you setting 
>> hwloc_base_mem_bind_failure_action or
>> hwloc_base_mem_alloc_policy, perchance?
>> 
>> 
>> On Jan 16, 2012, at 1:56 PM, Andrew Senin wrote:
>> 
>>> Hi,
>>> 
>>> I think I've found a bug in the hear revision of the OpenMPI 1.5
>>> branch. If it is configured with --disable-debug it crashes in
>>> finalize on the hello_c.c example. Did I miss something out?
>>> 
>>> Configure options:
>>> ./configure --with-pmi=/usr/ --with-slurm=/usr/ --without-psm
>>> --disable-debug --enable-mpirun-prefix-by-default
>>> --prefix=/hpc/home/USERS/senina/projects/distribs/openmpi-svn_v1.5/install
>>> 
>>> Runtime command and output:
>>> LD_LIBRARY_PATH=$LD_LIBRARY_PATH:../lib ./mpirun --mca btl openib,self
>>> --npernode 1 --host mir1,mir2 ./hello
>>> 
>>> Hello, world, I am 0 of 2
>>> Hello, world, I am 1 of 2
>>> [mir1:05542] *** Process received signal ***
>>> [mir1:05542] Signal: Segmentation fault (11)
>>> [mir1:05542] Signal code: Address not mapped (1)
>>> [mir1:05542] Failing at address: 0xe8
>>> [mir2:10218] *** Process received signal ***
>>> [mir2:10218] Signal: Segmentation fault (11)
>>> [mir2:10218] Signal code: Address not mapped (1)
>>> [mir2:10218] Failing at address: 0xe8
>>> [mir1:05542] [ 0] /lib64/libpthread.so.0() [0x390d20f4c0]
>>> [mir1:05542] [ 1]
>>> /hpc/home/USERS/senina/projects/distribs/openmpi-svn_v1.5/install/lib/libmpi.so.1(+0x1346a8)
>>> [0x7f4588cee6a8]
>>> [mir1:05542] [ 2]
>>> /hpc/home/USERS/senina/projects/distribs/openmpi-svn_v1.5/install/lib/libmpi.so.1(opal_hwloc_base_close+0x32)
>>> [0x7f4588cee700]
>>> [mir1:05542] [ 3]
>>> /hpc/home/USERS/senina/projects/distribs/openmpi-svn_v1.5/install/lib/libmpi.so.1(opal_finalize+0x73)
>>> [0x7f4588d1beb2]
>>> [mir1:05542] [ 4]
>>> /hpc/home/USERS/senina/projects/distribs/openmpi-svn_v1.5/install/lib/libmpi.so.1(orte_finalize+0xfe)
>>> [0x7f4588c81eb5]
>>> [mir1:05542] [ 5]
>>> /hpc/home/USERS/senina/projects/distribs/openmpi-svn_v1.5/install/lib/libmpi.so.1(ompi_mpi_finalize+0x67a)
>>> [0x7f4588c217c3]
>>> [mir1:05542] [ 6]
>>> /hpc/home/USERS/senina/projects/distribs/openmpi-svn_v1.5/install/lib/libmpi.so.1(PMPI_Finalize+0x59)
>>> [0x7f4588c39959]
>>> [mir1:05542] [ 7] ./hello(main+0x69) [0x4008fd]
>>> [mir1:05542] [ 8] /lib64/libc.so.6(__libc_start_main+0xfd) [0x390ca1ec5d]
>>> [mir1:05542] [ 9] ./hello() [0x4007d9]
>>> [mir1:05542] *** End of error message ***
>>> [mir2:10218] [ 0] /lib64/libpthread.so.0() [0x3a6dc0f4c0]
>>> [mir2:10218] [ 1]
>>> /hpc/home/USERS/senina/projects/distribs/openmpi-svn_v1.5/install/lib/libmpi.so.1(+0x1346a8)
>>> [0x7f409f31d6a8]
>>> [mir2:10218] [ 2]
>>> /hpc/home/USERS/senina/projects/distribs/openmpi-svn_v1.5/install/lib/libmpi.so.1(opal_hwloc_base_close+0x32)
>>> [0x7f409f31d700]
>>> [mir2:10218] [ 3]
>>> /hpc/home/USERS/senina/projects/distribs/openmpi-svn_v1.5/install/lib/libmpi.so.1(opal_finalize+0x73)
>>> [0x7f409f34aeb2]
>>> [mir2:10218] [ 4]
>>> /hpc/home/USERS/senina/projects/distribs/openmpi-svn_v1.5/install/lib/libmpi.so.1(orte_finalize+0xfe)
>>> [0x7f409f2b0eb5]
>>> [mir2:10218] [ 5]
>>> /hpc/home/USERS/senina/projects/distribs/openmpi-svn_v1.5/install/lib/libmpi.so.1(ompi_mpi_finalize+0x67a)
>>> [0x7f409f2507c3]
>>> [mir2:10218] [ 6]
>>> /hpc/home/USERS/senina/projects/distribs/openmpi-svn_v1.5/install/lib/libmpi.so.1(PMPI_Finalize+0x59)
>>> [0x7f409f268959]
>>> [mir2:10218] [ 7] ./hello(main+0x69) [0x4008fd]
>>> [mir2:10218] [ 8] /lib64/libc.so.6(__libc_start_main+0xfd) [0x3a6d41ec5d]
>>> [mir2:10218] [ 9] ./hello() [0x4007d9]
>>> [mir2:10218] *** End of error message ***
>>> --
>>> mpirun noticed that process rank 0 with PID 5542 on node mir1 exited
>>> on signal 11 (Segmentation fault).
>>> -
>>> 
>>> Thanks,
>>> Andrew Senin
>>> 

Re: [OMPI users] system() call corrupts MPI processes

2012-01-19 Thread Jeff Squyres
Which network transport are you using, and what version of Open MPI are you 
using?  Do you have OpenFabrics support compiled into your Open MPI 
installation?

If you're just using TCP and/or shared memory, I can't think of a reason 
immediately as to why this wouldn't work, but there may be a subtle interaction 
in there somewhere that causes badness (e.g., memory corruption).


On Jan 19, 2012, at 1:57 AM, Randolph Pullen wrote:

> 
> I have a section in my code running in rank 0 that must start a perl program 
> that it then connects to via a tcp socket.
> The initialisation section is shown here:
> 
> sprintf(buf, "%s/session_server.pl -p %d &", PATH,port);
> int i = system(buf);
> printf("system returned %d\n", i);
> 
> 
> Some time after I run this code, while waiting for the data from the perl 
> program, the error below occurs:
> 
> qplan connection
> DCsession_fetch: waiting for Mcode data...
> [dc1:05387] [[40050,1],0] ORTE_ERROR_LOG: A message is attempting to be sent 
> to a process whose contact information is unknown in file rml_oob_send.c at 
> line 105
> [dc1:05387] [[40050,1],0] could not get route to [[INVALID],INVALID]
> [dc1:05387] [[40050,1],0] ORTE_ERROR_LOG: A message is attempting to be sent 
> to a process whose contact information is unknown in file 
> base/plm_base_proxy.c at line 86
> 
> 
> It seems that the linux system() call is breaking OpenMPI internal 
> connections.  Removing the system() call and executing the perl code 
> externaly fixes the problem but I can't go into production like that as its a 
> security issue.
> 
> Any ideas ?
> 
> (environment: OpenMPI 1.4.1 on kernel Linux dc1 
> 2.6.18-274.3.1.el5.028stab094.3  using TCP and mpirun)
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] FW: mpirun hangs when used on more than 2 CPUs ( mpirun compiled without thread support )

2012-01-19 Thread Jeff Squyres
The thought occurs to me... (disclaimer: I know just about zero about OpenFoam 
and how to install/use it)

If your customer has been dealing with binaries, I wonder if there is some kind 
of ABI incompatibility going on here.  Open MPI did not provide any ABI 
guarantees until Open MPI v1.3.2 -- see 
http://www.open-mpi.org/software/ompi/versions/ for details.

Also, Open MPI v1.3.2 is a bit old.  There have been many bug fixes since then 
-- 1.4.4 is the latest stable.  There will be a 1.4.5 shortly, but that will be 
the last on the 1.4 series.


On Jan 19, 2012, at 5:51 AM, Theiner, Andre wrote:

> Hi all,
> I have to stop my investigations and repairs on the request of my customer.
> I will unsubscribe from this list soon.
> 
> I found out that OpenFoam does not use threaded MPI-calls.
> My next step would have been to compile openmpi-1.4.4 and have the user try 
> this.
> In case it would have also not worked I would have compiled the whole 
> OpenFoam from the sources.
> Up to now the user uses a rpm binary version of OF 2.0.1.
> 
> Thanks for all your  support.
> 
> 
> Andre
> 
> 
> -Original Message-
> From: Theiner, Andre
> Sent: Mittwoch, 18. Januar 2012 10:15
> To: 'Open MPI Users'
> Subject: RE: [OMPI users] mpirun hangs when used on more than 2 CPUs ( mpirun 
> compiled without thread support )
> Importance: High
> 
> Thanks, Jeff and Ralph for your good help.
> I do not know yet, whether OpenFoam uses threads with OpenMPI but I will find 
> out.
> 
> I ran "ompi_info" and it output the lines in the next chapter.
> The important line is " Thread support: posix (mpi: no, progress: no)".
> At first sight the above line made me think that I found the cause of the 
> problem
> but I compared the output to the output of the same command run on another 
> machine
> where OpenFoam runs fine. The OpenMPI version of that machine is 1.3.2-1.1 
> and it
> also does not have thread support.
> The difference though is that that machine's OpenFoam version is 1.7.1 and 
> not 2.0.1 and the
> OS is SUSE Linux Enterprise Desktop 11 SP1 and not openSUSE 11.3.
> So I am at the beginning of the search for the cause of the problem.
> 
> Package: Open MPI abuild@build30 Distribution
>Open MPI: 1.3.2
>   Open MPI SVN revision: r21054
>   Open MPI release date: Apr 21, 2009
>Open RTE: 1.3.2
>   Open RTE SVN revision: r21054
>   Open RTE release date: Apr 21, 2009
>OPAL: 1.3.2
>   OPAL SVN revision: r21054
>   OPAL release date: Apr 21, 2009
>Ident string: 1.3.2
>  Prefix: /usr/lib64/mpi/gcc/openmpi
> Configured architecture: x86_64-unknown-linux-gnu
>  Configure host: build30
>   Configured by: abuild
>   Configured on: Fri Sep 23 05:58:54 UTC 2011
>  Configure host: build30
>Built by: abuild
>Built on: Fri Sep 23 06:11:31 UTC 2011
>  Built host: build30
>  C bindings: yes
>C++ bindings: yes
>  Fortran77 bindings: yes (all)
>  Fortran90 bindings: yes
> Fortran90 bindings size: small
>  C compiler: gcc
> C compiler absolute: /usr/bin/gcc
>C++ compiler: g++
>   C++ compiler absolute: /usr/bin/g++
>  Fortran77 compiler: gfortran
>  Fortran77 compiler abs: /usr/bin/gfortran
>  Fortran90 compiler: gfortran
>  Fortran90 compiler abs: /usr/bin/gfortran
> C profiling: yes
>   C++ profiling: yes
> Fortran77 profiling: yes
> Fortran90 profiling: yes
>  C++ exceptions: no
>  Thread support: posix (mpi: no, progress: no)
>   Sparse Groups: no
>  Internal debug support: no
> MPI parameter check: runtime
> Memory profiling support: no
> Memory debugging support: no
> libltdl support: yes
>   Heterogeneous support: no
> mpirun default --prefix: no
> MPI I/O support: yes
>   MPI_WTIME support: gettimeofday
> Symbol visibility support: yes
>   FT Checkpoint support: no  (checkpoint thread: no)
>   MCA backtrace: execinfo (MCA v2.0, API v2.0, Component v1.3.2)
>  MCA memory: ptmalloc2 (MCA v2.0, API v2.0, Component v1.3.2)
>   MCA paffinity: linux (MCA v2.0, API v2.0, Component v1.3.2)
>   MCA carto: auto_detect (MCA v2.0, API v2.0, Component v1.3.2)
>   MCA carto: file (MCA v2.0, API v2.0, Component v1.3.2)
>   MCA maffinity: first_use (MCA v2.0, API v2.0, Component v1.3.2)
>   MCA timer: linux (MCA v2.0, API v2.0, Component v1.3.2)
> MCA installdirs: env (MCA v2.0, API v2.0, Component v1.3.2)
> MCA installdirs: config (MCA v2.0, API v2.0, Component v1.3.2)
> MCA dpm: orte (MCA v2.0, API v2.0, Component v1.3.2)
>  MCA pubsub: orte (MCA v2.0, API v2.0, Component v1.3.2)
>   MCA allocator: basic (MCA v2.0, API v2.0, Component v1.3.2)
>   MCA allocator: bucket (MCA v2.0, API 

Re: [OMPI users] mpirun hangs when used on more than 2 CPUs ( mpirun compiled without thread support )

2012-01-19 Thread Jeff Squyres
On Jan 18, 2012, at 4:15 AM, Theiner, Andre wrote:

> I also have requested the user to run the following adaption to his original
> command "mpriun -np 9 interFoam -parallel". I hoped to get a kind of debug 
> output
> which points me into the right way. The new command did not work and I am a 
> bit lost.
> Is the syntax wrong somehow or is there a problem in the user's PATH?
> I do not understand what debugger is wanted. Does mpirun not have an internal 
> debugger?
> 
> testuser@caelde04:~/OpenFOAM/testuser-2.0.1/nozzleFlow2D> mpirun -v --debug 
> --debug-daemons -np 9 interfoam -parallel
> --
> A suitable debugger could not be found in your PATH.
> Check the values specified in the orte_base_user_debugger MCA parameter for 
> the list of debuggers that was searched.

The --debug option is probably not doing what you think in this case.  Here's 
what the man page says:

   For debugging:

   -debug, --debug
  Invoke   theuser-leveldebuggerindicatedbythe
  orte_base_user_debugger MCA parameter.

   -debugger, --debugger
  Sequence  of  debuggers to search for when --debug is used (i.e.
  a synonym for orte_base_user_debugger MCA parameter).

   -tv, --tv
  Launch processes under the TotalView debugger.  Deprecated back-
  wards compatibility flag. Synonym for --debug.

Hence, the --debug flag is trying to invoke a parallel debugger.  It doesn't 
actually show debug-level information about what is happening in mpirun or the 
MPI processes.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




[OMPI users] FW: mpirun hangs when used on more than 2 CPUs ( mpirun compiled without thread support )

2012-01-19 Thread Theiner, Andre
Hi all,
I have to stop my investigations and repairs on the request of my customer.
I will unsubscribe from this list soon.

I found out that OpenFoam does not use threaded MPI-calls.
My next step would have been to compile openmpi-1.4.4 and have the user try 
this.
In case it would have also not worked I would have compiled the whole OpenFoam 
from the sources.
Up to now the user uses a rpm binary version of OF 2.0.1.

Thanks for all your  support.


Andre


-Original Message-
From: Theiner, Andre
Sent: Mittwoch, 18. Januar 2012 10:15
To: 'Open MPI Users'
Subject: RE: [OMPI users] mpirun hangs when used on more than 2 CPUs ( mpirun 
compiled without thread support )
Importance: High

Thanks, Jeff and Ralph for your good help.
I do not know yet, whether OpenFoam uses threads with OpenMPI but I will find 
out.

I ran "ompi_info" and it output the lines in the next chapter.
The important line is " Thread support: posix (mpi: no, progress: no)".
At first sight the above line made me think that I found the cause of the 
problem
but I compared the output to the output of the same command run on another 
machine
where OpenFoam runs fine. The OpenMPI version of that machine is 1.3.2-1.1 and 
it
also does not have thread support.
The difference though is that that machine's OpenFoam version is 1.7.1 and not 
2.0.1 and the
OS is SUSE Linux Enterprise Desktop 11 SP1 and not openSUSE 11.3.
So I am at the beginning of the search for the cause of the problem.

 Package: Open MPI abuild@build30 Distribution
Open MPI: 1.3.2
   Open MPI SVN revision: r21054
   Open MPI release date: Apr 21, 2009
Open RTE: 1.3.2
   Open RTE SVN revision: r21054
   Open RTE release date: Apr 21, 2009
OPAL: 1.3.2
   OPAL SVN revision: r21054
   OPAL release date: Apr 21, 2009
Ident string: 1.3.2
  Prefix: /usr/lib64/mpi/gcc/openmpi
 Configured architecture: x86_64-unknown-linux-gnu
  Configure host: build30
   Configured by: abuild
   Configured on: Fri Sep 23 05:58:54 UTC 2011
  Configure host: build30
Built by: abuild
Built on: Fri Sep 23 06:11:31 UTC 2011
  Built host: build30
  C bindings: yes
C++ bindings: yes
  Fortran77 bindings: yes (all)
  Fortran90 bindings: yes
 Fortran90 bindings size: small
  C compiler: gcc
 C compiler absolute: /usr/bin/gcc
C++ compiler: g++
   C++ compiler absolute: /usr/bin/g++
  Fortran77 compiler: gfortran
  Fortran77 compiler abs: /usr/bin/gfortran
  Fortran90 compiler: gfortran
  Fortran90 compiler abs: /usr/bin/gfortran
 C profiling: yes
   C++ profiling: yes
 Fortran77 profiling: yes
 Fortran90 profiling: yes
  C++ exceptions: no
  Thread support: posix (mpi: no, progress: no)
   Sparse Groups: no
  Internal debug support: no
 MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
 libltdl support: yes
   Heterogeneous support: no
 mpirun default --prefix: no
 MPI I/O support: yes
   MPI_WTIME support: gettimeofday
Symbol visibility support: yes
   FT Checkpoint support: no  (checkpoint thread: no)
   MCA backtrace: execinfo (MCA v2.0, API v2.0, Component v1.3.2)
  MCA memory: ptmalloc2 (MCA v2.0, API v2.0, Component v1.3.2)
   MCA paffinity: linux (MCA v2.0, API v2.0, Component v1.3.2)
   MCA carto: auto_detect (MCA v2.0, API v2.0, Component v1.3.2)
   MCA carto: file (MCA v2.0, API v2.0, Component v1.3.2)
   MCA maffinity: first_use (MCA v2.0, API v2.0, Component v1.3.2)
   MCA timer: linux (MCA v2.0, API v2.0, Component v1.3.2)
 MCA installdirs: env (MCA v2.0, API v2.0, Component v1.3.2)
 MCA installdirs: config (MCA v2.0, API v2.0, Component v1.3.2)
 MCA dpm: orte (MCA v2.0, API v2.0, Component v1.3.2)
  MCA pubsub: orte (MCA v2.0, API v2.0, Component v1.3.2)
   MCA allocator: basic (MCA v2.0, API v2.0, Component v1.3.2)
   MCA allocator: bucket (MCA v2.0, API v2.0, Component v1.3.2)
MCA coll: basic (MCA v2.0, API v2.0, Component v1.3.2)
MCA coll: hierarch (MCA v2.0, API v2.0, Component v1.3.2)
MCA coll: inter (MCA v2.0, API v2.0, Component v1.3.2)
MCA coll: self (MCA v2.0, API v2.0, Component v1.3.2)
MCA coll: sm (MCA v2.0, API v2.0, Component v1.3.2)
MCA coll: sync (MCA v2.0, API v2.0, Component v1.3.2)
MCA coll: tuned (MCA v2.0, API v2.0, Component v1.3.2)
  MCA io: romio (MCA v2.0, API v2.0, Component v1.3.2)
   MCA mpool: fake (MCA v2.0, API v2.0, Component v1.3.2)
   MCA mpool: rdma (MCA v2.0, API v2.0, Component v1.3.2)
   MCA mpool: sm (MCA 

Re: [OMPI users] How to configure Intel Visual Fortran to work with OpenMPI

2012-01-19 Thread Shiqing Fan

Hi,

I made a binary package based on the current 1.5 rc2. Please find it in 
the download link here: http://db.tt/AXJF40ph. It should work on your 
win XP. Thanks a lot for testing it.



Regards,
Shiqing



On 2012-01-19 8:36 AM, Robert garcia wrote:

Thanks for clarification,

For now I 'll stick with current version of Win32 XP , if it's not a 
heavy workload, could you build a binary package ?


Regards,



Date: Wed, 18 Jan 2012 23:38:06 +0100
From: f...@hlrs.de
To: us...@open-mpi.org
CC: robertgarcia2...@hotmail.com
Subject: Re: [OMPI users] How to configure Intel Visual Fortran to 
work with OpenMPI


Hi Robert,

This is a known issue. The released binary was built on Windows Server 
2008,  which has newer Windows system dependencies. We have fixed this 
problem and it is included for the next release. If you don't want to 
switch to another Windows version, I can build one working binary 
package and pass it to you off-list, meanwhile, you can also wait for 
the upcoming release.



Best Regards,
Shiqing


On 2012-01-18 6:21 PM, Robert garcia wrote:

hi all,

I'm experiencing a difficulty when I'm trying to run the fortran
code which which uses parallel processing. I started with
downloading the /"OpenMPI_v1.5.4-1_win32.exe/

"
 I've
configured everything in IVF to point out to the correct static
libs and search paths, also set the enviromental variables Path to
directory where OpenMPI dll's(e.g. libmpi.dll, libmpid.dll
etc..) reside. Program compiles and links properly till the
message pops out like "The procedure entry point
InterlockedCompareExchange64 could not be located in the dynamic
link library KERNEL32.dll"
What can be the problem ?
Regards,



___
users mailing list
us...@open-mpi.org  
http://www.open-mpi.org/mailman/listinfo.cgi/users



--
---
Shiqing Fan
High Performance Computing Center Stuttgart (HLRS)
Tel: ++49(0)711-685-87234  Nobelstrasse 19
Fax: ++49(0)711-685-65832  70569 Stuttgart
http://www.hlrs.de/organization/people/shiqing-fan/
email:f...@hlrs.de  



--
---
Shiqing Fan
High Performance Computing Center Stuttgart (HLRS)
Tel: ++49(0)711-685-87234  Nobelstrasse 19
Fax: ++49(0)711-685-65832  70569 Stuttgart
http://www.hlrs.de/organization/people/shiqing-fan/
email: f...@hlrs.de



Re: [OMPI users] How to configure Intel Visual Fortran to work with OpenMPI

2012-01-19 Thread Robert garcia

Thanks for clarification, 

For now I 'll stick with current version of Win32 XP , if it's not a heavy 
workload, could you build a binary package ?

Regards, 




List-Post: users@lists.open-mpi.org
Date: Wed, 18 Jan 2012 23:38:06 +0100
From: f...@hlrs.de
To: us...@open-mpi.org
CC: robertgarcia2...@hotmail.com
Subject: Re: [OMPI users] How to configure Intel Visual Fortran to work with 
OpenMPI


Hi Robert,

This is a known issue. The released binary was built on Windows Server 2008,  
which has newer Windows system dependencies. We have fixed this problem and it 
is included for the next release. If you don't want to switch to another 
Windows version, I can build one working binary package and pass it to you 
off-list, meanwhile, you can also wait for the upcoming release.


Best Regards,
Shiqing


On 2012-01-18 6:21 PM, Robert garcia wrote: 





hi all, 

I'm experiencing a difficulty when I'm trying to run the fortran code which 
which uses parallel processing. I started with downloading the 
"OpenMPI_v1.5.4-1_win32.exe" I've configured everything in IVF to point out to 
the correct static libs and search paths, also set the enviromental variables 
Path to directory where OpenMPI dll's(e.g. libmpi.dll, libmpid.dll etc..) 
reside. Program compiles and links properly till the message pops out like "The 
procedure entry point InterlockedCompareExchange64 could not be located in the 
dynamic link library KERNEL32.dll"

What can be the problem ?

Regards, 




___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

-- 
---
Shiqing Fan
High Performance Computing Center Stuttgart (HLRS)
Tel: ++49(0)711-685-87234  Nobelstrasse 19
Fax: ++49(0)711-685-65832  70569 Stuttgart
http://www.hlrs.de/organization/people/shiqing-fan/
email: f...@hlrs.de


[OMPI users] system() call corrupts MPI processes

2012-01-19 Thread Randolph Pullen

I have a section in my code running in rank 0 that must start a perl program 
that it then connects to via a tcp socket.
The initialisation section is shown here:

    sprintf(buf, "%s/session_server.pl -p %d &", PATH,port);
    int i = system(buf);
    printf("system returned %d\n", i);


Some time after I run this code, while waiting for the data from the perl 
program, the error below occurs:

qplan connection
DCsession_fetch: waiting for Mcode data...
[dc1:05387] [[40050,1],0] ORTE_ERROR_LOG: A message is attempting to be sent to 
a process whose contact information is unknown in file rml_oob_send.c at line 
105
[dc1:05387] [[40050,1],0] could not get route to [[INVALID],INVALID]
[dc1:05387] [[40050,1],0] ORTE_ERROR_LOG: A message is attempting to be sent to 
a process whose contact information is unknown in file base/plm_base_proxy.c at 
line 86


It seems that the linux system() call is breaking OpenMPI internal connections. 
 Removing the system() call and executing the perl code externaly fixes the 
problem but I can't go into production like that as its a security issue.

Any ideas ?


(environment: OpenMPI 1.4.1 on kernel Linux dc1 2.6.18-274.3.1.el5.028stab094.3 
 using TCP and mpirun)