Re: [OMPI users] MPI_Init never returns on IA64
Could you try one of the 1.4.2 nightly tarballs and see if that makes the issue better? http://www.open-mpi.org/nightly/v1.4/

On Mar 29, 2010, at 7:47 PM, Shaun Jackman wrote:

> Hi,
>
> On an IA64 platform, MPI_Init never returns. I fired up GDB, and it seems
> that ompi_free_list_grow never returns. My test program does nothing but
> call MPI_Init. Here's the backtrace:
>
> (gdb) bt
> #0  0x20075620 in ompi_free_list_grow () from /home/aubjtl/openmpi/lib/libmpi.so.0
> #1  0x20078e50 in ompi_rb_tree_init () from /home/aubjtl/openmpi/lib/libmpi.so.0
> #2  0x20160840 in mca_mpool_base_tree_init () from /home/aubjtl/openmpi/lib/libmpi.so.0
> #3  0x2015dac0 in mca_mpool_base_open () from /home/aubjtl/openmpi/lib/libmpi.so.0
> #4  0x200bfd30 in ompi_mpi_init () from /home/aubjtl/openmpi/lib/libmpi.so.0
> #5  0x2010efb0 in PMPI_Init () from /home/aubjtl/openmpi/lib/libmpi.so.0
> #6  0x4b70 in main ()
>
> Any suggestions on how I can troubleshoot?
>
> $ mpirun --version
> mpirun (Open MPI) 1.4.1
> $ ./config.guess
> ia64-unknown-linux-gnu
>
> Thanks,
> Shaun

--
Jeff Squyres
jsquy...@cisco.com
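A minimal sketch of testing a nightly tarball, with debug symbols enabled so that any future backtrace resolves to source lines (the snapshot filename and the test-program name below are hypothetical; --enable-debug is Open MPI's standard debug-build configure switch):

$ wget http://www.open-mpi.org/nightly/v1.4/openmpi-1.4.2rc1.tar.gz   # hypothetical snapshot name; pick whatever is current
$ tar xzf openmpi-1.4.2rc1.tar.gz && cd openmpi-1.4.2rc1
$ ./configure --prefix=$HOME/openmpi-nightly --enable-debug
$ make all install
$ $HOME/openmpi-nightly/bin/mpirun -np 1 ./init_only   # re-run the MPI_Init-only test against the new build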
[OMPI users] MPI_Init never returns on IA64
Hi,

On an IA64 platform, MPI_Init never returns. I fired up GDB, and it seems that ompi_free_list_grow never returns. My test program does nothing but call MPI_Init. Here's the backtrace:

(gdb) bt
#0  0x20075620 in ompi_free_list_grow () from /home/aubjtl/openmpi/lib/libmpi.so.0
#1  0x20078e50 in ompi_rb_tree_init () from /home/aubjtl/openmpi/lib/libmpi.so.0
#2  0x20160840 in mca_mpool_base_tree_init () from /home/aubjtl/openmpi/lib/libmpi.so.0
#3  0x2015dac0 in mca_mpool_base_open () from /home/aubjtl/openmpi/lib/libmpi.so.0
#4  0x200bfd30 in ompi_mpi_init () from /home/aubjtl/openmpi/lib/libmpi.so.0
#5  0x2010efb0 in PMPI_Init () from /home/aubjtl/openmpi/lib/libmpi.so.0
#6  0x4b70 in main ()

Any suggestions on how I can troubleshoot?

$ mpirun --version
mpirun (Open MPI) 1.4.1
$ ./config.guess
ia64-unknown-linux-gnu

Thanks,
Shaun
Re: [OMPI users] openMPI on Xgrid
I have an environment a few trusted users could use to test. However, I have neither the expertise nor the time to do the debugging myself.

Cheers, Jody

On 2010-03-29, at 1:27 PM, Jeff Squyres wrote:

> On Mar 29, 2010, at 4:11 PM, Cristobal Navarro wrote:
>
>> I realized that the Xcode dev tools include openMPI 1.2.x.
>> Should I keep trying? Or do you recommend completely abandoning xgrid and going for another tool like Torque with openMPI?
>
> FWIW, Open MPI v1.2.x is fairly ancient -- the v1.4 series includes a few years' worth of improvements and bug fixes since the 1.2 series.
>
> It would be great (hint hint) if someone could fix the xgrid support for us... We simply no longer have anyone in the active development group who has the expertise or test environment to make our xgrid work. :-(
>
> --
> Jeff Squyres
> jsquy...@cisco.com
Re: [OMPI users] openMPI on Xgrid
On Mar 29, 2010, at 4:11 PM, Cristobal Navarro wrote:

> I realized that the Xcode dev tools include openMPI 1.2.x.
> Should I keep trying? Or do you recommend completely abandoning xgrid and going for another tool like Torque with openMPI?

FWIW, Open MPI v1.2.x is fairly ancient -- the v1.4 series includes a few years' worth of improvements and bug fixes since the 1.2 series.

It would be great (hint hint) if someone could fix the xgrid support for us... We simply no longer have anyone in the active development group who has the expertise or test environment to make our xgrid work. :-(

--
Jeff Squyres
jsquy...@cisco.com
Re: [OMPI users] openMPI on Xgrid
At least it would be a good exercise to complete the process with xgrid + openMPI, for the knowledge.

Cristobal

On Mon, Mar 29, 2010 at 4:11 PM, Cristobal Navarro wrote:

> I realized that the Xcode dev tools include openMPI 1.2.x.
> Should I keep trying? Or do you recommend completely abandoning xgrid and going for another tool like Torque with openMPI?
>
> On Mon, Mar 29, 2010 at 3:48 PM, Jody Klymak wrote:
>
>> On Mar 29, 2010, at 12:39 PM, Ralph Castain wrote:
>>
>>> On Mar 29, 2010, at 1:34 PM, Cristobal Navarro wrote:
>>>
>>>> thanks for the information,
>>>> but is it possible to make it work with xgrid, or does the 1.4.1 version just not support it?
>>
>> FWIW, I've had excellent success with Torque and openmpi on OS-X 10.5 Server.
>>
>> http://www.clusterresources.com/products/torque-resource-manager.php
>>
>> It doesn't have a nice dashboard, but the queue tools are more than adequate for my needs.
>>
>> Open MPI had a funny port issue on my setup that folks helped with.
>>
>> From my notes:
>>
>> Edited /Network/Xgrid/openmpi/etc/openmpi-mca-params.conf to make sure that the right ports are used:
>>
>> # set ports so that they are more valid than the default ones (see email from Ralph Castain)
>> btl_tcp_port_min_v4 = 36900
>> btl_tcp_port_range = 32
>>
>> Cheers, Jody
>>
>> --
>> Jody Klymak
>> http://web.uvic.ca/~jklymak/
Re: [OMPI users] openMPI on Xgrid
I realized that the Xcode dev tools include openMPI 1.2.x. Should I keep trying? Or do you recommend completely abandoning xgrid and going for another tool like Torque with openMPI?

On Mon, Mar 29, 2010 at 3:48 PM, Jody Klymak wrote:

> On Mar 29, 2010, at 12:39 PM, Ralph Castain wrote:
>
>> On Mar 29, 2010, at 1:34 PM, Cristobal Navarro wrote:
>>
>>> thanks for the information,
>>> but is it possible to make it work with xgrid, or does the 1.4.1 version just not support it?
>
> FWIW, I've had excellent success with Torque and openmpi on OS-X 10.5 Server.
>
> http://www.clusterresources.com/products/torque-resource-manager.php
>
> It doesn't have a nice dashboard, but the queue tools are more than adequate for my needs.
>
> Open MPI had a funny port issue on my setup that folks helped with.
>
> From my notes:
>
> Edited /Network/Xgrid/openmpi/etc/openmpi-mca-params.conf to make sure that the right ports are used:
>
> # set ports so that they are more valid than the default ones (see email from Ralph Castain)
> btl_tcp_port_min_v4 = 36900
> btl_tcp_port_range = 32
>
> Cheers, Jody
>
> --
> Jody Klymak
> http://web.uvic.ca/~jklymak/
[OMPI users] OPEN_MPI macro for mpif.h?
Hello,

Looking at the Open MPI mpi.h include file, there is a preprocessor macro OPEN_MPI defined, as well as e.g. OMPI_MAJOR_VERSION, OMPI_MINOR_VERSION and OMPI_RELEASE_VERSION. version.h e.g. also defines OMPI_VERSION.

These seem to be missing in mpif.h, and therefore something like

      include 'mpif.h'
      [...]
#ifdef OPEN_MPI
      write( *, '("MPI library: OpenMPI",I2,".",I2,".",I2)' ) &
     &    OMPI_MAJOR_VERSION, OMPI_MINOR_VERSION, OMPI_RELEASE_VERSION
#endif

doesn't work for a Fortran Open MPI program. Which Open MPI specific preprocessor macros should be used with the Fortran bindings?

Thanks, Martin

--
Dr.-Ing. Martin Bernreuther
University of Stuttgart
High Performance Computing Center (HLRS)
Nobelstrasse 19 (Office: Allmandring 30, 0.032)
70569 Stuttgart, Germany
Phone: (++49-(0)711) 685-64542, Fax: (++49-(0)711) 685-65832
E-Mail: bernreut...@hlrs.de
URL: http://www.hlrs.de/people/bernreuther/
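One way to check what each header actually provides is to grep for the version macros (a sketch; the /usr/local/include path is an assumed install prefix):

$ grep -E 'OMPI_(MAJOR|MINOR|RELEASE)_VERSION' /usr/local/include/mpi.h    # the C header carries these
$ grep -E 'OMPI_(MAJOR|MINOR|RELEASE)_VERSION' /usr/local/include/mpif.h   # empty output would confirm they are absent from the Fortran header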
Re: [OMPI users] openMPI on Xgrid
On Mar 29, 2010, at 12:39 PM, Ralph Castain wrote:

> On Mar 29, 2010, at 1:34 PM, Cristobal Navarro wrote:
>
>> thanks for the information,
>> but is it possible to make it work with xgrid, or does the 1.4.1 version just not support it?

FWIW, I've had excellent success with Torque and openmpi on OS-X 10.5 Server.

http://www.clusterresources.com/products/torque-resource-manager.php

It doesn't have a nice dashboard, but the queue tools are more than adequate for my needs.

Open MPI had a funny port issue on my setup that folks helped with.

From my notes:

Edited /Network/Xgrid/openmpi/etc/openmpi-mca-params.conf to make sure that the right ports are used:

# set ports so that they are more valid than the default ones (see email from Ralph Castain)
btl_tcp_port_min_v4 = 36900
btl_tcp_port_range = 32

Cheers, Jody

--
Jody Klymak
http://web.uvic.ca/~jklymak/
Re: [OMPI users] openMPI on Xgrid
On Mar 29, 2010, at 1:34 PM, Cristobal Navarro wrote:

> thanks for the information,
>
> but is it possible to make it work with xgrid, or does the 1.4.1 version just not support it?

I'm afraid it just doesn't support it - we made the support compile, but we have no way to test/debug the operation, so it is turned "off".

> On Mon, Mar 29, 2010 at 3:07 PM, Ralph Castain wrote:
>
> Our xgrid support has been broken for some time now due to lack of access to a test environment. So your system is using rsh/ssh instead.
>
> Until we get someone interested in xgrid, or at least willing to debug it and tell us what needs to be done, I'm afraid our xgrid support will be lacking.
>
> On Mar 29, 2010, at 12:56 PM, Cristobal Navarro wrote:
>
>> Hello,
>> I am new on this mailing list!
>> I've read the other messages about configuring openMPI on Xgrid, but I haven't solved my problem yet, and openMPI keeps running as if Xgrid didn't exist.
>>
>> I configured xgrid properly, and I can send simple C program jobs through the command line from my client, which for the moment is the same machine as the controller and the agent.
>>
>> xgrid -h localhost -p pass -job run ./helloWorld
>>
>> I also installed Xgrid Admin for monitoring.
>>
>> Then I compiled openMPI 1.4.1 with these options:
>>
>> ./configure --prefix=/usr/local/openmpi/ --enable-shared --disable-static --with-xgrid
>> sudo make
>> sudo make install
>>
>> and I made a simple helloMPI example:
>>
>> /* MPI C Example */
>> #include <stdio.h>
>> #include <mpi.h>
>>
>> int main (argc, argv)
>> int argc;
>> char *argv[];
>> {
>>   int rank, size;
>>
>>   MPI_Init (&argc, &argv);               /* starts MPI */
>>   MPI_Comm_rank (MPI_COMM_WORLD, &rank); /* get current process id */
>>   MPI_Comm_size (MPI_COMM_WORLD, &size); /* get number of processes */
>>   printf( "Hello world from process %d of %d\n", rank, size );
>>   MPI_Finalize();
>>   return 0;
>> }
>>
>> and it compiled successfully:
>>
>> mpicc hellompi.c -o hellompi
>>
>> Then I ran it:
>>
>> mpirun -np 2 hellompi
>> I am running on ijorge.local
>> Hello World from process 0 of 2
>> I am running on ijorge.local
>> Hello World from process 1 of 2
>>
>> The results are correct, but when I check Xgrid Admin, I see that the execution didn't go through Xgrid, since there aren't any new jobs on the list. In the end, openMPI and Xgrid are not communicating with each other.
>>
>> What am I missing?
>>
>> My environment variables are these:
>>
>> echo $XGRID_CONTROLLER_HOSTNAME
>> ijorge.local
>> echo $XGRID_CONTROLLER_PASSWORD
>> myPassword
>>
>> Any help is welcome!
>> Thanks in advance
>>
>> Cristobal
Re: [OMPI users] openMPI on Xgrid
thanks for the information,

but is it possible to make it work with xgrid, or does the 1.4.1 version just not support it?

On Mon, Mar 29, 2010 at 3:07 PM, Ralph Castain wrote:

> Our xgrid support has been broken for some time now due to lack of access to a test environment. So your system is using rsh/ssh instead.
>
> Until we get someone interested in xgrid, or at least willing to debug it and tell us what needs to be done, I'm afraid our xgrid support will be lacking.
>
> On Mar 29, 2010, at 12:56 PM, Cristobal Navarro wrote:
>
>> Hello,
>> I am new on this mailing list!
>> I've read the other messages about configuring openMPI on Xgrid, but I haven't solved my problem yet, and openMPI keeps running as if Xgrid didn't exist.
>>
>> I configured xgrid properly, and I can send simple C program jobs through the command line from my client, which for the moment is the same machine as the controller and the agent.
>>
>> xgrid -h localhost -p pass -job run ./helloWorld
>>
>> I also installed Xgrid Admin for monitoring.
>>
>> Then I compiled openMPI 1.4.1 with these options:
>>
>> ./configure --prefix=/usr/local/openmpi/ --enable-shared --disable-static --with-xgrid
>> sudo make
>> sudo make install
>>
>> and I made a simple helloMPI example:
>>
>> /* MPI C Example */
>> #include <stdio.h>
>> #include <mpi.h>
>>
>> int main (argc, argv)
>> int argc;
>> char *argv[];
>> {
>>   int rank, size;
>>
>>   MPI_Init (&argc, &argv);               /* starts MPI */
>>   MPI_Comm_rank (MPI_COMM_WORLD, &rank); /* get current process id */
>>   MPI_Comm_size (MPI_COMM_WORLD, &size); /* get number of processes */
>>   printf( "Hello world from process %d of %d\n", rank, size );
>>   MPI_Finalize();
>>   return 0;
>> }
>>
>> and it compiled successfully:
>>
>> mpicc hellompi.c -o hellompi
>>
>> Then I ran it:
>>
>> mpirun -np 2 hellompi
>> I am running on ijorge.local
>> Hello World from process 0 of 2
>> I am running on ijorge.local
>> Hello World from process 1 of 2
>>
>> The results are correct, but when I check Xgrid Admin, I see that the execution didn't go through Xgrid, since there aren't any new jobs on the list. In the end, openMPI and Xgrid are not communicating with each other.
>>
>> What am I missing?
>>
>> My environment variables are these:
>>
>> echo $XGRID_CONTROLLER_HOSTNAME
>> ijorge.local
>> echo $XGRID_CONTROLLER_PASSWORD
>> myPassword
>>
>> Any help is welcome!
>> Thanks in advance
>>
>> Cristobal
Re: [OMPI users] openMPI on Xgrid
Our xgrid support has been broken for some time now due to lack of access to a test environment. So your system is using rsh/ssh instead.

Until we get someone interested in xgrid, or at least willing to debug it and tell us what needs to be done, I'm afraid our xgrid support will be lacking.

On Mar 29, 2010, at 12:56 PM, Cristobal Navarro wrote:

> Hello,
> I am new on this mailing list!
> I've read the other messages about configuring openMPI on Xgrid, but I haven't solved my problem yet, and openMPI keeps running as if Xgrid didn't exist.
>
> I configured xgrid properly, and I can send simple C program jobs through the command line from my client, which for the moment is the same machine as the controller and the agent.
>
> xgrid -h localhost -p pass -job run ./helloWorld
>
> I also installed Xgrid Admin for monitoring.
>
> Then I compiled openMPI 1.4.1 with these options:
>
> ./configure --prefix=/usr/local/openmpi/ --enable-shared --disable-static --with-xgrid
> sudo make
> sudo make install
>
> and I made a simple helloMPI example:
>
> /* MPI C Example */
> #include <stdio.h>
> #include <mpi.h>
>
> int main (argc, argv)
> int argc;
> char *argv[];
> {
>   int rank, size;
>
>   MPI_Init (&argc, &argv);               /* starts MPI */
>   MPI_Comm_rank (MPI_COMM_WORLD, &rank); /* get current process id */
>   MPI_Comm_size (MPI_COMM_WORLD, &size); /* get number of processes */
>   printf( "Hello world from process %d of %d\n", rank, size );
>   MPI_Finalize();
>   return 0;
> }
>
> and it compiled successfully:
>
> mpicc hellompi.c -o hellompi
>
> Then I ran it:
>
> mpirun -np 2 hellompi
> I am running on ijorge.local
> Hello World from process 0 of 2
> I am running on ijorge.local
> Hello World from process 1 of 2
>
> The results are correct, but when I check Xgrid Admin, I see that the execution didn't go through Xgrid, since there aren't any new jobs on the list. In the end, openMPI and Xgrid are not communicating with each other.
>
> What am I missing?
>
> My environment variables are these:
>
> echo $XGRID_CONTROLLER_HOSTNAME
> ijorge.local
> echo $XGRID_CONTROLLER_PASSWORD
> myPassword
>
> Any help is welcome!
> Thanks in advance
>
> Cristobal
[OMPI users] openMPI on Xgrid
Hello,
I am new on this mailing list!
I've read the other messages about configuring openMPI on Xgrid, but I haven't solved my problem yet, and openMPI keeps running as if Xgrid didn't exist.

I configured xgrid properly, and I can send simple C program jobs through the command line from my client, which for the moment is the same machine as the controller and the agent.

>> xgrid -h localhost -p pass -job run ./helloWorld

I also installed Xgrid Admin for monitoring.

Then I compiled openMPI 1.4.1 with these options:

./configure --prefix=/usr/local/openmpi/ --enable-shared --disable-static --with-xgrid
sudo make
sudo make install

and I made a simple helloMPI example:

/* MPI C Example */
#include <stdio.h>
#include <mpi.h>

int main (argc, argv)
int argc;
char *argv[];
{
  int rank, size;

  MPI_Init (&argc, &argv);               /* starts MPI */
  MPI_Comm_rank (MPI_COMM_WORLD, &rank); /* get current process id */
  MPI_Comm_size (MPI_COMM_WORLD, &size); /* get number of processes */
  printf( "Hello world from process %d of %d\n", rank, size );
  MPI_Finalize();
  return 0;
}

and it compiled successfully:

>> mpicc hellompi.c -o hellompi

Then I ran it:

>> mpirun -np 2 hellompi
I am running on ijorge.local
Hello World from process 0 of 2
I am running on ijorge.local
Hello World from process 1 of 2

The results are correct, but when I check Xgrid Admin, I see that the execution didn't go through Xgrid, since there aren't any new jobs on the list. In the end, openMPI and Xgrid are not communicating with each other.

What am I missing?

My environment variables are these:

>> echo $XGRID_CONTROLLER_HOSTNAME
ijorge.local
>> echo $XGRID_CONTROLLER_PASSWORD
myPassword

Any help is welcome!
Thanks in advance

Cristobal
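A quick sanity check for a setup like this (a sketch; it assumes the --with-xgrid build is the one on your PATH) is to ask ompi_info whether an xgrid launcher component was compiled in at all, and to ask mpirun which launcher it actually selects:

$ ompi_info | grep -i xgrid                          # no output means no xgrid component was built
$ mpirun -mca plm_base_verbose 10 -np 2 hellompi     # verbose output names the launcher being used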
Re: [OMPI users] Segmentation fault (11)
Hi Josh/All,

I just tested a simple C application with BLCR and it worked fine.

##
#include <stdio.h>
#include <string.h>
#include <unistd.h>

char * getprocessid()
{
  FILE * read_fp;
  char buffer[BUFSIZ + 1];
  int chars_read;
  char * buffer_data = "12345";
  memset(buffer, '\0', sizeof(buffer));
  read_fp = popen("uname -a", "r");
  /* ... */
  return buffer_data;
}

int main(int argc, char ** argv)
{
  int rank;
  int size;
  char * thedata;
  int n = 0;
  thedata = getprocessid();
  printf(" the data is %s", thedata);
  while (n < 10) {
    printf("value is %d\n", n);
    n++;
    sleep(1);
  }
  printf("bye\n");
}

jean@sun32:/tmp$ cr_run ./pipetest3 &
[1] 31807
jean@sun32:~$ the data is 12345value is 0
value is 1
value is 2
...
value is 9
bye
jean@sun32:/tmp$ cr_checkpoint 31807
jean@sun32:/tmp$ cr_restart context.31807
value is 7
value is 8
value is 9
bye
##

It looks like it's more to do with Open MPI. Any ideas from your side?

Thank you.

Kind regards,
Jean.

--- On Mon, 29/3/10, Josh Hursey wrote:

From: Josh Hursey
Subject: Re: [OMPI users] Segmentation fault (11)
To: "Open MPI Users"
Date: Monday, 29 March, 2010, 16:08

I wonder if this is a bug with BLCR (since the segv stack is in the BLCR thread). Can you try a non-MPI version of this application that uses popen(), and see if BLCR properly checkpoints/restarts it? If so, we can start to see what Open MPI might be doing to confuse things, but I suspect that this might be a bug with BLCR.

Either way, let us know what you find out.

Cheers,
Josh

On Mar 27, 2010, at 6:17 AM, jody wrote:

> I'm not sure if this is the cause of your problems:
> You define the constant BUFFER_SIZE, but in the code you use a constant called BUFSIZ...
>
> Jody
>
> On Fri, Mar 26, 2010 at 10:29 PM, Jean Potsam wrote:
>
>> Dear All,
>> I am having a problem with openmpi. I have installed openmpi 1.4 and blcr 0.8.1.
>>
>> I have written a small mpi application as follows below:
>>
>> ###
>> #include <stdio.h>
>> #include <string.h>
>> #include <limits.h>
>> #include <mpi.h>
>>
>> #define BUFFER_SIZE PIPE_BUF
>>
>> char * getprocessid()
>> {
>>   FILE * read_fp;
>>   char buffer[BUFSIZ + 1];
>>   int chars_read;
>>   char * buffer_data = "12345";
>>   memset(buffer, '\0', sizeof(buffer));
>>   read_fp = popen("uname -a", "r");
>>   /* ... */
>>   return buffer_data;
>> }
>>
>> int main(int argc, char ** argv)
>> {
>>   MPI_Status status;
>>   int rank;
>>   int size;
>>   char * thedata;
>>   MPI_Init(&argc, &argv);
>>   MPI_Comm_size(MPI_COMM_WORLD, &size);
>>   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>   thedata = getprocessid();
>>   printf(" the data is %s", thedata);
>>   MPI_Finalize();
>> }
>>
>> I get the following result:
>>
>> ###
>> jean@sunn32:~$ mpicc pipetest2.c -o pipetest2
>> jean@sunn32:~$ mpirun -np 1 -am ft-enable-cr -mca btl ^openib pipetest2
>> [sun32:19211] *** Process received signal ***
>> [sun32:19211] Signal: Segmentation fault (11)
>> [sun32:19211] Signal code: Address not mapped (1)
>> [sun32:19211] Failing at address: 0x4
>> [sun32:19211] [ 0] [0xb7f3c40c]
>> [sun32:19211] [ 1] /lib/libc.so.6(cfree+0x3b) [0xb796868b]
>> [sun32:19211] [ 2] /usr/local/blcr/lib/libcr.so.0(cri_info_free+0x2a) [0xb7a5925a]
>> [sun32:19211] [ 3] /usr/local/blcr/lib/libcr.so.0 [0xb7a5ac72]
>> [sun32:19211] [ 4] /lib/libc.so.6(__libc_fork+0x186) [0xb7991266]
>> [sun32:19211] [ 5] /lib/libc.so.6(_IO_proc_open+0x7e) [0xb7958b6e]
>> [sun32:19211] [ 6] /lib/libc.so.6(popen+0x6c) [0xb7958dfc]
>> [sun32:19211] [ 7] pipetest2(getprocessid+0x42) [0x8048836]
>> [sun32:19211] [ 8] pipetest2(main+0x4d) [0x8048897]
>> [sun32:19211] [ 9] /lib/libc.so.6(__libc_start_main+0xe5) [0xb7912455]
>> [sun32:19211] [10] pipetest2 [0x8048761]
>> [sun32:19211] *** End of error message ***
>> #
>>
>> However, if I compile the application using gcc, it works fine. The problem arises with:
>> read_fp = popen("uname -a", "r");
>>
>> Does anyone have an idea how to resolve this problem?
>>
>> Many thanks
>>
>> Jean
Re: [OMPI users] Segmentation fault (11)
I wonder if this is a bug with BLCR (since the segv stack is in the BLCR thread). Can you try a non-MPI version of this application that uses popen(), and see if BLCR properly checkpoints/restarts it? If so, we can start to see what Open MPI might be doing to confuse things, but I suspect that this might be a bug with BLCR.

Either way, let us know what you find out.

Cheers,
Josh

On Mar 27, 2010, at 6:17 AM, jody wrote:

> I'm not sure if this is the cause of your problems:
> You define the constant BUFFER_SIZE, but in the code you use a constant called BUFSIZ...
>
> Jody
>
> On Fri, Mar 26, 2010 at 10:29 PM, Jean Potsam wrote:
>
>> Dear All,
>> I am having a problem with openmpi. I have installed openmpi 1.4 and blcr 0.8.1.
>>
>> I have written a small mpi application as follows below:
>>
>> ###
>> #include <stdio.h>
>> #include <string.h>
>> #include <limits.h>
>> #include <mpi.h>
>>
>> #define BUFFER_SIZE PIPE_BUF
>>
>> char * getprocessid()
>> {
>>   FILE * read_fp;
>>   char buffer[BUFSIZ + 1];
>>   int chars_read;
>>   char * buffer_data = "12345";
>>   memset(buffer, '\0', sizeof(buffer));
>>   read_fp = popen("uname -a", "r");
>>   /* ... */
>>   return buffer_data;
>> }
>>
>> int main(int argc, char ** argv)
>> {
>>   MPI_Status status;
>>   int rank;
>>   int size;
>>   char * thedata;
>>   MPI_Init(&argc, &argv);
>>   MPI_Comm_size(MPI_COMM_WORLD, &size);
>>   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>   thedata = getprocessid();
>>   printf(" the data is %s", thedata);
>>   MPI_Finalize();
>> }
>>
>> I get the following result:
>>
>> ###
>> jean@sunn32:~$ mpicc pipetest2.c -o pipetest2
>> jean@sunn32:~$ mpirun -np 1 -am ft-enable-cr -mca btl ^openib pipetest2
>> [sun32:19211] *** Process received signal ***
>> [sun32:19211] Signal: Segmentation fault (11)
>> [sun32:19211] Signal code: Address not mapped (1)
>> [sun32:19211] Failing at address: 0x4
>> [sun32:19211] [ 0] [0xb7f3c40c]
>> [sun32:19211] [ 1] /lib/libc.so.6(cfree+0x3b) [0xb796868b]
>> [sun32:19211] [ 2] /usr/local/blcr/lib/libcr.so.0(cri_info_free+0x2a) [0xb7a5925a]
>> [sun32:19211] [ 3] /usr/local/blcr/lib/libcr.so.0 [0xb7a5ac72]
>> [sun32:19211] [ 4] /lib/libc.so.6(__libc_fork+0x186) [0xb7991266]
>> [sun32:19211] [ 5] /lib/libc.so.6(_IO_proc_open+0x7e) [0xb7958b6e]
>> [sun32:19211] [ 6] /lib/libc.so.6(popen+0x6c) [0xb7958dfc]
>> [sun32:19211] [ 7] pipetest2(getprocessid+0x42) [0x8048836]
>> [sun32:19211] [ 8] pipetest2(main+0x4d) [0x8048897]
>> [sun32:19211] [ 9] /lib/libc.so.6(__libc_start_main+0xe5) [0xb7912455]
>> [sun32:19211] [10] pipetest2 [0x8048761]
>> [sun32:19211] *** End of error message ***
>> #
>>
>> However, if I compile the application using gcc, it works fine. The problem arises with:
>> read_fp = popen("uname -a", "r");
>>
>> Does anyone have an idea how to resolve this problem?
>>
>> Many thanks
>>
>> Jean
Re: [OMPI users] questions about checkpoint/restart on multiple clusters of MPI
On Mar 29, 2010, at 11:53 AM, fengguang tian wrote:

> hi
>
> I have used the --term option, but the mpirun is still hanging. It is the same whether I include the '/' or not. I am installing v1.4 to see whether the problems are still there. I tried, but some problems are still there.

What configure options did you use when building Open MPI?

> BTW, my MPI program will have some input file, and will generate some output file after some computation. It can be checkpointed, but when restarting it, some error happened. Have you met this kind of problem?

Try putting the 'snapc_base_global_snapshot_dir' in the $HOME/.openmpi/mca-params.conf file instead of just on the command line. Like:

snapc_base_global_snapshot_dir=/shared-dir/

I suspect that ompi-restart is looking in the wrong place for your checkpoint. By default it will search $HOME (since that is the default for snapc_base_global_snapshot_dir). If you put this parameter in the mca-params.conf file, then it is always set in any tool (mpirun/ompi-checkpoint/ompi-restart) to the specified value. So ompi-restart will search the correct location for the checkpoint files.

-- Josh

> cheers
> fengguang
>
> On Mon, Mar 29, 2010 at 11:42 AM, Josh Hursey wrote:
>
>> On Mar 23, 2010, at 1:00 PM, Fernando Lemos wrote:
>>
>>> On Tue, Mar 23, 2010 at 12:55 PM, fengguang tian wrote:
>>>
>>>> I use mpirun -np 50 -am ft-enable-cr --mca snapc_base_global_snapshot_dir --hostfile .mpihostfile to store the global checkpoint snapshot into the shared directory /mirror, but the problems are still there: when ompi-checkpoint runs, the mpirun is still not killed; it is hanging there.
>>
>> So the 'ompi-checkpoint' command does not finish? By default 'ompi-checkpoint' does not terminate the MPI job. If you pass the '--term' option to it, then it will.
>>
>>>> when doing ompi-restart, it shows:
>>>>
>>>> mpiu@nimbus:/mirror$ ompi-restart ompi_global_snapshot_333.ckpt/
>>>> --------------------------------------------------------------------------
>>>> Error: The filename (ompi_global_snapshot_333.ckpt/) is invalid because either you have not provided a filename or provided an invalid filename.
>>>> Please see --help for usage.
>>>> --------------------------------------------------------------------------
>>
>> Try removing the trailing '/' in the command. The current ompi-restart is not good about differentiating between:
>>
>> ompi_global_snapshot_333.ckpt
>> and
>> ompi_global_snapshot_333.ckpt/
>>
>>> Have you tried OpenMPI 1.5? I got it to work with 1.5, but not with 1.4 (but then I didn't try 1.4 with a shared filesystem).
>>
>> I would also suggest trying v1.4 or 1.5 to see if your problems persist with these versions.
>>
>> -- Josh
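A minimal end-to-end sketch of the workflow described above (the PID 333 and the /shared-dir/ path are just the values from this thread):

$ echo 'snapc_base_global_snapshot_dir=/shared-dir/' >> $HOME/.openmpi/mca-params.conf
$ mpirun -np 4 -am ft-enable-cr ./my_app &      # note mpirun's PID, e.g. 333
$ ompi-checkpoint --term 333                    # checkpoint, then terminate the job
$ ompi-restart ompi_global_snapshot_333.ckpt    # note: no trailing slash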
Re: [OMPI users] question about checkpoint on cluster, mpirun doesn't work on cluster
hi

I solved this problem: some previous versions of directories in the cluster were not removed. After I removed them, it works fine. Thank you.

cheers
fengguang

On Mon, Mar 29, 2010 at 11:47 AM, Josh Hursey wrote:

> Does this happen when you run without '-am ft-enable-cr' (so a no-C/R run)?
>
> This will help us determine if your problem is with the C/R work or with the ORTE runtime. I suspect that there is something odd with your system that is confusing the runtime (so not a C/R problem).
>
> Have you made sure to remove the previous versions of Open MPI from all machines on your cluster before installing the new version? Sometimes problems like this come up because of mismatches in Open MPI versions on a machine.
>
> -- Josh
>
> On Mar 23, 2010, at 5:42 PM, fengguang tian wrote:
>
>> I met the same problem as in this link: http://www.open-mpi.org/community/lists/users/2009/12/11374.php
>>
>> In the link, they give a solution: use v1.4 Open MPI instead of v1.3. But I am using v1.7a1r22794 Open MPI and met the same problem. Here is what I have done:
>>
>> My cluster is composed of two machines: nimbus (master) and nimbus1 (slave). When I run mpirun -np 40 -am ft-enable-cr --hostfile .mpihostfile myapplication on nimbus, it doesn't work. It shows:
>>
>> [nimbus1:21387] opal_os_dirpath_create: Error: Unable to create the sub-directory (/tmp/openmpi-sessions-mpiu@nimbus1_0/59759) of (/tmp/openmpi-sessions-mpiu@nimbus1_0/59759/0/1), mkdir failed [1]
>> [nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: Error in file util/session_dir.c at line 106
>> [nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: Error in file util/session_dir.c at line 399
>> [nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: Error in file base/ess_base_std_orted.c at line 301
>> [nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file rml_oob_send.c at line 104
>> [nimbus1:21387] [[59759,0],1] could not get route to [[INVALID],INVALID]
>> [nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file util/show_help.c at line 602
>> [nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: Error in file ess_env_module.c at line 143
>> [nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file rml_oob_send.c at line 104
>> [nimbus1:21387] [[59759,0],1] could not get route to [[INVALID],INVALID]
>> [nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file util/show_help.c at line 602
>> [nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: Error in file runtime/orte_init.c at line 129
>> [nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file rml_oob_send.c at line 104
>> [nimbus1:21387] [[59759,0],1] could not get route to [[INVALID],INVALID]
>> [nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file util/show_help.c at line 602
>> [nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: Error in file orted/orted_main.c at line 355
>> --------------------------------------------------------------------------
>> A daemon (pid 10737) died unexpectedly with status 255 while attempting to launch so we are aborting.
>>
>> There may be more information reported by the environment (see above).
>>
>> This may be because the daemon was unable to find all the needed shared libraries on the remote node. You may set your LD_LIBRARY_PATH to have the location of the shared libraries on the remote nodes and this will automatically be forwarded to the remote nodes.
>> --------------------------------------------------------------------------
>> mpirun noticed that the job aborted, but has no info as to the process that caused that situation.
>> --------------------------------------------------------------------------
>>
>> cheers
>> fengguang
Re: [OMPI users] questions about checkpoint/restart on multiple clusters of MPI
hi

I have used the --term option, but the mpirun is still hanging. It is the same whether I include the '/' or not. I am installing v1.4 to see whether the problems are still there. I tried, but some problems are still there.

BTW, my MPI program will have some input file, and will generate some output file after some computation. It can be checkpointed, but when restarting it, some error happened. Have you met this kind of problem?

cheers
fengguang

On Mon, Mar 29, 2010 at 11:42 AM, Josh Hursey wrote:

> On Mar 23, 2010, at 1:00 PM, Fernando Lemos wrote:
>
>> On Tue, Mar 23, 2010 at 12:55 PM, fengguang tian wrote:
>>
>>> I use mpirun -np 50 -am ft-enable-cr --mca snapc_base_global_snapshot_dir --hostfile .mpihostfile to store the global checkpoint snapshot into the shared directory /mirror, but the problems are still there: when ompi-checkpoint runs, the mpirun is still not killed; it is hanging there.
>
> So the 'ompi-checkpoint' command does not finish? By default 'ompi-checkpoint' does not terminate the MPI job. If you pass the '--term' option to it, then it will.
>
>>> when doing ompi-restart, it shows:
>>>
>>> mpiu@nimbus:/mirror$ ompi-restart ompi_global_snapshot_333.ckpt/
>>> --------------------------------------------------------------------------
>>> Error: The filename (ompi_global_snapshot_333.ckpt/) is invalid because either you have not provided a filename or provided an invalid filename.
>>> Please see --help for usage.
>>> --------------------------------------------------------------------------
>
> Try removing the trailing '/' in the command. The current ompi-restart is not good about differentiating between:
>
> ompi_global_snapshot_333.ckpt
> and
> ompi_global_snapshot_333.ckpt/
>
>> Have you tried OpenMPI 1.5? I got it to work with 1.5, but not with 1.4 (but then I didn't try 1.4 with a shared filesystem).
>
> I would also suggest trying v1.4 or 1.5 to see if your problems persist with these versions.
>
> -- Josh
Re: [OMPI users] question about checkpoint on cluster, mpirun doesn't work on cluster
Does this happen when you run without '-am ft-enable-cr' (so a no-C/R run)?

This will help us determine if your problem is with the C/R work or with the ORTE runtime. I suspect that there is something odd with your system that is confusing the runtime (so not a C/R problem).

Have you made sure to remove the previous versions of Open MPI from all machines on your cluster before installing the new version? Sometimes problems like this come up because of mismatches in Open MPI versions on a machine.

-- Josh

On Mar 23, 2010, at 5:42 PM, fengguang tian wrote:

> I met the same problem as in this link: http://www.open-mpi.org/community/lists/users/2009/12/11374.php
>
> In the link, they give a solution: use v1.4 Open MPI instead of v1.3. But I am using v1.7a1r22794 Open MPI and met the same problem. Here is what I have done:
>
> My cluster is composed of two machines: nimbus (master) and nimbus1 (slave). When I run mpirun -np 40 -am ft-enable-cr --hostfile .mpihostfile myapplication on nimbus, it doesn't work. It shows:
>
> [nimbus1:21387] opal_os_dirpath_create: Error: Unable to create the sub-directory (/tmp/openmpi-sessions-mpiu@nimbus1_0/59759) of (/tmp/openmpi-sessions-mpiu@nimbus1_0/59759/0/1), mkdir failed [1]
> [nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: Error in file util/session_dir.c at line 106
> [nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: Error in file util/session_dir.c at line 399
> [nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: Error in file base/ess_base_std_orted.c at line 301
> [nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file rml_oob_send.c at line 104
> [nimbus1:21387] [[59759,0],1] could not get route to [[INVALID],INVALID]
> [nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file util/show_help.c at line 602
> [nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: Error in file ess_env_module.c at line 143
> [nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file rml_oob_send.c at line 104
> [nimbus1:21387] [[59759,0],1] could not get route to [[INVALID],INVALID]
> [nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file util/show_help.c at line 602
> [nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: Error in file runtime/orte_init.c at line 129
> [nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file rml_oob_send.c at line 104
> [nimbus1:21387] [[59759,0],1] could not get route to [[INVALID],INVALID]
> [nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file util/show_help.c at line 602
> [nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: Error in file orted/orted_main.c at line 355
> --------------------------------------------------------------------------
> A daemon (pid 10737) died unexpectedly with status 255 while attempting to launch so we are aborting.
>
> There may be more information reported by the environment (see above).
>
> This may be because the daemon was unable to find all the needed shared libraries on the remote node. You may set your LD_LIBRARY_PATH to have the location of the shared libraries on the remote nodes and this will automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
> mpirun noticed that the job aborted, but has no info as to the process that caused that situation.
> --------------------------------------------------------------------------
>
> cheers
> fengguang
Re: [OMPI users] questions about checkpoint/restart on multiple clusters of MPI
On Mar 23, 2010, at 1:00 PM, Fernando Lemos wrote:

> On Tue, Mar 23, 2010 at 12:55 PM, fengguang tian wrote:
>
>> I use mpirun -np 50 -am ft-enable-cr --mca snapc_base_global_snapshot_dir --hostfile .mpihostfile to store the global checkpoint snapshot into the shared directory /mirror, but the problems are still there: when ompi-checkpoint runs, the mpirun is still not killed; it is hanging there.

So the 'ompi-checkpoint' command does not finish? By default 'ompi-checkpoint' does not terminate the MPI job. If you pass the '--term' option to it, then it will.

>> when doing ompi-restart, it shows:
>>
>> mpiu@nimbus:/mirror$ ompi-restart ompi_global_snapshot_333.ckpt/
>> --------------------------------------------------------------------------
>> Error: The filename (ompi_global_snapshot_333.ckpt/) is invalid because either you have not provided a filename or provided an invalid filename.
>> Please see --help for usage.
>> --------------------------------------------------------------------------

Try removing the trailing '/' in the command. The current ompi-restart is not good about differentiating between:

ompi_global_snapshot_333.ckpt
and
ompi_global_snapshot_333.ckpt/

> Have you tried OpenMPI 1.5? I got it to work with 1.5, but not with 1.4 (but then I didn't try 1.4 with a shared filesystem).

I would also suggest trying v1.4 or 1.5 to see if your problems persist with these versions.

-- Josh
Re: [OMPI users] Meaning and the significance of MCA parameter "opal_cr_use_thread"
So the MCA parameter that you mention is explained at the link below:
http://osl.iu.edu/research/ft/ompi-cr/api.php#mca-opal_cr_use_thread

This enables/disables the C/R thread at runtime, if Open MPI was configured with C/R thread support:
http://osl.iu.edu/research/ft/ompi-cr/api.php#conf-enable-ft-thread

The C/R thread enables asynchronous processing of checkpoint requests when the application process is not inside the MPI library. The purpose of this thread is to improve the responsiveness of the checkpoint operation. Without the thread, if the application is in a computation loop, then the checkpoint will be delayed until the process enters the MPI library. With the thread enabled, the checkpoint will start in the C/R thread if the application is not in the MPI library.

The primary advantages of the C/R thread are:
- response time to the C/R request, since the checkpoint is not delayed until the process enters the MPI library;
- asynchronous processing of the checkpoint while the application is executing outside the MPI library (improves the checkpoint overhead experienced by the process).

The primary disadvantage of the C/R thread is the additional processing task running in parallel with the application. If the C/R thread is polling too often, it could slow down the main process by forcing frequent context switches between the C/R thread and the main execution thread. You can adjust the aggressiveness by adjusting the parameters at the link below:
http://osl.iu.edu/research/ft/ompi-cr/api.php#mca-opal_cr_thread_sleep_check

-- Josh

On Mar 24, 2010, at 11:24 AM, wrote:

> The description for MCA parameter "opal_cr_use_thread" is very short at this URL:
> http://osl.iu.edu/research/ft/ompi-cr/api.php
>
> Can someone explain the usefulness of enabling this parameter vs disabling it? In other words, what are the pros/cons of disabling it?
>
> I found that this gets enabled automatically when the openmpi library is configured with the --ft-enable-threads option.
>
> Thanks
> Ananda
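A sketch of applying these knobs for a single run on the mpirun command line (the application name and values are illustrative):

$ mpirun -np 4 -am ft-enable-cr -mca opal_cr_use_thread 0 ./my_app            # disable the C/R thread entirely
$ mpirun -np 4 -am ft-enable-cr -mca opal_cr_thread_sleep_wait 1000 ./my_app  # or keep it, but make it poll less aggressively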
Re: [OMPI users] mpirun with -am ft-enable-cr option takes longer time on certain configurations
On Mar 20, 2010, at 11:14 PM, wrote:

> I am observing a very strange performance issue with my openmpi program.
>
> I have a compute-intensive openmpi-based application that keeps the data in memory, processes the data, and then dumps it to a GPFS parallel file system. The GPFS file system server is connected to a QDR infiniband switch from Voltaire.
>
> If my cluster is connected to a DDR infiniband switch, which in turn connects to the file system server on the QDR switch, I see that I can run my application under checkpoint/restart control (with -am ft-enable-cr), I can checkpoint (ompi-checkpoint) successfully, and the application gets completed after a few additional seconds.
>
> If my cluster is connected to the same QDR switch which connects to the file system server, I see that my application takes close to 10x time to complete if I run it under checkpoint/restart control (with -am ft-enable-cr). If I run the same application using a plain mpirun command (ie; without -am ft-enable-cr), it finishes within a minute.

The 10x slowdown is without taking a checkpoint, correct?

If the checkpoint is taking up part of the bandwidth through the same switch you are communicating with, then you will see diminished performance until the checkpoint is fully established on the storage device(s). Many installations separate the communication and storage networks (or limit the bandwidth of one of them) to prevent one from unexpectedly diminishing the performance of the other, even outside of the C/R context.

However, for a non-checkpointing run to be 10x slower is certainly not normal. Try playing with the C/R thread parameters (mentioned in a previous email) and see if that helps. If not, we might be able to try other things.

-- Josh

> I am using open mpi 1.3.4 and BLCR 0.8.2 for checkpointing.
>
> Are there any specific MCA parameters that I should tune to address this problem? Any other pointers will be really helpful.
>
> Thanks
> Anand
Re: [OMPI users] mpirun with -am ft-enable-cr option runs slow if hyperthreading is disabled
On Mar 22, 2010, at 4:41 PM, wrote:

> Hi
>
> If I run my compute-intensive openmpi-based program using a regular invocation of mpirun (ie; mpirun -host <hosts> -np <no of cores>), it gets completed in a few seconds, but if I run the same program with the "-am ft-enable-cr" option, the program takes 10x time to complete.
>
> If I enable hyperthreading on my cluster nodes and then call mpirun with the "-am ft-enable-cr" option, the program gets completed with only a few additional seconds over the normal mpirun!!
>
> How can I improve the performance of mpirun with the "-am ft-enable-cr" option when I disable hyperthreading on my cluster nodes? Any pointers will be really useful.
>
> FYI, I am using the openmpi 1.3.4 library and BLCR 0.8.2. Cluster nodes are Nehalem-based nodes with 8 cores.

I have not done any performance studies focused on hyperthreading, so I cannot say specifically what is happening. The 10x slowdown is certainly unexpected (I don't see this in my testing). There usually is a small slowdown (a few microseconds) because of the message-tracking technique used to support the checkpoint coordination protocol.

I suspect that the cause of your problem is the C/R thread, which is probably too aggressive for your system. The improvement with hyperthreading may be that this thread is able to sit on one of the hardware threads and not completely steal the CPU from the main application. You can change how aggressive the thread is by adjusting the two parameters below:

http://osl.iu.edu/research/ft/ompi-cr/api.php#mca-opal_cr_thread_sleep_check
http://osl.iu.edu/research/ft/ompi-cr/api.php#mca-opal_cr_thread_sleep_wait

I usually set the latter to:
opal_cr_thread_sleep_wait=1000

Give that a try and let me know if that helps. You might also try upgrading to the 1.4 series, or even the upcoming v1.5.0 release, and see if the problem persists there.

-- Josh

> Thanks
> Anand
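To make the suggested value stick across runs rather than typing it each time, it can go in the per-user parameter file (a sketch; the file location follows the convention mentioned elsewhere in this thread):

$ echo 'opal_cr_thread_sleep_wait=1000' >> $HOME/.openmpi/mca-params.conf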
Re: [OMPI users] top command output shows huge CPU utilization when openmpi processes resume after the checkpoint
On Mar 21, 2010, at 12:58 PM, Addepalli, Srirangam V wrote:

> Yes. We have seen this behavior too. Another behavior I have seen is that one MPI process starts to show a different elapsed time than its peers. Is it because a checkpoint happened on behalf of this process?
>
> R
>
> From: users-boun...@open-mpi.org [users-boun...@open-mpi.org] On Behalf Of ananda.mu...@wipro.com [ananda.mu...@wipro.com]
> Sent: Saturday, March 20, 2010 10:18 PM
> To: us...@open-mpi.org
> Subject: [OMPI users] top command output shows huge CPU utilization when openmpi processes resume after the checkpoint
>
>> When I checkpoint my openmpi application using ompi_checkpoint, I see that the top command suddenly shows some really huge numbers in the "CPU %" field, such as 150%, 200%, etc. After some time, these numbers do come back to the normal numbers under 100%. This happens exactly around the time the checkpoint is completed and the processes are resuming execution.

One cause for this type of CPU utilization is the C/R thread. During non-checkpoint/normal processing, the thread polls for a checkpoint fairly aggressively. You can change how aggressive the thread is by adjusting the two parameters below:

http://osl.iu.edu/research/ft/ompi-cr/api.php#mca-opal_cr_thread_sleep_check
http://osl.iu.edu/research/ft/ompi-cr/api.php#mca-opal_cr_thread_sleep_wait

I usually set the latter to:
opal_cr_thread_sleep_wait=1000

You can also turn off the C/R thread, either by configure'ing without it, or by disabling it at runtime by setting the 'opal_cr_use_thread' parameter to '0':

http://osl.iu.edu/research/ft/ompi-cr/api.php#mca-opal_cr_use_thread

The CPU increase during the checkpoint may be due to both the Open MPI C/R thread and the BLCR thread becoming active on the machine. You might try to determine whether this is BLCR's CPU utilization or Open MPI's by creating a single-process application and watching the CPU utilization when checkpointing with BLCR. You may also want to look at the memory consumption of the process to make sure that there is enough for BLCR to run efficiently.

This may also be due to processes that have finished their checkpoint waiting on peer processes to finish. I don't think we have a good way to control how aggressively these waiting processes poll for completion of peers. If this becomes a problem, we can look into adding a parameter similar to opal_cr_thread_sleep_wait to throttle the polling on the machine. The disadvantage of making the various polling-for-completion loops less aggressive is that the checkpoint may stall the checkpoint and/or application for a little longer than necessary. But if this is acceptable to the user, then they can adjust the MCA parameters as necessary.

> Another behavior I have seen is that one MPI process starts to show a different elapsed time than its peers. Is it because a checkpoint happened on behalf of this process?

Can you explain a bit more about what you mean by this? Neither Open MPI nor BLCR messes with the timer on the machine, so we are not changing it in any way. The process is 'stopped' briefly while BLCR takes the checkpoint, so this will extend the running time of the process. How much the running time is extended (a.k.a. checkpoint overhead) is determined by a bunch of things, but primarily by the storage device(s) that the checkpoint is being written to.

> For your reference, I am using open mpi 1.3.4 and BLCR 0.8.2 for checkpointing.

It would be interesting to know if you see the same behavior with the trunk or v1.5 series of Open MPI.

Hope that helps,
Josh

> Thanks
> Anand
Re: [OMPI users] configuration and compilation outputs
I don't see -static listed in the config.log at all, but I see it listed in the make output that you sent in the first mail. Additionally, the make output that you sent in your mail doesn't seem to match the make.output that you attached in your last email. Are you mixing and matching multiple builds by accident, perchance?

FWIW, it's typically best to set flags in configure via the configure command line, like this:

./configure CFLAGS=-static etc...

rather than setenv'ing them before running configure. The (minor) advantage of this is that all the flags are then recorded in the config.log file. If you setenv them, then config.log doesn't show everything.

On Mar 29, 2010, at 9:02 AM, Philippe GOURET wrote:

--
Jeff Squyres
jsquy...@cisco.com
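A sketch of the suggested approach (the install prefix is illustrative); because the flags are passed on the configure command line, config.log records them:

$ ./configure CFLAGS=-static LDFLAGS=-static --prefix=$HOME/openmpi-static
$ grep './configure' config.log | head -1    # the full invocation, flags included, is logged here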
[OMPI users] configuration and compilation outputs
[Attachment: outputs.tar.gz]
Re: [OMPI users] LAM: static
(moving to the Open MPI user's mailing list...)

Can you send all the information listed here (please compress!):

http://www.open-mpi.org/community/help/

On Mar 29, 2010, at 8:21 AM, Philippe GOURET wrote:

> The make failed with Open MPI:
>
> Making all in tools/wrappers
> make[2]: Entering directory `/home/philippe/tmp/openmpi-1.4.1/opal/tools/wrappers'
> depbase=`echo opal_wrapper.o | sed 's|[^/]*$|.deps/&|;s|\.o$||'`;\
> gcc "-DEXEEXT=\"\"" -I. -I../../../opal/include -I../../../orte/include -I../../../ompi/include -I../../../opal/mca/paffinity/linux/plpa/src/libplpa -I../../.. -static -O3 -DNDEBUG -static -finline-functions -fno-strict-aliasing -pthread -fvisibility=hidden -MT opal_wrapper.o -MD -MP -MF $depbase.Tpo -c -o opal_wrapper.o opal_wrapper.c &&\
> mv -f $depbase.Tpo $depbase.Po
> /bin/sh ../../../libtool --tag=CC --mode=link gcc -O3 -DNDEBUG -static -finline-functions -fno-strict-aliasing -pthread -fvisibility=hidden -export-dynamic -static -o opal_wrapper opal_wrapper.o ../../../opal/libopen-pal.la -lnsl -lutil -lm
> libtool: link: gcc -O3 -DNDEBUG -finline-functions -fno-strict-aliasing -pthread -fvisibility=hidden -o opal_wrapper opal_wrapper.o -Wl,--export-dynamic ../../../opal/.libs/libopen-pal.a -ldl -lnsl -lutil -lm -pthread
> ../../../opal/.libs/libopen-pal.a(opal_ptmalloc2_munmap.o): In function `opal_mem_free_ptmalloc2_munmap':
> opal_ptmalloc2_munmap.c:(.text+0x42): undefined reference to `__munmap'
> ../../../opal/.libs/libopen-pal.a(opal_ptmalloc2_munmap.o): In function `munmap':
> opal_ptmalloc2_munmap.c:(.text+0x8d): undefined reference to `__munmap'
> collect2: ld returned 1 exit status
> make[2]: *** [opal_wrapper] Error 1
> make[2]: Leaving directory `/home/philippe/tmp/openmpi-1.4.1/opal/tools/wrappers'
> make[1]: *** [all-recursive] Error 1
> make[1]: Leaving directory `/home/philippe/tmp/openmpi-1.4.1/opal'
> make: *** [all-recursive] Error 1
>
> Do you know why?
>
> Moreover, like with lam, the final gcc call doesn't have the "-static" option, but the "libtool" line has it!
>
> Thanks
>
>> From: Jeff Squyres
>> Sent: Mon Mar 29 14:01:50 CEST 2010
>> To: Philippe GOURET, General LAM/MPI mailing list
>> Subject: Re: LAM: static
>>
>> It could be that the Libtool included in LAM/MPI is so old that it is not passing -static through properly...?
>>
>> Is it possible for you to upgrade to Open MPI?
>>
>> On Mar 29, 2010, at 7:42 AM, Philippe GOURET wrote:
>>
>>> Hi
>>>
>>> I need to deploy a 32-bit version of lam-7.1.4 on a 64-bit computer. So I would like to build lam-7.1.4 in a static way. I just added the "-static" option to some environment variables: CFLAGS, LDFLAGS, CXXLDFLAGS, CXXFLAGS, but when I verify the built runtimes with the ldd command, I always see:
>>>
>>> linux-gate.so.1 => (0xe000)
>>> libdl.so.2 => /lib/libdl.so.2 (0xb7f6e000)
>>> libutil.so.1 => /lib/libutil.so.1 (0xb7f6a000)
>>> libpthread.so.0 => /lib/i686/libpthread.so.0 (0xb7f53000)
>>> libc.so.6 => /lib/i686/libc.so.6 (0xb7e13000)
>>> /lib/ld-linux.so.2 (0xb7f89000)
>>>
>>> If I look at the "make" trace, for example for the lamboot runtime, I see:
>>>
>>> ...
>>> Making all in lamboot
>>> make[2]: Entering directory `/home/philippe/tmp/lam-7.1.4/tools/lamboot'
>>> if gcc -DHAVE_CONFIG_H -I. -I. -I../../share/include -DLAM_SYSCONFDIR="\"/usr/local/etc\"" -DBOOT_MODULES="\"globus rsh slurm\"" -DRPI_MODULES="\"crtcp lamd sysv tcp usysv\"" -DCOLL_MODULES="\"lam_basic shmem smp\"" -I../../share/include -static -DLAM_BUILDING=1 -O3 -static -pthread -MT lamboot.o -MD -MP -MF ".deps/lamboot.Tpo" -c -o lamboot.o lamboot.c; \
>>> then mv -f ".deps/lamboot.Tpo" ".deps/lamboot.Po"; else rm -f ".deps/lamboot.Tpo"; exit 1; fi
>>> /bin/sh ../../libtool --tag=CC --mode=link gcc -O3 -static -pthread -static -o lamboot lamboot.o ../../share/liblam/liblam.la -lutil
>>> mkdir .libs
>>> gcc -O3 -pthread -o lamboot lamboot.o ../../share/liblam/.libs/liblam.a -ldl -lutil
>>> make[2]: Leaving directory `/home/philippe/tmp/lam-7.1.4/tools/lamboot'
>>> ...
>>>
>>> Did I miss something?
>>>
>>> Thanks in advance
>>>
>>> Best regards
>>> Philippe Gouret

--
Jeff Squyres
jsquy...@cisco.com
Re: [OMPI users] mpi.h file is missing in openmpi
Hi Reuti,

Thank you so much. I installed openmpi locally from the source. It has all the header files in the include folder. I could install CHARMM without any problem.

Best regards,
Sunita

> Hi,
>
> On 25.03.2010, at 11:30, sun...@chem.iitb.ac.in wrote:
>
>> Openmpi is installed on an Intel Xeon quad-core 2.4GHz machine loaded with Red Hat Enterprise Linux 5. The loaded openmpi version is 1.2.5. While trying to install the CHARMM software, it asked for the path of the mpi.h file and the library files (libmpi). I didn't find an 'include' folder in the openmpi folder containing the header files like mpi.h. However, it contains 'bin', 'etc', 'lib' and 'share' sub-folders.
>
> Maybe only the runtime package and not the developer package was installed. Due to the ancient version you have, I would suggest downloading the current source and installing on your own with:
>
> $ ./configure --prefix=/home/patel/local/openmpi-1.4.1
>
> and after a "make" and "make install" you can access the current header and library files. For CHARMM it might be necessary to supply the path to the include files with -I/home/patel/local/openmpi-1.4.1/include in CFLAGS, and the path to your libs in LDFLAGS with -L/home/patel/local/openmpi-1.4.1/lib (names may be different in CHARMM though).
>
> As long as you built the dynamic version, at runtime it's also necessary to export LD_LIBRARY_PATH=/home/patel/local/openmpi-1.4.1/lib:$LD_LIBRARY_PATH
>
> -- Reuti
>
>> It looks like the mpi.h file does not exist. Which version of openmpi has the mpi.h header file?
>>
>> Any help would be appreciated.
>>
>> Regards,
>> Dr. Sunita Patel
>> -
>> Visiting Fellow
>> Department of Chemical Sciences
>> T.I.F.R., Homi Bhabha Road, Colaba
>> Mumbai - 45
>> -
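For anyone hunting for the right -I/-L paths for a given Open MPI install, the wrapper compiler can report them itself (a sketch; the prefix is the one used in the quoted message):

$ /home/patel/local/openmpi-1.4.1/bin/mpicc -showme:compile   # prints the include flags
$ /home/patel/local/openmpi-1.4.1/bin/mpicc -showme:link      # prints the library flags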