Re: [OMPI users] MPI_Init never returns on IA64
Could you try one of the 1.4.2 nightly tarballs and see if that makes the issue better? http://www.open-mpi.org/nightly/v1.4/

On Mar 29, 2010, at 7:47 PM, Shaun Jackman wrote:

> Hi,
>
> On an IA64 platform, MPI_Init never returns. I fired up GDB, and it seems
> that ompi_free_list_grow never returns. My test program does nothing but
> call MPI_Init. Here's the backtrace:
>
> (gdb) bt
> #0  0x20075620 in ompi_free_list_grow () from /home/aubjtl/openmpi/lib/libmpi.so.0
> #1  0x20078e50 in ompi_rb_tree_init () from /home/aubjtl/openmpi/lib/libmpi.so.0
> #2  0x20160840 in mca_mpool_base_tree_init () from /home/aubjtl/openmpi/lib/libmpi.so.0
> #3  0x2015dac0 in mca_mpool_base_open () from /home/aubjtl/openmpi/lib/libmpi.so.0
> #4  0x200bfd30 in ompi_mpi_init () from /home/aubjtl/openmpi/lib/libmpi.so.0
> #5  0x2010efb0 in PMPI_Init () from /home/aubjtl/openmpi/lib/libmpi.so.0
> #6  0x4b70 in main ()
>
> Any suggestions on how I can troubleshoot?
>
> $ mpirun --version
> mpirun (Open MPI) 1.4.1
> $ ./config.guess
> ia64-unknown-linux-gnu
>
> Thanks,
> Shaun

--
Jeff Squyres
jsquy...@cisco.com
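A minimal sketch of testing a nightly tarball, with debug symbols enabled so that any future backtrace resolves to source lines (the snapshot filename and the test-program name below are hypothetical; --enable-debug is Open MPI's standard debug-build configure switch):

$ wget http://www.open-mpi.org/nightly/v1.4/openmpi-1.4.2rc1.tar.gz   # hypothetical snapshot name; pick whatever is current
$ tar xzf openmpi-1.4.2rc1.tar.gz && cd openmpi-1.4.2rc1
$ ./configure --prefix=$HOME/openmpi-nightly --enable-debug
$ make all install
$ $HOME/openmpi-nightly/bin/mpirun -np 1 ./init_only   # re-run the MPI_Init-only test against the new build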
[OMPI users] MPI_Init never returns on IA64
Hi,

On an IA64 platform, MPI_Init never returns. I fired up GDB, and it seems that ompi_free_list_grow never returns. My test program does nothing but call MPI_Init. Here's the backtrace:

(gdb) bt
#0  0x20075620 in ompi_free_list_grow () from /home/aubjtl/openmpi/lib/libmpi.so.0
#1  0x20078e50 in ompi_rb_tree_init () from /home/aubjtl/openmpi/lib/libmpi.so.0
#2  0x20160840 in mca_mpool_base_tree_init () from /home/aubjtl/openmpi/lib/libmpi.so.0
#3  0x2015dac0 in mca_mpool_base_open () from /home/aubjtl/openmpi/lib/libmpi.so.0
#4  0x200bfd30 in ompi_mpi_init () from /home/aubjtl/openmpi/lib/libmpi.so.0
#5  0x2010efb0 in PMPI_Init () from /home/aubjtl/openmpi/lib/libmpi.so.0
#6  0x4b70 in main ()

Any suggestions on how I can troubleshoot?

$ mpirun --version
mpirun (Open MPI) 1.4.1
$ ./config.guess
ia64-unknown-linux-gnu

Thanks,
Shaun
Re: [OMPI users] openMPI on Xgrid
I have an environment a few trusted users could use to test. However, I have neither the expertise nor the time to do the debugging myself.

Cheers, Jody

On 2010-03-29, at 1:27 PM, Jeff Squyres wrote:

> On Mar 29, 2010, at 4:11 PM, Cristobal Navarro wrote:
>
>> I realized that the Xcode dev tools include openMPI 1.2.x.
>> Should I keep trying? Or do you recommend completely abandoning xgrid and going for another tool like Torque with openMPI?
>
> FWIW, Open MPI v1.2.x is fairly ancient -- the v1.4 series includes a few years' worth of improvements and bug fixes since the 1.2 series.
>
> It would be great (hint hint) if someone could fix the xgrid support for us... We simply no longer have anyone in the active development group who has the expertise or test environment to make our xgrid work. :-(
>
> --
> Jeff Squyres
> jsquy...@cisco.com
Re: [OMPI users] openMPI on Xgrid
On Mar 29, 2010, at 4:11 PM, Cristobal Navarro wrote:

> I realized that the Xcode dev tools include openMPI 1.2.x.
> Should I keep trying? Or do you recommend completely abandoning xgrid and going for another tool like Torque with openMPI?

FWIW, Open MPI v1.2.x is fairly ancient -- the v1.4 series includes a few years' worth of improvements and bug fixes since the 1.2 series.

It would be great (hint hint) if someone could fix the xgrid support for us... We simply no longer have anyone in the active development group who has the expertise or test environment to make our xgrid work. :-(

--
Jeff Squyres
jsquy...@cisco.com
Re: [OMPI users] openMPI on Xgrid
At least it would be a good exercise to complete the process with xgrid + openMPI, for the knowledge.

Cristobal

On Mon, Mar 29, 2010 at 4:11 PM, Cristobal Navarro wrote:

> I realized that the Xcode dev tools include openMPI 1.2.x.
> Should I keep trying? Or do you recommend completely abandoning xgrid and going for another tool like Torque with openMPI?
>
> On Mon, Mar 29, 2010 at 3:48 PM, Jody Klymak wrote:
>
>> On Mar 29, 2010, at 12:39 PM, Ralph Castain wrote:
>>
>>> On Mar 29, 2010, at 1:34 PM, Cristobal Navarro wrote:
>>>
>>>> thanks for the information,
>>>> but is it possible to make it work with xgrid, or does the 1.4.1 version just not support it?
>>
>> FWIW, I've had excellent success with Torque and openmpi on OS-X 10.5 Server.
>>
>> http://www.clusterresources.com/products/torque-resource-manager.php
>>
>> It doesn't have a nice dashboard, but the queue tools are more than adequate for my needs.
>>
>> Open MPI had a funny port issue on my setup that folks helped with.
>>
>> From my notes:
>>
>> Edited /Network/Xgrid/openmpi/etc/openmpi-mca-params.conf to make sure that the right ports are used:
>>
>> # set ports so that they are more valid than the default ones (see email from Ralph Castain)
>> btl_tcp_port_min_v4 = 36900
>> btl_tcp_port_range = 32
>>
>> Cheers, Jody
>>
>> --
>> Jody Klymak
>> http://web.uvic.ca/~jklymak/
Re: [OMPI users] openMPI on Xgrid
I realized that the Xcode dev tools include openMPI 1.2.x. Should I keep trying? Or do you recommend completely abandoning xgrid and going for another tool like Torque with openMPI?

On Mon, Mar 29, 2010 at 3:48 PM, Jody Klymak wrote:

> On Mar 29, 2010, at 12:39 PM, Ralph Castain wrote:
>
>> On Mar 29, 2010, at 1:34 PM, Cristobal Navarro wrote:
>>
>>> thanks for the information,
>>> but is it possible to make it work with xgrid, or does the 1.4.1 version just not support it?
>
> FWIW, I've had excellent success with Torque and openmpi on OS-X 10.5 Server.
>
> http://www.clusterresources.com/products/torque-resource-manager.php
>
> It doesn't have a nice dashboard, but the queue tools are more than adequate for my needs.
>
> Open MPI had a funny port issue on my setup that folks helped with.
>
> From my notes:
>
> Edited /Network/Xgrid/openmpi/etc/openmpi-mca-params.conf to make sure that the right ports are used:
>
> # set ports so that they are more valid than the default ones (see email from Ralph Castain)
> btl_tcp_port_min_v4 = 36900
> btl_tcp_port_range = 32
>
> Cheers, Jody
>
> --
> Jody Klymak
> http://web.uvic.ca/~jklymak/
[OMPI users] OPEN_MPI macro for mpif.h?
Hello,

Looking at the Open MPI mpi.h include file, there is a preprocessor macro OPEN_MPI defined, as well as e.g. OMPI_MAJOR_VERSION, OMPI_MINOR_VERSION and OMPI_RELEASE_VERSION. version.h e.g. also defines OMPI_VERSION.

These seem to be missing in mpif.h, and therefore something like

      include 'mpif.h'
      [...]
#ifdef OPEN_MPI
      write( *, '("MPI library: OpenMPI",I2,".",I2,".",I2)' ) &
     &    OMPI_MAJOR_VERSION, OMPI_MINOR_VERSION, OMPI_RELEASE_VERSION
#endif

doesn't work for a Fortran Open MPI program. Which Open MPI specific preprocessor macros should be used with the Fortran bindings?

Thanks, Martin

--
Dr.-Ing. Martin Bernreuther
University of Stuttgart
High Performance Computing Center (HLRS)
Nobelstrasse 19 (Office: Allmandring 30, 0.032)
70569 Stuttgart, Germany
Phone: (++49-(0)711) 685-64542, Fax: (++49-(0)711) 685-65832
E-Mail: bernreut...@hlrs.de
URL: http://www.hlrs.de/people/bernreuther/
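One way to check what each header actually provides is to grep for the version macros (a sketch; the /usr/local/include path is an assumed install prefix):

$ grep -E 'OMPI_(MAJOR|MINOR|RELEASE)_VERSION' /usr/local/include/mpi.h    # the C header carries these
$ grep -E 'OMPI_(MAJOR|MINOR|RELEASE)_VERSION' /usr/local/include/mpif.h   # empty output would confirm they are absent from the Fortran header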
Re: [OMPI users] openMPI on Xgrid
On Mar 29, 2010, at 12:39 PM, Ralph Castain wrote:

> On Mar 29, 2010, at 1:34 PM, Cristobal Navarro wrote:
>
>> thanks for the information,
>> but is it possible to make it work with xgrid, or does the 1.4.1 version just not support it?

FWIW, I've had excellent success with Torque and openmpi on OS-X 10.5 Server.

http://www.clusterresources.com/products/torque-resource-manager.php

It doesn't have a nice dashboard, but the queue tools are more than adequate for my needs.

Open MPI had a funny port issue on my setup that folks helped with.

From my notes:

Edited /Network/Xgrid/openmpi/etc/openmpi-mca-params.conf to make sure that the right ports are used:

# set ports so that they are more valid than the default ones (see email from Ralph Castain)
btl_tcp_port_min_v4 = 36900
btl_tcp_port_range = 32

Cheers, Jody

--
Jody Klymak
http://web.uvic.ca/~jklymak/
Re: [OMPI users] openMPI on Xgrid
On Mar 29, 2010, at 1:34 PM, Cristobal Navarro wrote:

> thanks for the information,
>
> but is it possible to make it work with xgrid, or does the 1.4.1 version just not support it?

I'm afraid it just doesn't support it - we made the support compile, but we have no way to test/debug the operation, so it is turned "off".

> On Mon, Mar 29, 2010 at 3:07 PM, Ralph Castain wrote:
>
> Our xgrid support has been broken for some time now due to lack of access to a test environment. So your system is using rsh/ssh instead.
>
> Until we get someone interested in xgrid, or at least willing to debug it and tell us what needs to be done, I'm afraid our xgrid support will be lacking.
>
> On Mar 29, 2010, at 12:56 PM, Cristobal Navarro wrote:
>
>> Hello,
>> I am new on this mailing list!
>> I've read the other messages about configuring openMPI on Xgrid, but I haven't solved my problem yet, and openMPI keeps running as if Xgrid didn't exist.
>>
>> I configured xgrid properly, and I can send simple C program jobs through the command line from my client, which for the moment is the same machine as the controller and the agent.
>>
>> xgrid -h localhost -p pass -job run ./helloWorld
>>
>> I also installed Xgrid Admin for monitoring.
>>
>> Then I compiled openMPI 1.4.1 with these options:
>>
>> ./configure --prefix=/usr/local/openmpi/ --enable-shared --disable-static --with-xgrid
>> sudo make
>> sudo make install
>>
>> and I made a simple helloMPI example:
>>
>> /* MPI C Example */
>> #include <stdio.h>
>> #include <mpi.h>
>>
>> int main (argc, argv)
>> int argc;
>> char *argv[];
>> {
>>   int rank, size;
>>
>>   MPI_Init (&argc, &argv);               /* starts MPI */
>>   MPI_Comm_rank (MPI_COMM_WORLD, &rank); /* get current process id */
>>   MPI_Comm_size (MPI_COMM_WORLD, &size); /* get number of processes */
>>   printf( "Hello world from process %d of %d\n", rank, size );
>>   MPI_Finalize();
>>   return 0;
>> }
>>
>> and it compiled successfully:
>>
>> mpicc hellompi.c -o hellompi
>>
>> Then I ran it:
>>
>> mpirun -np 2 hellompi
>> I am running on ijorge.local
>> Hello World from process 0 of 2
>> I am running on ijorge.local
>> Hello World from process 1 of 2
>>
>> The results are correct, but when I check Xgrid Admin, I see that the execution didn't go through Xgrid, since there aren't any new jobs on the list. In the end, openMPI and Xgrid are not communicating with each other.
>>
>> What am I missing?
>>
>> My environment variables are these:
>>
>> echo $XGRID_CONTROLLER_HOSTNAME
>> ijorge.local
>> echo $XGRID_CONTROLLER_PASSWORD
>> myPassword
>>
>> Any help is welcome!
>> Thanks in advance
>>
>> Cristobal
Re: [OMPI users] openMPI on Xgrid
thanks for the information,

but is it possible to make it work with xgrid, or does the 1.4.1 version just not support it?

On Mon, Mar 29, 2010 at 3:07 PM, Ralph Castain wrote:

> Our xgrid support has been broken for some time now due to lack of access to a test environment. So your system is using rsh/ssh instead.
>
> Until we get someone interested in xgrid, or at least willing to debug it and tell us what needs to be done, I'm afraid our xgrid support will be lacking.
>
> On Mar 29, 2010, at 12:56 PM, Cristobal Navarro wrote:
>
>> Hello,
>> I am new on this mailing list!
>> I've read the other messages about configuring openMPI on Xgrid, but I haven't solved my problem yet, and openMPI keeps running as if Xgrid didn't exist.
>>
>> I configured xgrid properly, and I can send simple C program jobs through the command line from my client, which for the moment is the same machine as the controller and the agent.
>>
>> xgrid -h localhost -p pass -job run ./helloWorld
>>
>> I also installed Xgrid Admin for monitoring.
>>
>> Then I compiled openMPI 1.4.1 with these options:
>>
>> ./configure --prefix=/usr/local/openmpi/ --enable-shared --disable-static --with-xgrid
>> sudo make
>> sudo make install
>>
>> and I made a simple helloMPI example:
>>
>> /* MPI C Example */
>> #include <stdio.h>
>> #include <mpi.h>
>>
>> int main (argc, argv)
>> int argc;
>> char *argv[];
>> {
>>   int rank, size;
>>
>>   MPI_Init (&argc, &argv);               /* starts MPI */
>>   MPI_Comm_rank (MPI_COMM_WORLD, &rank); /* get current process id */
>>   MPI_Comm_size (MPI_COMM_WORLD, &size); /* get number of processes */
>>   printf( "Hello world from process %d of %d\n", rank, size );
>>   MPI_Finalize();
>>   return 0;
>> }
>>
>> and it compiled successfully:
>>
>> mpicc hellompi.c -o hellompi
>>
>> Then I ran it:
>>
>> mpirun -np 2 hellompi
>> I am running on ijorge.local
>> Hello World from process 0 of 2
>> I am running on ijorge.local
>> Hello World from process 1 of 2
>>
>> The results are correct, but when I check Xgrid Admin, I see that the execution didn't go through Xgrid, since there aren't any new jobs on the list. In the end, openMPI and Xgrid are not communicating with each other.
>>
>> What am I missing?
>>
>> My environment variables are these:
>>
>> echo $XGRID_CONTROLLER_HOSTNAME
>> ijorge.local
>> echo $XGRID_CONTROLLER_PASSWORD
>> myPassword
>>
>> Any help is welcome!
>> Thanks in advance
>>
>> Cristobal
Re: [OMPI users] openMPI on Xgrid
Our xgrid support has been broken for some time now due to lack of access to a test environment. So your system is using rsh/ssh instead.

Until we get someone interested in xgrid, or at least willing to debug it and tell us what needs to be done, I'm afraid our xgrid support will be lacking.

On Mar 29, 2010, at 12:56 PM, Cristobal Navarro wrote:

> Hello,
> I am new on this mailing list!
> I've read the other messages about configuring openMPI on Xgrid, but I haven't solved my problem yet, and openMPI keeps running as if Xgrid didn't exist.
>
> I configured xgrid properly, and I can send simple C program jobs through the command line from my client, which for the moment is the same machine as the controller and the agent.
>
> xgrid -h localhost -p pass -job run ./helloWorld
>
> I also installed Xgrid Admin for monitoring.
>
> Then I compiled openMPI 1.4.1 with these options:
>
> ./configure --prefix=/usr/local/openmpi/ --enable-shared --disable-static --with-xgrid
> sudo make
> sudo make install
>
> and I made a simple helloMPI example:
>
> /* MPI C Example */
> #include <stdio.h>
> #include <mpi.h>
>
> int main (argc, argv)
> int argc;
> char *argv[];
> {
>   int rank, size;
>
>   MPI_Init (&argc, &argv);               /* starts MPI */
>   MPI_Comm_rank (MPI_COMM_WORLD, &rank); /* get current process id */
>   MPI_Comm_size (MPI_COMM_WORLD, &size); /* get number of processes */
>   printf( "Hello world from process %d of %d\n", rank, size );
>   MPI_Finalize();
>   return 0;
> }
>
> and it compiled successfully:
>
> mpicc hellompi.c -o hellompi
>
> Then I ran it:
>
> mpirun -np 2 hellompi
> I am running on ijorge.local
> Hello World from process 0 of 2
> I am running on ijorge.local
> Hello World from process 1 of 2
>
> The results are correct, but when I check Xgrid Admin, I see that the execution didn't go through Xgrid, since there aren't any new jobs on the list. In the end, openMPI and Xgrid are not communicating with each other.
>
> What am I missing?
>
> My environment variables are these:
>
> echo $XGRID_CONTROLLER_HOSTNAME
> ijorge.local
> echo $XGRID_CONTROLLER_PASSWORD
> myPassword
>
> Any help is welcome!
> Thanks in advance
>
> Cristobal
[OMPI users] openMPI on Xgrid
Hello,
I am new on this mailing list!
I've read the other messages about configuring openMPI on Xgrid, but I haven't solved my problem yet, and openMPI keeps running as if Xgrid didn't exist.

I configured xgrid properly, and I can send simple C program jobs through the command line from my client, which for the moment is the same machine as the controller and the agent.

>> xgrid -h localhost -p pass -job run ./helloWorld

I also installed Xgrid Admin for monitoring.

Then I compiled openMPI 1.4.1 with these options:

./configure --prefix=/usr/local/openmpi/ --enable-shared --disable-static --with-xgrid
sudo make
sudo make install

and I made a simple helloMPI example:

/* MPI C Example */
#include <stdio.h>
#include <mpi.h>

int main (argc, argv)
int argc;
char *argv[];
{
  int rank, size;

  MPI_Init (&argc, &argv);               /* starts MPI */
  MPI_Comm_rank (MPI_COMM_WORLD, &rank); /* get current process id */
  MPI_Comm_size (MPI_COMM_WORLD, &size); /* get number of processes */
  printf( "Hello world from process %d of %d\n", rank, size );
  MPI_Finalize();
  return 0;
}

and it compiled successfully:

>> mpicc hellompi.c -o hellompi

Then I ran it:

>> mpirun -np 2 hellompi
I am running on ijorge.local
Hello World from process 0 of 2
I am running on ijorge.local
Hello World from process 1 of 2

The results are correct, but when I check Xgrid Admin, I see that the execution didn't go through Xgrid, since there aren't any new jobs on the list. In the end, openMPI and Xgrid are not communicating with each other.

What am I missing?

My environment variables are these:

>> echo $XGRID_CONTROLLER_HOSTNAME
ijorge.local
>> echo $XGRID_CONTROLLER_PASSWORD
myPassword

Any help is welcome!
Thanks in advance

Cristobal
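A quick sanity check for a setup like this (a sketch; it assumes the --with-xgrid build is the one on your PATH) is to ask ompi_info whether an xgrid launcher component was compiled in at all, and to ask mpirun which launcher it actually selects:

$ ompi_info | grep -i xgrid                          # no output means no xgrid component was built
$ mpirun -mca plm_base_verbose 10 -np 2 hellompi     # verbose output names the launcher being used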
Re: [OMPI users] Segmentation fault (11)
Hi Josh/All,

I just tested a simple C application with BLCR and it worked fine.

##
#include <stdio.h>
#include <string.h>
#include <unistd.h>

char * getprocessid()
{
  FILE * read_fp;
  char buffer[BUFSIZ + 1];
  int chars_read;
  char * buffer_data = "12345";
  memset(buffer, '\0', sizeof(buffer));
  read_fp = popen("uname -a", "r");
  /* ... */
  return buffer_data;
}

int main(int argc, char ** argv)
{
  int rank;
  int size;
  char * thedata;
  int n = 0;
  thedata = getprocessid();
  printf(" the data is %s", thedata);
  while (n < 10) {
    printf("value is %d\n", n);
    n++;
    sleep(1);
  }
  printf("bye\n");
}

jean@sun32:/tmp$ cr_run ./pipetest3 &
[1] 31807
jean@sun32:~$ the data is 12345value is 0
value is 1
value is 2
...
value is 9
bye
jean@sun32:/tmp$ cr_checkpoint 31807
jean@sun32:/tmp$ cr_restart context.31807
value is 7
value is 8
value is 9
bye
##

It looks like it's more to do with Open MPI. Any ideas from your side?

Thank you.

Kind regards,
Jean.

--- On Mon, 29/3/10, Josh Hursey wrote:

From: Josh Hursey
Subject: Re: [OMPI users] Segmentation fault (11)
To: "Open MPI Users"
Date: Monday, 29 March, 2010, 16:08

I wonder if this is a bug with BLCR (since the segv stack is in the BLCR thread). Can you try a non-MPI version of this application that uses popen(), and see if BLCR properly checkpoints/restarts it? If so, we can start to see what Open MPI might be doing to confuse things, but I suspect that this might be a bug with BLCR.

Either way, let us know what you find out.

Cheers,
Josh

On Mar 27, 2010, at 6:17 AM, jody wrote:

> I'm not sure if this is the cause of your problems:
> You define the constant BUFFER_SIZE, but in the code you use a constant called BUFSIZ...
>
> Jody
>
> On Fri, Mar 26, 2010 at 10:29 PM, Jean Potsam wrote:
>
>> Dear All,
>> I am having a problem with openmpi. I have installed openmpi 1.4 and blcr 0.8.1.
>>
>> I have written a small mpi application as follows below:
>>
>> ###
>> #include <stdio.h>
>> #include <string.h>
>> #include <limits.h>
>> #include <mpi.h>
>>
>> #define BUFFER_SIZE PIPE_BUF
>>
>> char * getprocessid()
>> {
>>   FILE * read_fp;
>>   char buffer[BUFSIZ + 1];
>>   int chars_read;
>>   char * buffer_data = "12345";
>>   memset(buffer, '\0', sizeof(buffer));
>>   read_fp = popen("uname -a", "r");
>>   /* ... */
>>   return buffer_data;
>> }
>>
>> int main(int argc, char ** argv)
>> {
>>   MPI_Status status;
>>   int rank;
>>   int size;
>>   char * thedata;
>>   MPI_Init(&argc, &argv);
>>   MPI_Comm_size(MPI_COMM_WORLD, &size);
>>   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>   thedata = getprocessid();
>>   printf(" the data is %s", thedata);
>>   MPI_Finalize();
>> }
>>
>> I get the following result:
>>
>> ###
>> jean@sunn32:~$ mpicc pipetest2.c -o pipetest2
>> jean@sunn32:~$ mpirun -np 1 -am ft-enable-cr -mca btl ^openib pipetest2
>> [sun32:19211] *** Process received signal ***
>> [sun32:19211] Signal: Segmentation fault (11)
>> [sun32:19211] Signal code: Address not mapped (1)
>> [sun32:19211] Failing at address: 0x4
>> [sun32:19211] [ 0] [0xb7f3c40c]
>> [sun32:19211] [ 1] /lib/libc.so.6(cfree+0x3b) [0xb796868b]
>> [sun32:19211] [ 2] /usr/local/blcr/lib/libcr.so.0(cri_info_free+0x2a) [0xb7a5925a]
>> [sun32:19211] [ 3] /usr/local/blcr/lib/libcr.so.0 [0xb7a5ac72]
>> [sun32:19211] [ 4] /lib/libc.so.6(__libc_fork+0x186) [0xb7991266]
>> [sun32:19211] [ 5] /lib/libc.so.6(_IO_proc_open+0x7e) [0xb7958b6e]
>> [sun32:19211] [ 6] /lib/libc.so.6(popen+0x6c) [0xb7958dfc]
>> [sun32:19211] [ 7] pipetest2(getprocessid+0x42) [0x8048836]
>> [sun32:19211] [ 8] pipetest2(main+0x4d) [0x8048897]
>> [sun32:19211] [ 9] /lib/libc.so.6(__libc_start_main+0xe5) [0xb7912455]
>> [sun32:19211] [10] pipetest2 [0x8048761]
>> [sun32:19211] *** End of error message ***
>> #
>>
>> However, if I compile the application using gcc, it works fine. The problem arises with:
>> read_fp = popen("uname -a", "r");
>>
>> Does anyone have an idea how to resolve this problem?
>>
>> Many thanks
>>
>> Jean
Re: [OMPI users] Segmentation fault (11)
I wonder if this is a bug with BLCR (since the segv stack is in the BLCR thread). Can you try a non-MPI version of this application that uses popen(), and see if BLCR properly checkpoints/restarts it? If so, we can start to see what Open MPI might be doing to confuse things, but I suspect that this might be a bug with BLCR.

Either way, let us know what you find out.

Cheers,
Josh

On Mar 27, 2010, at 6:17 AM, jody wrote:

> I'm not sure if this is the cause of your problems:
> You define the constant BUFFER_SIZE, but in the code you use a constant called BUFSIZ...
>
> Jody
>
> On Fri, Mar 26, 2010 at 10:29 PM, Jean Potsam wrote:
>
>> Dear All,
>> I am having a problem with openmpi. I have installed openmpi 1.4 and blcr 0.8.1.
>>
>> I have written a small mpi application as follows below:
>>
>> ###
>> #include <stdio.h>
>> #include <string.h>
>> #include <limits.h>
>> #include <mpi.h>
>>
>> #define BUFFER_SIZE PIPE_BUF
>>
>> char * getprocessid()
>> {
>>   FILE * read_fp;
>>   char buffer[BUFSIZ + 1];
>>   int chars_read;
>>   char * buffer_data = "12345";
>>   memset(buffer, '\0', sizeof(buffer));
>>   read_fp = popen("uname -a", "r");
>>   /* ... */
>>   return buffer_data;
>> }
>>
>> int main(int argc, char ** argv)
>> {
>>   MPI_Status status;
>>   int rank;
>>   int size;
>>   char * thedata;
>>   MPI_Init(&argc, &argv);
>>   MPI_Comm_size(MPI_COMM_WORLD, &size);
>>   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>   thedata = getprocessid();
>>   printf(" the data is %s", thedata);
>>   MPI_Finalize();
>> }
>>
>> I get the following result:
>>
>> ###
>> jean@sunn32:~$ mpicc pipetest2.c -o pipetest2
>> jean@sunn32:~$ mpirun -np 1 -am ft-enable-cr -mca btl ^openib pipetest2
>> [sun32:19211] *** Process received signal ***
>> [sun32:19211] Signal: Segmentation fault (11)
>> [sun32:19211] Signal code: Address not mapped (1)
>> [sun32:19211] Failing at address: 0x4
>> [sun32:19211] [ 0] [0xb7f3c40c]
>> [sun32:19211] [ 1] /lib/libc.so.6(cfree+0x3b) [0xb796868b]
>> [sun32:19211] [ 2] /usr/local/blcr/lib/libcr.so.0(cri_info_free+0x2a) [0xb7a5925a]
>> [sun32:19211] [ 3] /usr/local/blcr/lib/libcr.so.0 [0xb7a5ac72]
>> [sun32:19211] [ 4] /lib/libc.so.6(__libc_fork+0x186) [0xb7991266]
>> [sun32:19211] [ 5] /lib/libc.so.6(_IO_proc_open+0x7e) [0xb7958b6e]
>> [sun32:19211] [ 6] /lib/libc.so.6(popen+0x6c) [0xb7958dfc]
>> [sun32:19211] [ 7] pipetest2(getprocessid+0x42) [0x8048836]
>> [sun32:19211] [ 8] pipetest2(main+0x4d) [0x8048897]
>> [sun32:19211] [ 9] /lib/libc.so.6(__libc_start_main+0xe5) [0xb7912455]
>> [sun32:19211] [10] pipetest2 [0x8048761]
>> [sun32:19211] *** End of error message ***
>> #
>>
>> However, if I compile the application using gcc, it works fine. The problem arises with:
>> read_fp = popen("uname -a", "r");
>>
>> Does anyone have an idea how to resolve this problem?
>>
>> Many thanks
>>
>> Jean
Re: [OMPI users] questions about checkpoint/restart on multiple clusters of MPI
On Mar 29, 2010, at 11:53 AM, fengguang tian wrote:

> hi
>
> I have used the --term option, but the mpirun is still hanging. It is the same whether I include the '/' or not. I am installing v1.4 to see whether the problems are still there. I tried, but some problems are still there.

What configure options did you use when building Open MPI?

> BTW, my MPI program will have some input file, and will generate some output file after some computation. It can be checkpointed, but when restarting it, some error happened. Have you met this kind of problem?

Try putting the 'snapc_base_global_snapshot_dir' in the $HOME/.openmpi/mca-params.conf file instead of just on the command line. Like:

snapc_base_global_snapshot_dir=/shared-dir/

I suspect that ompi-restart is looking in the wrong place for your checkpoint. By default it will search $HOME (since that is the default for snapc_base_global_snapshot_dir). If you put this parameter in the mca-params.conf file, then it is always set in any tool (mpirun/ompi-checkpoint/ompi-restart) to the specified value. So ompi-restart will search the correct location for the checkpoint files.

-- Josh

> cheers
> fengguang
>
> On Mon, Mar 29, 2010 at 11:42 AM, Josh Hursey wrote:
>
>> On Mar 23, 2010, at 1:00 PM, Fernando Lemos wrote:
>>
>>> On Tue, Mar 23, 2010 at 12:55 PM, fengguang tian wrote:
>>>
>>>> I use mpirun -np 50 -am ft-enable-cr --mca snapc_base_global_snapshot_dir --hostfile .mpihostfile to store the global checkpoint snapshot into the shared directory /mirror, but the problems are still there: when ompi-checkpoint runs, the mpirun is still not killed; it is hanging there.
>>
>> So the 'ompi-checkpoint' command does not finish? By default 'ompi-checkpoint' does not terminate the MPI job. If you pass the '--term' option to it, then it will.
>>
>>>> when doing ompi-restart, it shows:
>>>>
>>>> mpiu@nimbus:/mirror$ ompi-restart ompi_global_snapshot_333.ckpt/
>>>> --------------------------------------------------------------------------
>>>> Error: The filename (ompi_global_snapshot_333.ckpt/) is invalid because either you have not provided a filename or provided an invalid filename.
>>>> Please see --help for usage.
>>>> --------------------------------------------------------------------------
>>
>> Try removing the trailing '/' in the command. The current ompi-restart is not good about differentiating between:
>>
>> ompi_global_snapshot_333.ckpt
>> and
>> ompi_global_snapshot_333.ckpt/
>>
>>> Have you tried OpenMPI 1.5? I got it to work with 1.5, but not with 1.4 (but then I didn't try 1.4 with a shared filesystem).
>>
>> I would also suggest trying v1.4 or 1.5 to see if your problems persist with these versions.
>>
>> -- Josh
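A minimal end-to-end sketch of the workflow described above (the PID 333 and the /shared-dir/ path are just the values from this thread):

$ echo 'snapc_base_global_snapshot_dir=/shared-dir/' >> $HOME/.openmpi/mca-params.conf
$ mpirun -np 4 -am ft-enable-cr ./my_app &      # note mpirun's PID, e.g. 333
$ ompi-checkpoint --term 333                    # checkpoint, then terminate the job
$ ompi-restart ompi_global_snapshot_333.ckpt    # note: no trailing slash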
Re: [OMPI users] question about checkpoint on cluster, mpirun doesn't work on cluster
hi

I solved this problem: some previous versions of directories in the cluster were not removed. After I removed them, it works fine. Thank you.

cheers
fengguang

On Mon, Mar 29, 2010 at 11:47 AM, Josh Hursey wrote:

> Does this happen when you run without '-am ft-enable-cr' (so a no-C/R run)?
>
> This will help us determine if your problem is with the C/R work or with the ORTE runtime. I suspect that there is something odd with your system that is confusing the runtime (so not a C/R problem).
>
> Have you made sure to remove the previous versions of Open MPI from all machines on your cluster before installing the new version? Sometimes problems like this come up because of mismatches in Open MPI versions on a machine.
>
> -- Josh
>
> On Mar 23, 2010, at 5:42 PM, fengguang tian wrote:
>
>> I met the same problem as in this link: http://www.open-mpi.org/community/lists/users/2009/12/11374.php
>>
>> In the link, they give a solution: use v1.4 Open MPI instead of v1.3. But I am using v1.7a1r22794 Open MPI and met the same problem. Here is what I have done:
>>
>> My cluster is composed of two machines: nimbus (master) and nimbus1 (slave). When I run mpirun -np 40 -am ft-enable-cr --hostfile .mpihostfile myapplication on nimbus, it doesn't work. It shows:
>>
>> [nimbus1:21387] opal_os_dirpath_create: Error: Unable to create the sub-directory (/tmp/openmpi-sessions-mpiu@nimbus1_0/59759) of (/tmp/openmpi-sessions-mpiu@nimbus1_0/59759/0/1), mkdir failed [1]
>> [nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: Error in file util/session_dir.c at line 106
>> [nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: Error in file util/session_dir.c at line 399
>> [nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: Error in file base/ess_base_std_orted.c at line 301
>> [nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file rml_oob_send.c at line 104
>> [nimbus1:21387] [[59759,0],1] could not get route to [[INVALID],INVALID]
>> [nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file util/show_help.c at line 602
>> [nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: Error in file ess_env_module.c at line 143
>> [nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file rml_oob_send.c at line 104
>> [nimbus1:21387] [[59759,0],1] could not get route to [[INVALID],INVALID]
>> [nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file util/show_help.c at line 602
>> [nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: Error in file runtime/orte_init.c at line 129
>> [nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file rml_oob_send.c at line 104
>> [nimbus1:21387] [[59759,0],1] could not get route to [[INVALID],INVALID]
>> [nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file util/show_help.c at line 602
>> [nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: Error in file orted/orted_main.c at line 355
>> --------------------------------------------------------------------------
>> A daemon (pid 10737) died unexpectedly with status 255 while attempting to launch so we are aborting.
>>
>> There may be more information reported by the environment (see above).
>>
>> This may be because the daemon was unable to find all the needed shared libraries on the remote node. You may set your LD_LIBRARY_PATH to have the location of the shared libraries on the remote nodes and this will automatically be forwarded to the remote nodes.
>> --------------------------------------------------------------------------
>> mpirun noticed that the job aborted, but has no info as to the process that caused that situation.
>> --------------------------------------------------------------------------
>>
>> cheers
>> fengguang
Re: [OMPI users] questions about checkpoint/restart on multiple clusters of MPI
hi

I have used the --term option, but the mpirun is still hanging. It is the same whether I include the '/' or not. I am installing v1.4 to see whether the problems are still there. I tried, but some problems are still there.

BTW, my MPI program will have some input file, and will generate some output file after some computation. It can be checkpointed, but when restarting it, some error happened. Have you met this kind of problem?

cheers
fengguang

On Mon, Mar 29, 2010 at 11:42 AM, Josh Hursey wrote:

> On Mar 23, 2010, at 1:00 PM, Fernando Lemos wrote:
>
>> On Tue, Mar 23, 2010 at 12:55 PM, fengguang tian wrote:
>>
>>> I use mpirun -np 50 -am ft-enable-cr --mca snapc_base_global_snapshot_dir --hostfile .mpihostfile to store the global checkpoint snapshot into the shared directory /mirror, but the problems are still there: when ompi-checkpoint runs, the mpirun is still not killed; it is hanging there.
>
> So the 'ompi-checkpoint' command does not finish? By default 'ompi-checkpoint' does not terminate the MPI job. If you pass the '--term' option to it, then it will.
>
>>> when doing ompi-restart, it shows:
>>>
>>> mpiu@nimbus:/mirror$ ompi-restart ompi_global_snapshot_333.ckpt/
>>> --------------------------------------------------------------------------
>>> Error: The filename (ompi_global_snapshot_333.ckpt/) is invalid because either you have not provided a filename or provided an invalid filename.
>>> Please see --help for usage.
>>> --------------------------------------------------------------------------
>
> Try removing the trailing '/' in the command. The current ompi-restart is not good about differentiating between:
>
> ompi_global_snapshot_333.ckpt
> and
> ompi_global_snapshot_333.ckpt/
>
>> Have you tried OpenMPI 1.5? I got it to work with 1.5, but not with 1.4 (but then I didn't try 1.4 with a shared filesystem).
>
> I would also suggest trying v1.4 or 1.5 to see if your problems persist with these versions.
>
> -- Josh
Re: [OMPI users] question about checkpoint on cluster, mpirun doesn't work on cluster
Does this happen when you run without '-am ft-enable-cr' (so a no-C/R run)?

This will help us determine if your problem is with the C/R work or with the ORTE runtime. I suspect that there is something odd with your system that is confusing the runtime (so not a C/R problem).

Have you made sure to remove the previous versions of Open MPI from all machines on your cluster before installing the new version? Sometimes problems like this come up because of mismatches in Open MPI versions on a machine.

-- Josh

On Mar 23, 2010, at 5:42 PM, fengguang tian wrote:

> I met the same problem as in this link: http://www.open-mpi.org/community/lists/users/2009/12/11374.php
>
> In the link, they give a solution: use v1.4 Open MPI instead of v1.3. But I am using v1.7a1r22794 Open MPI and met the same problem. Here is what I have done:
>
> My cluster is composed of two machines: nimbus (master) and nimbus1 (slave). When I run mpirun -np 40 -am ft-enable-cr --hostfile .mpihostfile myapplication on nimbus, it doesn't work. It shows:
>
> [nimbus1:21387] opal_os_dirpath_create: Error: Unable to create the sub-directory (/tmp/openmpi-sessions-mpiu@nimbus1_0/59759) of (/tmp/openmpi-sessions-mpiu@nimbus1_0/59759/0/1), mkdir failed [1]
> [nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: Error in file util/session_dir.c at line 106
> [nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: Error in file util/session_dir.c at line 399
> [nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: Error in file base/ess_base_std_orted.c at line 301
> [nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file rml_oob_send.c at line 104
> [nimbus1:21387] [[59759,0],1] could not get route to [[INVALID],INVALID]
> [nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file util/show_help.c at line 602
> [nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: Error in file ess_env_module.c at line 143
> [nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file rml_oob_send.c at line 104
> [nimbus1:21387] [[59759,0],1] could not get route to [[INVALID],INVALID]
> [nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file util/show_help.c at line 602
> [nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: Error in file runtime/orte_init.c at line 129
> [nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file rml_oob_send.c at line 104
> [nimbus1:21387] [[59759,0],1] could not get route to [[INVALID],INVALID]
> [nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file util/show_help.c at line 602
> [nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: Error in file orted/orted_main.c at line 355
> --------------------------------------------------------------------------
> A daemon (pid 10737) died unexpectedly with status 255 while attempting to launch so we are aborting.
>
> There may be more information reported by the environment (see above).
>
> This may be because the daemon was unable to find all the needed shared libraries on the remote node. You may set your LD_LIBRARY_PATH to have the location of the shared libraries on the remote nodes and this will automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
> mpirun noticed that the job aborted, but has no info as to the process that caused that situation.
> --------------------------------------------------------------------------
>
> cheers
> fengguang
Re: [OMPI users] questions about checkpoint/restart on multiple clusters of MPI
On Mar 23, 2010, at 1:00 PM, Fernando Lemos wrote:

> On Tue, Mar 23, 2010 at 12:55 PM, fengguang tian wrote:
>
>> I use mpirun -np 50 -am ft-enable-cr --mca snapc_base_global_snapshot_dir --hostfile .mpihostfile to store the global checkpoint snapshot into the shared directory /mirror, but the problems are still there: when ompi-checkpoint runs, the mpirun is still not killed; it is hanging there.

So the 'ompi-checkpoint' command does not finish? By default 'ompi-checkpoint' does not terminate the MPI job. If you pass the '--term' option to it, then it will.

>> when doing ompi-restart, it shows:
>>
>> mpiu@nimbus:/mirror$ ompi-restart ompi_global_snapshot_333.ckpt/
>> --------------------------------------------------------------------------
>> Error: The filename (ompi_global_snapshot_333.ckpt/) is invalid because either you have not provided a filename or provided an invalid filename.
>> Please see --help for usage.
>> --------------------------------------------------------------------------

Try removing the trailing '/' in the command. The current ompi-restart is not good about differentiating between:

ompi_global_snapshot_333.ckpt
and
ompi_global_snapshot_333.ckpt/

> Have you tried OpenMPI 1.5? I got it to work with 1.5, but not with 1.4 (but then I didn't try 1.4 with a shared filesystem).

I would also suggest trying v1.4 or 1.5 to see if your problems persist with these versions.

-- Josh
Re: [OMPI users] Meaning and the significance of MCA parameter "opal_cr_use_thread"
So the MCA parameter that you mention is explained at the link below:
http://osl.iu.edu/research/ft/ompi-cr/api.php#mca-opal_cr_use_thread

This enables/disables the C/R thread at runtime, if Open MPI was configured with C/R thread support:
http://osl.iu.edu/research/ft/ompi-cr/api.php#conf-enable-ft-thread

The C/R thread enables asynchronous processing of checkpoint requests when the application process is not inside the MPI library. The purpose of this thread is to improve the responsiveness of the checkpoint operation. Without the thread, if the application is in a computation loop, then the checkpoint will be delayed until the process enters the MPI library. With the thread enabled, the checkpoint will start in the C/R thread if the application is not in the MPI library.

The primary advantages of the C/R thread are:
- response time to the C/R request, since the checkpoint is not delayed until the process enters the MPI library;
- asynchronous processing of the checkpoint while the application is executing outside the MPI library (improves the checkpoint overhead experienced by the process).

The primary disadvantage of the C/R thread is the additional processing task running in parallel with the application. If the C/R thread is polling too often, it could slow down the main process by forcing frequent context switches between the C/R thread and the main execution thread. You can adjust the aggressiveness by adjusting the parameters at the link below:
http://osl.iu.edu/research/ft/ompi-cr/api.php#mca-opal_cr_thread_sleep_check

-- Josh

On Mar 24, 2010, at 11:24 AM, wrote:

> The description for MCA parameter "opal_cr_use_thread" is very short at this URL:
> http://osl.iu.edu/research/ft/ompi-cr/api.php
>
> Can someone explain the usefulness of enabling this parameter vs disabling it? In other words, what are the pros/cons of disabling it?
>
> I found that this gets enabled automatically when the openmpi library is configured with the --ft-enable-threads option.
>
> Thanks
> Ananda
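A sketch of applying these knobs for a single run on the mpirun command line (the application name and values are illustrative):

$ mpirun -np 4 -am ft-enable-cr -mca opal_cr_use_thread 0 ./my_app            # disable the C/R thread entirely
$ mpirun -np 4 -am ft-enable-cr -mca opal_cr_thread_sleep_wait 1000 ./my_app  # or keep it, but make it poll less aggressively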
Re: [OMPI users] mpirun with -am ft-enable-cr option takes longer time on certain configurations
On Mar 20, 2010, at 11:14 PM, wrote:

> I am observing a very strange performance issue with my openmpi program.
>
> I have a compute-intensive openmpi-based application that keeps the data in memory, processes the data, and then dumps it to a GPFS parallel file system. The GPFS file system server is connected to a QDR infiniband switch from Voltaire.
>
> If my cluster is connected to a DDR infiniband switch, which in turn connects to the file system server on the QDR switch, I see that I can run my application under checkpoint/restart control (with -am ft-enable-cr), I can checkpoint (ompi-checkpoint) successfully, and the application gets completed after a few additional seconds.
>
> If my cluster is connected to the same QDR switch which connects to the file system server, I see that my application takes close to 10x time to complete if I run it under checkpoint/restart control (with -am ft-enable-cr). If I run the same application using a plain mpirun command (ie; without -am ft-enable-cr), it finishes within a minute.

The 10x slowdown is without taking a checkpoint, correct?

If the checkpoint is taking up part of the bandwidth through the same switch you are communicating with, then you will see diminished performance until the checkpoint is fully established on the storage device(s). Many installations separate the communication and storage networks (or limit the bandwidth of one of them) to prevent one from unexpectedly diminishing the performance of the other, even outside of the C/R context.

However, for a non-checkpointing run to be 10x slower is certainly not normal. Try playing with the C/R thread parameters (mentioned in a previous email) and see if that helps. If not, we might be able to try other things.

-- Josh

> I am using open mpi 1.3.4 and BLCR 0.8.2 for checkpointing.
>
> Are there any specific MCA parameters that I should tune to address this problem? Any other pointers will be really helpful.
>
> Thanks
> Anand
Re: [OMPI users] mpirun with -am ft-enable-cr option runs slow if hyperthreading is disabled
On Mar 22, 2010, at 4:41 PM, wrote:

> Hi
>
> If I run my compute-intensive openmpi-based program using a regular invocation of mpirun (ie; mpirun -host <hosts> -np <no of cores>), it gets completed in a few seconds, but if I run the same program with the "-am ft-enable-cr" option, the program takes 10x time to complete.
>
> If I enable hyperthreading on my cluster nodes and then call mpirun with the "-am ft-enable-cr" option, the program gets completed with only a few additional seconds over the normal mpirun!!
>
> How can I improve the performance of mpirun with the "-am ft-enable-cr" option when I disable hyperthreading on my cluster nodes? Any pointers will be really useful.
>
> FYI, I am using the openmpi 1.3.4 library and BLCR 0.8.2. Cluster nodes are Nehalem-based nodes with 8 cores.

I have not done any performance studies focused on hyperthreading, so I cannot say specifically what is happening. The 10x slowdown is certainly unexpected (I don't see this in my testing). There usually is a small slowdown (a few microseconds) because of the message-tracking technique used to support the checkpoint coordination protocol.

I suspect that the cause of your problem is the C/R thread, which is probably too aggressive for your system. The improvement with hyperthreading may be that this thread is able to sit on one of the hardware threads and not completely steal the CPU from the main application. You can change how aggressive the thread is by adjusting the two parameters below:

http://osl.iu.edu/research/ft/ompi-cr/api.php#mca-opal_cr_thread_sleep_check
http://osl.iu.edu/research/ft/ompi-cr/api.php#mca-opal_cr_thread_sleep_wait

I usually set the latter to:
opal_cr_thread_sleep_wait=1000

Give that a try and let me know if that helps. You might also try upgrading to the 1.4 series, or even the upcoming v1.5.0 release, and see if the problem persists there.

-- Josh

> Thanks
> Anand
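To make the suggested value stick across runs rather than typing it each time, it can go in the per-user parameter file (a sketch; the file location follows the convention mentioned elsewhere in this thread):

$ echo 'opal_cr_thread_sleep_wait=1000' >> $HOME/.openmpi/mca-params.conf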
Re: [OMPI users] top command output shows huge CPU utilization when openmpi processes resume after the checkpoint
On Mar 21, 2010, at 12:58 PM, Addepalli, Srirangam V wrote:

> Yes. We have seen this behavior too. Another behavior I have seen is that one MPI process starts to show a different elapsed time than its peers. Is it because a checkpoint happened on behalf of this process?
>
> R
>
> From: users-boun...@open-mpi.org [users-boun...@open-mpi.org] On Behalf Of ananda.mu...@wipro.com [ananda.mu...@wipro.com]
> Sent: Saturday, March 20, 2010 10:18 PM
> To: us...@open-mpi.org
> Subject: [OMPI users] top command output shows huge CPU utilization when openmpi processes resume after the checkpoint
>
>> When I checkpoint my openmpi application using ompi_checkpoint, I see that the top command suddenly shows some really huge numbers in the "CPU %" field, such as 150%, 200%, etc. After some time, these numbers do come back to the normal numbers under 100%. This happens exactly around the time the checkpoint is completed and the processes are resuming execution.

One cause for this type of CPU utilization is the C/R thread. During non-checkpoint/normal processing, the thread polls for a checkpoint fairly aggressively. You can change how aggressive the thread is by adjusting the two parameters below:

http://osl.iu.edu/research/ft/ompi-cr/api.php#mca-opal_cr_thread_sleep_check
http://osl.iu.edu/research/ft/ompi-cr/api.php#mca-opal_cr_thread_sleep_wait

I usually set the latter to:
opal_cr_thread_sleep_wait=1000

You can also turn off the C/R thread, either by configure'ing without it, or by disabling it at runtime by setting the 'opal_cr_use_thread' parameter to '0':

http://osl.iu.edu/research/ft/ompi-cr/api.php#mca-opal_cr_use_thread

The CPU increase during the checkpoint may be due to both the Open MPI C/R thread and the BLCR thread becoming active on the machine. You might try to determine whether this is BLCR's CPU utilization or Open MPI's by creating a single-process application and watching the CPU utilization when checkpointing with BLCR. You may also want to look at the memory consumption of the process to make sure that there is enough for BLCR to run efficiently.

This may also be due to processes that have finished their checkpoint waiting on peer processes to finish. I don't think we have a good way to control how aggressively these waiting processes poll for completion of peers. If this becomes a problem, we can look into adding a parameter similar to opal_cr_thread_sleep_wait to throttle the polling on the machine. The disadvantage of making the various polling-for-completion loops less aggressive is that the checkpoint may stall the checkpoint and/or application for a little longer than necessary. But if this is acceptable to the user, then they can adjust the MCA parameters as necessary.

> Another behavior I have seen is that one MPI process starts to show a different elapsed time than its peers. Is it because a checkpoint happened on behalf of this process?

Can you explain a bit more about what you mean by this? Neither Open MPI nor BLCR messes with the timer on the machine, so we are not changing it in any way. The process is 'stopped' briefly while BLCR takes the checkpoint, so this will extend the running time of the process. How much the running time is extended (a.k.a. checkpoint overhead) is determined by a bunch of things, but primarily by the storage device(s) that the checkpoint is being written to.

> For your reference, I am using open mpi 1.3.4 and BLCR 0.8.2 for checkpointing.

It would be interesting to know if you see the same behavior with the trunk or v1.5 series of Open MPI.

Hope that helps,
Josh

> Thanks
> Anand
Re: [OMPI users] configuration and compilation outputs
I don't see -static listed in the config.log at all, but I see it listed in the make output that you sent in the first mail. Additionally, the make output that you sent in your mail doesn't seem to match the make.output that you attached in your last email. Are you mixing and matching multiple builds by accident, perchance?

FWIW, it's typically best to set flags in configure via the configure command line, like this:

./configure CFLAGS=-static etc...

rather than setenv'ing them before running configure. The (minor) advantage of this is that all the flags are then recorded in the config.log file. If you setenv them, then config.log doesn't show everything.

On Mar 29, 2010, at 9:02 AM, Philippe GOURET wrote:

--
Jeff Squyres
jsquy...@cisco.com
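A sketch of the suggested approach (the install prefix is illustrative); because the flags are passed on the configure command line, config.log records them:

$ ./configure CFLAGS=-static LDFLAGS=-static --prefix=$HOME/openmpi-static
$ grep './configure' config.log | head -1    # the full invocation, flags included, is logged here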
[OMPI users] configuration and compilation outputs
[Attachment: outputs.tar.gz]
Re: [OMPI users] LAM: static
(moving to the Open MPI user's mailing list...)

Can you send all the information listed here (please compress!):

http://www.open-mpi.org/community/help/

On Mar 29, 2010, at 8:21 AM, Philippe GOURET wrote:

> The make failed with Open MPI:
>
> Making all in tools/wrappers
> make[2]: Entering directory `/home/philippe/tmp/openmpi-1.4.1/opal/tools/wrappers'
> depbase=`echo opal_wrapper.o | sed 's|[^/]*$|.deps/&|;s|\.o$||'`;\
> gcc "-DEXEEXT=\"\"" -I. -I../../../opal/include -I../../../orte/include -I../../../ompi/include -I../../../opal/mca/paffinity/linux/plpa/src/libplpa -I../../.. -static -O3 -DNDEBUG -static -finline-functions -fno-strict-aliasing -pthread -fvisibility=hidden -MT opal_wrapper.o -MD -MP -MF $depbase.Tpo -c -o opal_wrapper.o opal_wrapper.c &&\
> mv -f $depbase.Tpo $depbase.Po
> /bin/sh ../../../libtool --tag=CC --mode=link gcc -O3 -DNDEBUG -static -finline-functions -fno-strict-aliasing -pthread -fvisibility=hidden -export-dynamic -static -o opal_wrapper opal_wrapper.o ../../../opal/libopen-pal.la -lnsl -lutil -lm
> libtool: link: gcc -O3 -DNDEBUG -finline-functions -fno-strict-aliasing -pthread -fvisibility=hidden -o opal_wrapper opal_wrapper.o -Wl,--export-dynamic ../../../opal/.libs/libopen-pal.a -ldl -lnsl -lutil -lm -pthread
> ../../../opal/.libs/libopen-pal.a(opal_ptmalloc2_munmap.o): In function `opal_mem_free_ptmalloc2_munmap':
> opal_ptmalloc2_munmap.c:(.text+0x42): undefined reference to `__munmap'
> ../../../opal/.libs/libopen-pal.a(opal_ptmalloc2_munmap.o): In function `munmap':
> opal_ptmalloc2_munmap.c:(.text+0x8d): undefined reference to `__munmap'
> collect2: ld returned 1 exit status
> make[2]: *** [opal_wrapper] Error 1
> make[2]: Leaving directory `/home/philippe/tmp/openmpi-1.4.1/opal/tools/wrappers'
> make[1]: *** [all-recursive] Error 1
> make[1]: Leaving directory `/home/philippe/tmp/openmpi-1.4.1/opal'
> make: *** [all-recursive] Error 1
>
> Do you know why?
>
> Moreover, like with lam, the final gcc call doesn't have the "-static" option, but the "libtool" line has it!
>
> Thanks
>
>> From: Jeff Squyres
>> Sent: Mon Mar 29 14:01:50 CEST 2010
>> To: Philippe GOURET, General LAM/MPI mailing list
>> Subject: Re: LAM: static
>>
>> It could be that the Libtool included in LAM/MPI is so old that it is not passing -static through properly...?
>>
>> Is it possible for you to upgrade to Open MPI?
>>
>> On Mar 29, 2010, at 7:42 AM, Philippe GOURET wrote:
>>
>>> Hi
>>>
>>> I need to deploy a 32-bit version of lam-7.1.4 on a 64-bit computer. So I would like to build lam-7.1.4 in a static way. I just added the "-static" option to some environment variables: CFLAGS, LDFLAGS, CXXLDFLAGS, CXXFLAGS, but when I verify the built runtimes with the ldd command, I always see:
>>>
>>> linux-gate.so.1 => (0xe000)
>>> libdl.so.2 => /lib/libdl.so.2 (0xb7f6e000)
>>> libutil.so.1 => /lib/libutil.so.1 (0xb7f6a000)
>>> libpthread.so.0 => /lib/i686/libpthread.so.0 (0xb7f53000)
>>> libc.so.6 => /lib/i686/libc.so.6 (0xb7e13000)
>>> /lib/ld-linux.so.2 (0xb7f89000)
>>>
>>> If I look at the "make" trace, for example for the lamboot runtime, I see:
>>>
>>> ...
>>> Making all in lamboot
>>> make[2]: Entering directory `/home/philippe/tmp/lam-7.1.4/tools/lamboot'
>>> if gcc -DHAVE_CONFIG_H -I. -I. -I../../share/include -DLAM_SYSCONFDIR="\"/usr/local/etc\"" -DBOOT_MODULES="\"globus rsh slurm\"" -DRPI_MODULES="\"crtcp lamd sysv tcp usysv\"" -DCOLL_MODULES="\"lam_basic shmem smp\"" -I../../share/include -static -DLAM_BUILDING=1 -O3 -static -pthread -MT lamboot.o -MD -MP -MF ".deps/lamboot.Tpo" -c -o lamboot.o lamboot.c; \
>>> then mv -f ".deps/lamboot.Tpo" ".deps/lamboot.Po"; else rm -f ".deps/lamboot.Tpo"; exit 1; fi
>>> /bin/sh ../../libtool --tag=CC --mode=link gcc -O3 -static -pthread -static -o lamboot lamboot.o ../../share/liblam/liblam.la -lutil
>>> mkdir .libs
>>> gcc -O3 -pthread -o lamboot lamboot.o ../../share/liblam/.libs/liblam.a -ldl -lutil
>>> make[2]: Leaving directory `/home/philippe/tmp/lam-7.1.4/tools/lamboot'
>>> ...
>>>
>>> Did I miss something?
>>>
>>> Thanks in advance
>>>
>>> Best regards
>>> Philippe Gouret

--
Jeff Squyres
jsquy...@cisco.com
Re: [OMPI users] mpi.h file is missing in openmpi
Hi Reuti,

Thank you so much. I installed openmpi locally from the source. It has all the header files in the include folder. I could install CHARMM without any problem.

Best regards,
Sunita

> Hi,
>
> On 25.03.2010, at 11:30, sun...@chem.iitb.ac.in wrote:
>
>> Openmpi is installed on an Intel Xeon quad-core 2.4GHz machine loaded with Red Hat Enterprise Linux 5. The loaded openmpi version is 1.2.5. While trying to install the CHARMM software, it asked for the path of the mpi.h file and the library files (libmpi). I didn't find an 'include' folder in the openmpi folder containing the header files like mpi.h. However, it contains 'bin', 'etc', 'lib' and 'share' sub-folders.
>
> Maybe only the runtime package and not the developer package was installed. Due to the ancient version you have, I would suggest downloading the current source and installing on your own with:
>
> $ ./configure --prefix=/home/patel/local/openmpi-1.4.1
>
> and after a "make" and "make install" you can access the current header and library files. For CHARMM it might be necessary to supply the path to the include files with -I/home/patel/local/openmpi-1.4.1/include in CFLAGS, and the path to your libs in LDFLAGS with -L/home/patel/local/openmpi-1.4.1/lib (names may be different in CHARMM though).
>
> As long as you built the dynamic version, at runtime it's also necessary to export LD_LIBRARY_PATH=/home/patel/local/openmpi-1.4.1/lib:$LD_LIBRARY_PATH
>
> -- Reuti
>
>> It looks like the mpi.h file does not exist. Which version of openmpi has the mpi.h header file?
>>
>> Any help would be appreciated.
>>
>> Regards,
>> Dr. Sunita Patel
>> -
>> Visiting Fellow
>> Department of Chemical Sciences
>> T.I.F.R., Homi Bhabha Road, Colaba
>> Mumbai - 45
>> -
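For anyone hunting for the right -I/-L paths for a given Open MPI install, the wrapper compiler can report them itself (a sketch; the prefix is the one used in the quoted message):

$ /home/patel/local/openmpi-1.4.1/bin/mpicc -showme:compile   # prints the include flags
$ /home/patel/local/openmpi-1.4.1/bin/mpicc -showme:link      # prints the library flags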