Re: [OMPI users] users Digest, Vol 1658, Issue 2

2010-08-13 Thread ananda.mudar
Josh I am having problems compiling the sources from the latest trunk. It complains of libgomp.spec missing even though that file exists on my system. I will see if I have to change any other environment variables to have a successful compilation. I will keep you posted. BTW, were you successful

Re: [OMPI users] Checkpointing mpi4py program

2010-08-13 Thread ananda.mudar
Josh I have stack traces of all 8 python processes when I observed the hang after successful completion of checkpoint. They are in the attached document. Please see if these stack traces provide any clue. Thanks Ananda From: Ananda Babu Mudar (WT01 - Energy

Re: [OMPI users] Checkpointing mpi4py program

2010-08-16 Thread ananda.mudar
Josh I tried running the mpi4py program with the latest trunk version of openmpi. I have compiled openmpi-1.7a1r23596 from trunk and recompiled mpi4py to use this library. Unfortunately, I see the same behavior as with openmpi 1.4.2, i.e., the checkpoint succeeds but the program

Re: [OMPI users] Checkpointing mpi4py program

2010-08-16 Thread ananda.mudar
Josh I have one more update on my observation while analyzing this issue. Just to refresh, I am using openmpi-trunk release 23596 with mpi4py-1.2.1 and BLCR 0.8.2. When I checkpoint the python script written using mpi4py, the program doesn't progress after the checkpoint is taken

Re: [OMPI users] Checkpointing mpi4py program (Probably bcast issue)

2010-08-18 Thread ananda.mudar
Josh Thanks for addressing the issue. I will try the new version that has your fix and let you know. BTW, I have been in touch with mpi4py team also to debug this issue. According to mpi4py team, MPI_Bcast() is implemented with two collective calls: First one with MPI_Bcast() of single

Re: [OMPI users] Checkpointing mpi4py program (Probably bcast issue)

2010-08-20 Thread ananda.mudar
Josh I have a few more observations that I want to share with you. I modified the earlier C program a little by making two MPI_Bcast() calls inside a while loop for 10 seconds. The issue of MPI_Bcast() failing with the ERR_TRUNCATE error message resurfaces when I call checkpoint on this program.

[OMPI users] MPI_Bcast() Vs paired MPI_Send() & MPI_Recv()

2010-09-01 Thread ananda.mudar
Hi If I replace MPI_Bcast() with paired MPI_Send() and MPI_Recv() calls, what kind of impact does it have on the performance of the program? Are there any benchmarks of MPI_Bcast() vs paired MPI_Send() and MPI_Recv()? Thanks Ananda
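
As a rough way to frame the question above, the sketch below compares the number of sequential communication steps for two strategies. It is a back-of-the-envelope model, not a benchmark: it assumes the paired MPI_Send()/MPI_Recv() version has the root send to each of the other ranks one after another, and it uses a binomial tree as one common broadcast strategy. Which algorithm Open MPI actually selects for MPI_Bcast() depends on the build and runtime settings, and the function names here are hypothetical.

```python
def root_send_steps(nprocs):
    """Sequential sends when the root sends to every other rank in turn."""
    return max(nprocs - 1, 0)


def binomial_tree_rounds(nprocs):
    """Rounds for a binomial-tree broadcast: ceil(log2(nprocs)).

    (nprocs - 1).bit_length() computes this exactly with integers.
    """
    return (nprocs - 1).bit_length() if nprocs > 1 else 0


if __name__ == "__main__":
    for nprocs in (2, 8, 64):
        print(nprocs, root_send_steps(nprocs), binomial_tree_rounds(nprocs))
```

The model counts steps only. It ignores message size, network topology, and latency, all of which affect real performance.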

[OMPI users] Question on staging in checkpoint

2010-09-13 Thread ananda.mudar
Hi I was trying out the staging option in checkpoint where I save the checkpoint image in local file system and have the image transferred to global filesystem in the background. As part of the background process I see that the "scp" command is launched to transfer the images from local file

[OMPI users] mpirun with -am ft-enable-cr option takes longer time on certain configurations

2010-03-21 Thread ananda.mudar
I am observing a very strange performance issue with my openmpi program. I have a compute-intensive openmpi-based application that keeps the data in memory, processes the data, and then dumps it to a GPFS parallel file system. The GPFS parallel file system server is connected to a QDR infiniband switch

[OMPI users] top command output shows huge CPU utilization when openmpi processes resume after the checkpoint

2010-03-21 Thread ananda.mudar
When I checkpoint my openmpi application using ompi_checkpoint, I see that the top command suddenly shows some really huge numbers in the "CPU %" field, such as 150%, 200%, etc. After some time, these numbers come back to normal values under 100%. This happens exactly around the time checkpoint is

[OMPI users] mpirun with -am ft-enable-cr option runs slow if hyperthreading is disabled

2010-03-22 Thread ananda.mudar
Hi If I run my compute-intensive openmpi-based program using a regular invocation of mpirun (i.e., mpirun -host -np ), it completes in a few seconds, but if I run the same program with the "-am ft-enable-cr" option, the program takes 10x as long to complete. If I enable hyperthreading on my

[OMPI users] Meaning and the significance of MCA parameter "opal_cr_use_thread"

2010-03-24 Thread ananda.mudar
The description for MCA parameter "opal_cr_use_thread" is very short at URL: http://osl.iu.edu/research/ft/ompi-cr/api.php Can someone explain the usefulness of enabling this parameter vs disabling it? In other words, what are pros/cons of disabling it? I found that this gets enabled

[OMPI users] ompi-checkpoint fails sometimes

2010-05-11 Thread ananda.mudar
Hi I am using open-mpi 1.3.4 with BLCR. Sometimes I am running into a strange problem with ompi-checkpoint command. Even though I see that all MPI processes (equal to np argument) are running, ompi-checkpoint command fails at times. I have seen this failure always when the MPI processes spawned

[OMPI users] opal_cr_tmp_dir

2010-05-12 Thread ananda.mudar
I am setting the MCA parameter "opal_cr_tmp_dir" to a directory other than /tmp while calling "mpirun", "ompi-restart", and "ompi-checkpoint" commands so that I don't fill up /tmp filesystem. But I see that openmpi-sessions* directory is still getting created under /tmp. How do I overcome this

Re: [OMPI users] opal_cr_tmp_dir

2010-05-12 Thread ananda.mudar
Thanks Ralph. Another question. Even though I am setting opal_cr_tmp_dir to a directory other than /tmp while calling ompi-restart command, this setting is not getting passed to the mpirun command that gets generated by ompi-restart. How do I overcome this constraint? Thanks Ananda

Re: [OMPI users] opal_cr_tmp_dir

2010-05-12 Thread ananda.mudar
Ralph I have these parameters set in ~/.openmpi/mca-params.conf file $ cat ~/.openmpi/mca-params.conf orte_tmpdir_base = /home/ananda/ORTE opal_cr_tmp_dir = /home/ananda/OPAL $ Should I be setting OMPI_MCA_opal_cr_tmp_dir? FYI, I am using openmpi 1.3.4 with blcr 0.8.2 Thanks Ananda
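
The question above turns on how a parameter from mca-params.conf maps to an environment variable. As a minimal sketch, assuming the `OMPI_MCA_` prefix convention that the message itself refers to, the file contents shown above could be converted as follows. The helper `mca_conf_to_env` is hypothetical and not part of Open MPI. Whether ompi-restart actually honors these variables is the open question in this thread.

```python
def mca_conf_to_env(conf_text):
    """Map 'name = value' lines to OMPI_MCA_<name> environment entries."""
    env = {}
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or "=" not in line:
            continue  # skip blank or malformed lines
        name, value = line.split("=", 1)
        env["OMPI_MCA_" + name.strip()] = value.strip()
    return env


# Same two settings as in the mca-params.conf quoted above.
conf = """
orte_tmpdir_base = /home/ananda/ORTE
opal_cr_tmp_dir = /home/ananda/OPAL
"""

env = mca_conf_to_env(conf)
for key in sorted(env):
    print(f"export {key}={env[key]}")
```

The printed lines can be pasted into a shell before calling mpirun, ompi-checkpoint, or ompi-restart.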

[OMPI users] (no subject)

2010-05-12 Thread ananda.mudar
Ralph When you say manually, do you mean setting these parameters in the command line while calling mpirun, ompi-restart, and ompi-checkpoint? Or is there another way to set these parameters? Thanks Ananda == Subject: Re: [OMPI users] opal_cr_tmp_dir From: Ralph Castain

Re: [OMPI users] opal_cr_tmp_dir

2010-05-12 Thread ananda.mudar
Ralph When you say manually, do you mean setting these parameters in the command line while calling mpirun, ompi-restart, and ompi-checkpoint? Or is there another way to set these parameters? Thanks Ananda == Subject: Re: [OMPI users] opal_cr_tmp_dir From: Ralph Castain

Re: [OMPI users] opal_cr_tmp_dir

2010-05-13 Thread ananda.mudar
Ralph Defining these parameters in my environment also did not resolve the problem. Whenever I restart my program, the temporary files are getting stored in the default /tmp directory instead of the directory I had defined. Thanks Ananda = Subject: Re: [OMPI users]

[OMPI users] ompi-restart fails with "found pid in use"

2010-05-14 Thread ananda.mudar
Hi I am using open mpi v1.3.4 with BLCR 0.8.2. I have been testing my openmpi-based program on a 3-node cluster (each node is an Intel Nehalem-based dual quad-core) and I have been able to checkpoint and restart the program successfully multiple times. Recently I moved to a 15

Re: [OMPI users] opal_cr_tmp_dir

2010-05-18 Thread ananda.mudar
That's correct. I have prefixed them with OMPI_MCA_ when I defined them in my environment. Despite that I still see some of these files being created under the default directory /tmp which is different from what I had set. Thanks Ananda From: Josh Hursey

[OMPI users] Checkpointing mpi4py program

2010-08-09 Thread ananda.mudar
Hi I have integrated mpi4py with openmpi 1.4.2 that was built with BLCR 0.8.2. When I run ompi-checkpoint on the program written using mpi4py, I see that the program sometimes doesn't resume after successful checkpoint creation. This doesn't always occur, meaning the program resumes after successful