[OMPI users] Question on staging in checkpoint

2010-09-13 Thread ananda.mudar
Hi



I was trying out the staging option in checkpointing, where I save the
checkpoint image on the local file system and have the image transferred
to the global file system in the background. As part of the background
process, I see that "scp" commands are launched to transfer the images
from the local file system to the global file system. I am using
openmpi-1.5rc6 with BLCR 0.8.2.



In my experiment, about 128 cores saved their respective checkpoint
images on the local file system. During the background transfer, I see
that only 10 "scp" requests are issued at a time. Is this a configurable
parameter? Since these commands run on their respective nodes, how can I
launch all 128 scp requests (one per image in my experiment)
simultaneously?
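
If the background copies are driven by Open MPI's FileM framework (my
assumption about where the staging transfer is implemented, worth
verifying against 1.5rc6), the available knobs and their defaults could
be listed with ompi_info before trying to raise any limit, along these
lines:

  $ ompi_info --param filem all            # list FileM MCA parameters and defaults
  $ mpirun -am ft-enable-cr --mca <filem param> 128 ...   # raise the limit, if one is listed

Here <filem param> is a placeholder: I am not naming a specific
parameter because the exact name may differ between versions.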



Thanks

Ananda




[OMPI users] MPI_Bcast() Vs paired MPI_Send() & MPI_Recv()

2010-09-01 Thread ananda.mudar
Hi



If I replace MPI_Bcast() with paired MPI_Send() and MPI_Recv() calls,
what kind of impact does it have on the performance of the program? Are
there any benchmarks of MPI_Bcast() vs paired MPI_Send() and
MPI_Recv()?
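
To make the comparison concrete, the paired version I have in mind is a
root loop along the lines of the sketch below (my illustration, not a
tuned implementation). Since MPI_Bcast() is typically implemented with
tree or pipelined algorithms that complete in roughly log(P) steps, I
would expect this loop, which serializes P-1 sends at the root, to scale
noticeably worse:

  #include <mpi.h>

  /* Sketch: emulate a broadcast with paired point-to-point calls.
   * The root issues (size - 1) sends one after another, so the cost at
   * the root grows linearly with the communicator size. */
  static void paired_bcast(void *buf, int count, MPI_Datatype type,
                           int root, MPI_Comm comm)
  {
      int rank, size, r;
      MPI_Comm_rank(comm, &rank);
      MPI_Comm_size(comm, &size);
      if (rank == root) {
          for (r = 0; r < size; r++)
              if (r != root)
                  MPI_Send(buf, count, type, r, 0, comm);
      } else {
          MPI_Recv(buf, count, type, root, 0, comm, MPI_STATUS_IGNORE);
      }
  }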



Thanks

Ananda




Re: [OMPI users] Checkpointing mpi4py program (Probably bcast issue)

2010-08-20 Thread ananda.mudar
Josh

I have a few more observations that I want to share with you.

I modified the earlier C program slightly by making two MPI_Bcast()
calls inside the while loop, every 10 seconds. The issue of MPI_Bcast()
failing with an ERR_TRUNCATE error message resurfaces when I checkpoint
this program. Interestingly, the two MPI_Bcast() calls broadcast
different data types, i.e., the first broadcasts an integer variable and
the second a float variable.

If I make these two MPI_Bcast() calls broadcast the same data type,
i.e., either two different integer variables one after another or two
different float variables one after another, the program continues
successfully. The checkpoint command succeeds every time and the program
resumes after each successful checkpoint.
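
In outline, the failing pattern looks like this (a simplified sketch;
the attached program is the authoritative version):

  #include <mpi.h>
  #include <unistd.h>

  /* Sketch of the reproducer: two broadcasts of *different* types per
   * iteration, looping forever so a checkpoint can be taken mid-stream. */
  int main(int argc, char **argv)
  {
      int   ival = 0;
      float fval = 0.0f;
      MPI_Init(&argc, &argv);
      while (1) {
          MPI_Bcast(&ival, 1, MPI_INT,   0, MPI_COMM_WORLD);  /* integer    */
          MPI_Bcast(&fval, 1, MPI_FLOAT, 0, MPI_COMM_WORLD);  /* then float */
          sleep(10);
      }
      MPI_Finalize();  /* not reached; the loop above never exits */
      return 0;
  }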

When MPI_Bcast() failed with the ERR_TRUNCATE error message, I captured
the output after setting "--mca crcp_base_verbose 20 --mca
orte_debug_verbose 20". I have filtered out the messages before and
after the point where the error occurred, so the log is free of clutter.
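
The full command line was of this general form (a sketch; the process
count and program name are placeholders):

  mpirun -am ft-enable-cr -np <nprocs> \
      --mca crcp_base_verbose 20 --mca orte_debug_verbose 20 ./<program>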

I am attaching both the C program (slightly modified from the earlier
one I shared with you) and the filtered output log file with this
thread. I hope these messages show you something about what might be
going wrong.

Please let me know if you need any additional information on this issue.

Thanks
Ananda


Sent: Wed 8/18/2010 4:43 PM
To: 'us...@open-mpi.org'
Subject: Re: [OMPI users] Checkpointing mpi4py program (Probably bcast issue)



Josh



Thanks for addressing the issue. I will try the new version that has your fix 
and let you know.



BTW, I have also been in touch with the mpi4py team to debug this issue.
According to the mpi4py team, its broadcast is implemented with two
collective calls: first an MPI_Bcast() of a single integer, and then an
MPI_Bcast() of a chunk of memory. Since the problem I was running into
occurred during MPI_Bcast() calls, I mimicked the mpi4py logic and wrote
a program in C. I have attached it to this mail for your reference.
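
In outline, the mimicked logic is the following (a simplified sketch;
the attached C program is the authoritative version):

  #include <mpi.h>
  #include <stdlib.h>

  /* Sketch of the mpi4py-style broadcast: a size header is broadcast
   * first as a single integer, then the payload itself, as two separate
   * collectives. The payload size here is arbitrary. */
  static void bcast_like_mpi4py(int root, MPI_Comm comm)
  {
      int   nbytes = 64;
      char *payload;
      MPI_Bcast(&nbytes, 1, MPI_INT, root, comm);        /* 1st collective: size    */
      payload = (char *) malloc(nbytes);
      MPI_Bcast(payload, nbytes, MPI_BYTE, root, comm);  /* 2nd collective: payload */
      free(payload);
  }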



If you run this program without checkpoint control, it runs forever
because of the infinite loop inside. However, if I run it under
checkpoint control (mpirun -am ft-enable-cr), it occasionally fails with
the following messages:

=== Error message START ==

[Host1:7398] *** An error occurred in MPI_Bcast

[Host1:7398] *** on communicator MPI_COMM_WORLD

[Host1:7398] *** MPI_ERR_TRUNCATE: message truncated

[Host1:7398] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)

=== Error message END ==



While running the mpi4py program, I used to get these errors
occasionally from cPickle().



I tried this program with OpenMPI 1.4.2 and the OpenMPI trunk, and the
behavior is the same.



I have not hit the hang condition I had seen while checkpointing the
mpi4py program, but I suspect this issue may be manifesting as that hang
at times.



Let me know if you need any other information.



Thanks

Ananda





Ananda B Mudar, PMP

Senior Technical Architect

Wipro Technologies

Ph: 972 765 8093

ananda.mu...@wipro.com

--- Original Message ---

Subject: Re: [OMPI users] Checkpointing mpi4py program
From: Joshua Hursey (jjhursey_at_[hidden])
List-Post: users@lists.open-mpi.org
Date: 2010-08-18 16:48:17

I just fixed the --stop bug that you highlighted in r23627.

As far as the mpi4py program, I don't really know what to suggest. I don't have 
a setup to test this locally and am completely unfamiliar with mpi4py. Can you 
reproduce this with just a C program?

-- Josh

On Aug 16, 2010, at 12:25 PM,  
 wrote:

> Josh
>
> I have one more update on my observation while analyzing this issue.
>
> Just to refresh, I am using openmpi-trunk release 23596 with mpi4py-1.2.1 and 
> BLCR 0.8.2. When I checkpoint the python script written using mpi4py, the 
> program doesn't progress after the checkpoint is taken successfully. I tried 
> it with openmpi 1.4.2 and then tried it with the latest trunk version as 
> suggested. I see the similar behavior in both the releases.
>
> I have one more interesting observation which I thought may be useful. I 
> tried the "-stop" option of ompi-checkpoint (trunk version) and the mpirun 
> prints the following error messages when I run the command "ompi-checkpoint 
> -stop -v ":
>
>  Error messages in the window where mpirun command was running START 
> ==
> [hpdcnln001:15148] Error: ( app) Passed an invalid handle (0) [5 
> ="/tmp/openmpi-sessions-amudar_at_hpdcnln001_0/37739/1"]
> [hpdcnln001:15148] [[37739,1],2] ORTE_ERROR_LOG: Error in file 
> ../../../../../orte/mca/sstore/central/sstore_central_module.c at line 253
> [hpdcnln001:15149] Error: ( app) Passed an invalid handle (0) [5 
> 

Re: [OMPI users] Checkpointing mpi4py program (Probably bcast issue)

2010-08-18 Thread ananda.mudar
Josh



Thanks for addressing the issue. I will try the new version that has
your fix and let you know.



BTW, I have also been in touch with the mpi4py team to debug this issue.
According to the mpi4py team, its broadcast is implemented with two
collective calls: first an MPI_Bcast() of a single integer, and then an
MPI_Bcast() of a chunk of memory. Since the problem I was running into
occurred during MPI_Bcast() calls, I mimicked the mpi4py logic and wrote
a program in C. I have attached it to this mail for your reference.



If you run this program without checkpoint control, it runs forever
because of the infinite loop inside. However, if I run it under
checkpoint control (mpirun -am ft-enable-cr), it occasionally fails
with the following messages:

=== Error message START ==

[Host1:7398] *** An error occurred in MPI_Bcast

[Host1:7398] *** on communicator MPI_COMM_WORLD

[Host1:7398] *** MPI_ERR_TRUNCATE: message truncated

[Host1:7398] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)

=== Error message END ==



While running the mpi4py program, I used to get these errors
occasionally from cPickle().



I tried this program with OpenMPI 1.4.2 and the OpenMPI trunk, and the
behavior is the same.



I have not hit the hang condition I had seen while checkpointing the
mpi4py program, but I suspect this issue may be manifesting as that hang
at times.



Let me know if you need any other information.



Thanks

Ananda





Ananda B Mudar, PMP

Senior Technical Architect

Wipro Technologies

Ph: 972 765 8093

ananda.mu...@wipro.com

--- Original Message ---

Subject: Re: [OMPI users] Checkpointing mpi4py program
From: Joshua Hursey (jjhursey_at_[hidden])
List-Post: users@lists.open-mpi.org
Date: 2010-08-18 16:48:17

I just fixed the --stop bug that you highlighted in r23627.

As far as the mpi4py program, I don't really know what to suggest. I
don't have a setup to test this locally and am completely unfamiliar
with mpi4py. Can you reproduce this with just a C program?

-- Josh

On Aug 16, 2010, at 12:25 PM, 
 wrote:

> Josh
>
> I have one more update on my observation while analyzing this issue.
>
> Just to refresh, I am using openmpi-trunk release 23596 with
> mpi4py-1.2.1 and BLCR 0.8.2. When I checkpoint the python script written
> using mpi4py, the program doesn't progress after the checkpoint is taken
> successfully. I tried it with openmpi 1.4.2 and then tried it with the
> latest trunk version as suggested. I see similar behavior in both
> releases.
>
> I have one more interesting observation that I thought might be useful.
> I tried the "-stop" option of ompi-checkpoint (trunk version), and
> mpirun prints the following error messages when I run the command
> "ompi-checkpoint -stop -v ":
>
> ==== Error messages in the window where mpirun command was running START ====
> [hpdcnln001:15148] Error: ( app) Passed an invalid handle (0) [5
> ="/tmp/openmpi-sessions-amudar_at_hpdcnln001_0/37739/1"]
> [hpdcnln001:15148] [[37739,1],2] ORTE_ERROR_LOG: Error in file
> ../../../../../orte/mca/sstore/central/sstore_central_module.c at line 253
> [hpdcnln001:15149] Error: ( app) Passed an invalid handle (0) [5
> ="/tmp/openmpi-sessions-amudar_at_hpdcnln001_0/37739/1"]
> [hpdcnln001:15149] [[37739,1],3] ORTE_ERROR_LOG: Error in file
> ../../../../../orte/mca/sstore/central/sstore_central_module.c at line 253
> [hpdcnln001:15146] Error: ( app) Passed an invalid handle (0) [5
> ="/tmp/openmpi-sessions-amudar_at_hpdcnln001_0/37739/1"]
> [hpdcnln001:15146] [[37739,1],0] ORTE_ERROR_LOG: Error in file
> ../../../../../orte/mca/sstore/central/sstore_central_module.c at line 253
> [hpdcnln001:15147] Error: ( app) Passed an invalid handle (0) [5
> ="/tmp/openmpi-sessions-amudar_at_hpdcnln001_0/37739/1"]
> [hpdcnln001:15147] [[37739,1],1] ORTE_ERROR_LOG: Error in file
> ../../../../../orte/mca/sstore/central/sstore_central_module.c at line 253
> ==== Error messages in the window where mpirun command was running END ====
>
> Please note that the checkpoint image was created at the end of it.
> However, when I run the command "kill -CONT ", the program fails to
> move forward, which is the same as the original problem I reported.
>
> Let me know if you need any additional information.
>
> Thanks for your time in advance
>
> - Ananda
>
> Ananda B Mudar, PMP
> Senior Technical Architect
> Wipro Technologies
> Ph: 972 765 8093
> ananda.mudar_at_[hidden]
>
> From: Ananda Babu Mudar
> Sent: Sunday, August 15, 2010 11:25 PM
> To: users_at_[hidden]
> Subject: Re: [OMPI users] Checkpointing mpi4py program
> Importance: High
>
> Josh
>
> I tried running the mpi4py program with the latest trunk version of
openmpi. I have compiled 

Re: [OMPI users] Checkpointing mpi4py program

2010-08-16 Thread ananda.mudar
Josh



I have one more update on my observation while analyzing this issue.



Just to refresh, I am using openmpi-trunk release 23596 with
mpi4py-1.2.1 and BLCR 0.8.2. When I checkpoint the python script written
using mpi4py, the program doesn't progress after the checkpoint is taken
successfully. I tried it with openmpi 1.4.2 and then tried it with the
latest trunk version as suggested. I see similar behavior in both
releases.



I have one more interesting observation that I thought might be useful.
I tried the "-stop" option of ompi-checkpoint (trunk version), and
mpirun prints the following error messages when I run the command
"ompi-checkpoint -stop -v ":



==== Error messages in the window where mpirun command was running START ====

[hpdcnln001:15148] Error: (   app) Passed an invalid handle (0) [5
="/tmp/openmpi-sessions-amudar@hpdcnln001_0/37739/1"]
[hpdcnln001:15148] [[37739,1],2] ORTE_ERROR_LOG: Error in file
../../../../../orte/mca/sstore/central/sstore_central_module.c at line 253
[hpdcnln001:15149] Error: (   app) Passed an invalid handle (0) [5
="/tmp/openmpi-sessions-amudar@hpdcnln001_0/37739/1"]
[hpdcnln001:15149] [[37739,1],3] ORTE_ERROR_LOG: Error in file
../../../../../orte/mca/sstore/central/sstore_central_module.c at line 253
[hpdcnln001:15146] Error: (   app) Passed an invalid handle (0) [5
="/tmp/openmpi-sessions-amudar@hpdcnln001_0/37739/1"]
[hpdcnln001:15146] [[37739,1],0] ORTE_ERROR_LOG: Error in file
../../../../../orte/mca/sstore/central/sstore_central_module.c at line 253
[hpdcnln001:15147] Error: (   app) Passed an invalid handle (0) [5
="/tmp/openmpi-sessions-amudar@hpdcnln001_0/37739/1"]
[hpdcnln001:15147] [[37739,1],1] ORTE_ERROR_LOG: Error in file
../../../../../orte/mca/sstore/central/sstore_central_module.c at line 253

==== Error messages in the window where mpirun command was running END ====



Please note that the checkpoint image was created at the end of it.
However, when I run the command "kill -CONT ", the program fails to
move forward, which is the same as the original problem I reported.
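
For reference, the sequence being described is roughly the following (a
sketch; the pids elided above would be filled in):

  ompi-checkpoint -stop -v <pid of mpirun>   # take a checkpoint, leaving the processes stopped
  kill -CONT <pid>                           # afterwards, resume the stopped processes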



Let me know if you need any additional information.



Thanks for your time in advance



-  Ananda



Ananda B Mudar, PMP

Senior Technical Architect

Wipro Technologies

Ph: 972 765 8093

ananda.mu...@wipro.com



From: Ananda Babu Mudar (WT01 - Energy and Utilities)
Sent: Sunday, August 15, 2010 11:25 PM
To: us...@open-mpi.org
Subject: Re: [OMPI users] Checkpointing mpi4py program
Importance: High



Josh

I tried running the mpi4py program with the latest trunk version of
openmpi. I have compiled openmpi-1.7a1r23596 from the trunk and
recompiled mpi4py to use this library. Unfortunately, I see the same
behavior as with openmpi 1.4.2, i.e., the checkpoint succeeds but the
program doesn't proceed after that.

I have attached the stack traces of all the MPI processes that are part
of the mpirun. I would really appreciate it if you could take a look at
the stack traces and let me know the potential problem. I am stuck at
this point and need your assistance to move forward. Please let me know
if you need any additional information.

Thanks for your time in advance

Thanks

Ananda

-Original Message-
Subject: Re: [OMPI users] Checkpointing mpi4py program
From: Joshua Hursey (jjhursey_at_[hidden])
List-Post: users@lists.open-mpi.org
Date: 2010-08-13 12:28:31

Nope. I probably won't get to it for a while. I'll let you know if I do.


On Aug 13, 2010, at 12:17 PM, 
 wrote:

> OK, I will do that.
>
> But did you try this program on a system where the latest trunk is
> installed? Were you successful in checkpointing?
>
> - Ananda
> -Original Message-
> Message: 9
> Date: Fri, 13 Aug 2010 10:21:29 -0400
> From: Joshua Hursey 
> Subject: Re: [OMPI users] users Digest, Vol 1658, Issue 2
> To: Open MPI Users 
> Message-ID: <7A43615B-A462-4C72-8112-496653D8F0A0_at_[hidden]>
> Content-Type: text/plain; charset=us-ascii
>
> I probably won't have an opportunity to work on reproducing this on
the
> 1.4.2. The trunk has a bunch of bug fixes that probably will not be
> backported to the 1.4 series (things have changed too much since that
> branch). So I would suggest trying the 1.5 series.
>
> -- Josh
>
> On Aug 13, 2010, at 10:12 AM, 
>  wrote:
>
>> Josh
>>
>> I am having problems compiling the sources from the latest trunk. It
>> complains of libgomp.spec missing even though that file exists on my
>> system. I will see if I have to change any other environment
variables
>> to have a successful compilation. I will keep you posted.
>>
>> BTW, were you successful in reproducing the problem on a system with
>> OpenMPI 1.4.2?
>>
>> Thanks
>> Ananda
>> -Original Message-
>> Date: Thu, 12 Aug 2010 09:12:26 -0400
>> From: Joshua Hursey 

Re: [OMPI users] Checkpointing mpi4py program

2010-08-16 Thread ananda.mudar
Josh

I tried running the mpi4py program with the latest trunk version of
openmpi. I have compiled openmpi-1.7a1r23596 from the trunk and
recompiled mpi4py to use this library. Unfortunately, I see the same
behavior as with openmpi 1.4.2, i.e., the checkpoint succeeds but the
program doesn't proceed after that.

I have attached the stack traces of all the MPI processes that are part
of the mpirun. I would really appreciate it if you could take a look at
the stack traces and let me know the potential problem. I am stuck at
this point and need your assistance to move forward. Please let me know
if you need any additional information.

Thanks for your time in advance

Thanks

Ananda

-Original Message-
Subject: Re: [OMPI users] Checkpointing mpi4py program
From: Joshua Hursey (jjhursey_at_[hidden])
List-Post: users@lists.open-mpi.org
Date: 2010-08-13 12:28:31

Nope. I probably won't get to it for a while. I'll let you know if I do.


On Aug 13, 2010, at 12:17 PM, 
 wrote:

> OK, I will do that.
>
> But did you try this program on a system where the latest trunk is
> installed? Were you successful in checkpointing?
>
> - Ananda
> -Original Message-
> Message: 9
> Date: Fri, 13 Aug 2010 10:21:29 -0400
> From: Joshua Hursey 
> Subject: Re: [OMPI users] users Digest, Vol 1658, Issue 2
> To: Open MPI Users 
> Message-ID: <7A43615B-A462-4C72-8112-496653D8F0A0_at_[hidden]>
> Content-Type: text/plain; charset=us-ascii
>
> I probably won't have an opportunity to work on reproducing this on
the
> 1.4.2. The trunk has a bunch of bug fixes that probably will not be
> backported to the 1.4 series (things have changed too much since that
> branch). So I would suggest trying the 1.5 series.
>
> -- Josh
>
> On Aug 13, 2010, at 10:12 AM, 
>  wrote:
>
>> Josh
>>
>> I am having problems compiling the sources from the latest trunk. It
>> complains of libgomp.spec missing even though that file exists on my
>> system. I will see if I have to change any other environment
variables
>> to have a successful compilation. I will keep you posted.
>>
>> BTW, were you successful in reproducing the problem on a system with
>> OpenMPI 1.4.2?
>>
>> Thanks
>> Ananda
>> -Original Message-
>> Date: Thu, 12 Aug 2010 09:12:26 -0400
>> From: Joshua Hursey 
>> Subject: Re: [OMPI users] Checkpointing mpi4py program
>> To: Open MPI Users 
>> Message-ID: <1F1445AB-9208-4EF0-AF25-5926BD53C7E1_at_[hidden]>
>> Content-Type: text/plain; charset=us-ascii
>>
>> Can you try this with the current trunk (r23587 or later)?
>>
>> I just added a number of new features and bug fixes, and I would be
>> interested to see if it fixes the problem. In particular I suspect that
>> this might be related to the Init/Finalize bounding of the checkpoint
>> region.
>>
>> -- Josh
>>
>> On Aug 10, 2010, at 2:18 PM, 
>>  wrote:
>>
>>> Josh
>>>
>>> Please find attached the python program that reproduces the hang that
>>> I described. The initial part of this file describes the prerequisite
>>> modules and the steps to reproduce the problem. Please let me know if
>>> you have any questions in reproducing the hang.
>>>
>>> Please note that, if I add the following lines at the end of the program
>>> (in case sleep_time is True), the problem disappears, i.e., the program
>>> resumes successfully after successful completion of the checkpoint.
>>> # Add following lines at the end for sleep_time is True
>>> else:
>>>     time.sleep(0.1)
>>> # End of added lines
>>>
>>>
>>> Thanks a lot for your time in looking into this issue.
>>>
>>> Regards
>>> Ananda
>>>
>>> Ananda B Mudar, PMP
>>> Senior Technical Architect
>>> Wipro Technologies
>>> Ph: 972 765 8093
>>> ananda.mudar_at_[hidden]
>>>
>>>
>>> -Original Message-
>>> Date: Mon, 9 Aug 2010 16:37:58 -0400
>>> From: Joshua Hursey 
>>> Subject: Re: [OMPI users] Checkpointing mpi4py program
>>> To: Open MPI Users 
>>> Message-ID: <270BD450-743A-4662-9568-1FEDFCC6F9C6_at_[hidden]>
>>> Content-Type: text/plain; charset=windows-1252
>>>
>>> I have not tried to checkpoint an mpi4py application, so I cannot say
>>> for sure if it works or not. You might be hitting something with the
>>> Python runtime interacting in an odd way with either Open MPI or BLCR.
>>>
>>> Can you attach a debugger and get a backtrace on a stuck checkpoint?
>>> That might show us where things are held up.
>>>
>>> -- Josh
>>>
>>>
>>> On Aug 9, 2010, at 4:04 PM, 
>>>  wrote:
>>>
>>>> Hi
>>>>
>>>> I have integrated mpi4py with openmpi 1.4.2 that was built with BLCR
>>>> 0.8.2. When I run ompi-checkpoint on the program 

Re: [OMPI users] Checkpointing mpi4py program

2010-08-13 Thread ananda.mudar
Josh

I have stack traces of all 8 Python processes, taken when I observed the
hang after successful completion of a checkpoint. They are in the
attached document. Please see if these stack traces provide any clue.
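
For anyone trying to reproduce this, one way to gather such per-process
traces (a sketch; not necessarily how the attached ones were captured)
is to attach gdb to each stuck pid:

  gdb -p <pid> -batch -ex "thread apply all bt"   # print every thread's backtrace, then detach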

Thanks
Ananda



From: Ananda Babu Mudar (WT01 - Energy and Utilities)
Sent: Fri 8/13/2010 9:12 AM
To: us...@open-mpi.org
Subject: RE: users Digest, Vol 1658, Issue 2



Josh

I am having problems compiling the sources from the latest trunk. It
complains that libgomp.spec is missing, even though that file exists on
my system. I will see if I have to change any other environment
variables to get a successful compilation. I will keep you posted.

BTW, were you successful in reproducing the problem on a system with OpenMPI 
1.4.2?

Thanks
Ananda
-Original Message-
List-Post: users@lists.open-mpi.org
Date: Thu, 12 Aug 2010 09:12:26 -0400
From: Joshua Hursey 
Subject: Re: [OMPI users] Checkpointing mpi4py program
To: Open MPI Users 
Message-ID: <1f1445ab-9208-4ef0-af25-5926bd53c...@open-mpi.org>
Content-Type: text/plain; charset=us-ascii

Can you try this with the current trunk (r23587 or later)?

I just added a number of new features and bug fixes, and I would be interested 
to see if it fixes the problem. In particular I suspect that this might be 
related to the Init/Finalize bounding of the checkpoint region.

-- Josh

On Aug 10, 2010, at 2:18 PM,   
wrote:

> Josh
>
> Please find attached the python program that reproduces the hang that
> I described. The initial part of this file describes the prerequisite
> modules and the steps to reproduce the problem. Please let me know if
> you have any questions in reproducing the hang.
>
> Please note that, if I add the following lines at the end of the program
> (in case sleep_time is True), the problem disappears, i.e., the program
> resumes successfully after successful completion of the checkpoint.
> # Add following lines at the end for sleep_time is True
> else:
>   time.sleep(0.1)
> # End of added lines
>
>
> Thanks a lot for your time in looking into this issue.
>
> Regards
> Ananda
>
> Ananda B Mudar, PMP
> Senior Technical Architect
> Wipro Technologies
> Ph: 972 765 8093
> ananda.mu...@wipro.com
>
>
> -Original Message-
> Date: Mon, 9 Aug 2010 16:37:58 -0400
> From: Joshua Hursey 
> Subject: Re: [OMPI users] Checkpointing mpi4py program
> To: Open MPI Users 
> Message-ID: <270bd450-743a-4662-9568-1fedfcc6f...@open-mpi.org>
> Content-Type: text/plain; charset=windows-1252
>
> I have not tried to checkpoint an mpi4py application, so I cannot say
> for sure if it works or not. You might be hitting something with the
> Python runtime interacting in an odd way with either Open MPI or BLCR.
>
> Can you attach a debugger and get a backtrace on a stuck checkpoint?
> That might show us where things are held up.
>
> -- Josh
>
>
> On Aug 9, 2010, at 4:04 PM, 
>  wrote:
>
>> Hi
>>
>> I have integrated mpi4py with openmpi 1.4.2 that was built with BLCR
>> 0.8.2. When I run ompi-checkpoint on the program written using mpi4py, I
>> see that the program sometimes doesn't resume after successful checkpoint
>> creation. This doesn't always occur, meaning the program resumes after
>> successful checkpoint creation most of the time and completes
>> successfully. Has anyone tested the checkpoint/restart functionality
>> with mpi4py programs? Are there any best practices that I should keep in
>> mind while checkpointing mpi4py programs?
>>
>> Thanks for your time
>> -  Ananda
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] users Digest, Vol 1658, Issue 2

2010-08-13 Thread ananda.mudar
Josh

I am having problems compiling the sources from the latest trunk. It
complains that libgomp.spec is missing, even though that file exists on
my system. I will see if I have to change any other environment
variables to get a successful compilation. I will keep you posted.

BTW, were you successful in reproducing the problem on a system with
OpenMPI 1.4.2?

Thanks
Ananda
-Original Message-
List-Post: users@lists.open-mpi.org
Date: Thu, 12 Aug 2010 09:12:26 -0400
From: Joshua Hursey 
Subject: Re: [OMPI users] Checkpointing mpi4py program
To: Open MPI Users 
Message-ID: <1f1445ab-9208-4ef0-af25-5926bd53c...@open-mpi.org>
Content-Type: text/plain; charset=us-ascii

Can you try this with the current trunk (r23587 or later)?

I just added a number of new features and bug fixes, and I would be
interested to see if it fixes the problem. In particular I suspect that
this might be related to the Init/Finalize bounding of the checkpoint
region.

-- Josh

On Aug 10, 2010, at 2:18 PM, 
 wrote:

> Josh
> 
> Please find attached the python program that reproduces the hang that
> I described. The initial part of this file describes the prerequisite
> modules and the steps to reproduce the problem. Please let me know if
> you have any questions in reproducing the hang.
> 
> Please note that, if I add the following lines at the end of the program
> (in case sleep_time is True), the problem disappears, i.e., the program
> resumes successfully after successful completion of the checkpoint.
> # Add following lines at the end for sleep_time is True
> else:
>   time.sleep(0.1)
> # End of added lines
> 
> 
> Thanks a lot for your time in looking into this issue.
> 
> Regards
> Ananda
> 
> Ananda B Mudar, PMP
> Senior Technical Architect
> Wipro Technologies
> Ph: 972 765 8093
> ananda.mu...@wipro.com
> 
> 
> -Original Message-
> Date: Mon, 9 Aug 2010 16:37:58 -0400
> From: Joshua Hursey 
> Subject: Re: [OMPI users] Checkpointing mpi4py program
> To: Open MPI Users 
> Message-ID: <270bd450-743a-4662-9568-1fedfcc6f...@open-mpi.org>
> Content-Type: text/plain; charset=windows-1252
> 
> I have not tried to checkpoint an mpi4py application, so I cannot say
> for sure if it works or not. You might be hitting something with the
> Python runtime interacting in an odd way with either Open MPI or BLCR.
> 
> Can you attach a debugger and get a backtrace on a stuck checkpoint?
> That might show us where things are held up.
> 
> -- Josh
> 
> 
> On Aug 9, 2010, at 4:04 PM, 
>  wrote:
> 
>> Hi
>> 
>> I have integrated mpi4py with openmpi 1.4.2 that was built with BLCR
>> 0.8.2. When I run ompi-checkpoint on the program written using mpi4py, I
>> see that the program sometimes doesn't resume after successful checkpoint
>> creation. This doesn't always occur, meaning the program resumes after
>> successful checkpoint creation most of the time and completes
>> successfully. Has anyone tested the checkpoint/restart functionality
>> with mpi4py programs? Are there any best practices that I should keep in
>> mind while checkpointing mpi4py programs?
>> 
>> Thanks for your time
>> -  Ananda
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users




[OMPI users] Checkpointing mpi4py program

2010-08-09 Thread ananda.mudar
Hi



I have integrated mpi4py with openmpi 1.4.2 that was built with BLCR
0.8.2. When I run ompi-checkpoint on a program written using mpi4py, I
see that the program sometimes doesn't resume after a successful
checkpoint is created. This doesn't happen every time: most of the time
the program resumes after a successful checkpoint and completes
successfully. Has anyone tested the checkpoint/restart functionality
with mpi4py programs? Are there any best practices that I should keep in
mind while checkpointing mpi4py programs?
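
For context, the checkpoint/restart workflow being exercised is roughly
the following (a sketch assembled from the commands used in this thread;
angle-bracket items are placeholders):

  mpirun -am ft-enable-cr -np <N> python <script.py>   # run the job under C/R control
  ompi-checkpoint -v <pid of mpirun>                   # from another shell: take a checkpoint
  ompi-restart <snapshot handle>                       # later: restart from the saved snapshot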



Thanks for your time

-  Ananda




Re: [OMPI users] opal_cr_tmp_dir

2010-05-18 Thread ananda.mudar
That's correct. I prefixed them with OMPI_MCA_ when I defined them in my
environment. Despite that, I still see some of these files being created
under the default directory /tmp, which is different from what I had
set.
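
In other words, the setup is along these lines (a sketch using the
directories from my mca-params.conf; the variables have to be exported
in the same shell that later runs mpirun, ompi-checkpoint, or
ompi-restart):

  export OMPI_MCA_opal_cr_tmp_dir=/home/ananda/OPAL
  export OMPI_MCA_orte_tmpdir_base=/home/ananda/ORTE
  mpirun -am ft-enable-cr -np <N> ./<program>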



Thanks

Ananda



From: Josh Hursey 
Subject: Re: [OMPI users] opal_cr_tmp_dir
To: Open MPI Users 
Message-ID: 
Content-Type: text/plain; charset=WINDOWS-1252; format=flowed; delsp=yes

When you defined them in your environment did you prefix them with
'OMPI_MCA_'? Open MPI looks for this prefix to identify which parameters
are intended for it specifically.

-- Josh

On May 12, 2010, at 11:09 PM,  > wrote:

> Ralph
>
> Defining these parameters in my environment also did not resolve the
> problem. Whenever I restart my program, the temporary files are
> getting stored in the default /tmp directory instead of the directory
> I had defined.
>
> Thanks
>
> Ananda
>
> =
>
> Subject: Re: [OMPI users] opal_cr_tmp_dir
> From: Ralph Castain (rhc_at_[hidden])
> Date: 2010-05-12 19:48:16
>
> *   Previous message: ananda.mudar_at_[hidden]: "Re: [OMPI users] opal_cr_tmp_dir"
> *   In reply to: ananda.mudar_at_[hidden]: "Re: [OMPI users] opal_cr_tmp_dir"
>
> Define them in your environment prior to executing any of those commands.
>
> On May 12, 2010, at 4:43 PM,  wrote:
>
> > Ralph
> >
> > When you say manually, do you mean setting these parameters in the
> > command line while calling mpirun, ompi-restart, and ompi-checkpoint?
> > Or is there another way to set these parameters?
> >
> > Thanks
> >
> > Ananda
> >
> > ==
> >
> > Subject: Re: [OMPI users] opal_cr_tmp_dir
> > From: Ralph Castain (rhc_at_[hidden])
> > Date: 2010-05-12 18:09:17
> >
> > Previous message: ananda.mudar_at_[hidden]: "Re: [OMPI users] opal_cr_tmp_dir"
> > In reply to: ananda.mudar_at_[hidden]: "Re: [OMPI users] opal_cr_tmp_dir"
> > You shouldn't have to, but there may be a bug in the system. Try
> > manually setting both envars and see if it fixes the problem.
> >
> > On May 12, 2010, at 3:59 PM,  wrote:
> >
> > > Ralph
> > >
> > > I have these parameters set in ~/.openmpi/mca-params.conf file
> > >
> > > $ cat ~/.openmpi/mca-params.conf
> > > orte_tmpdir_base = /home/ananda/ORTE
> > > opal_cr_tmp_dir = /home/ananda/OPAL
> > > $
> > >
> > > Should I be setting OMPI_MCA_opal_cr_tmp_dir?
> > >
> > > FYI, I am using openmpi 1.3.4 with blcr 0.8.2
> > >
> > > Thanks
> > >
> > > Ananda
> > >
> > > =
> > >
> > > Subject: Re: [OMPI users] opal_cr_tmp_dir
> > > From: Ralph Castain (rhc_at_[hidden])
> > > Date: 2010-05-12 16:47:16
> > >
> > > Previous message: Jeff Squyres: "Re: [OMPI users] getc in openmpi"
> > > In reply to: ananda.mudar_at_[hidden]: "Re: [OMPI users] opal_cr_tmp_dir"
> > > ompi-restart just does a fork/exec of the mpirun, so it should get
> > > the param if it is in your environ. How are you setting it? Have you
> > > tried adding OMPI_MCA_opal_cr_tmp_dir= to your environment?
> > >
> > > On May 12, 2010, at 12:45 PM,  wrote:
> > >
> > > > Thanks Ralph.
> > > >
> > > > Another question. Even though I am setting opal_cr_tmp_dir to a
> > > > directory other than /tmp while calling ompi-restart command, this
> > > > setting is not getting passed to the mpirun command that gets
> > > > generated by ompi-restart. How do I overcome this constraint?
> > > >
> > > > Thanks
> > > >
> > > > Ananda
> > > >
> > > > ==
> > > >
> > > > Subject: Re: [OMPI users] opal_cr_tmp_dir
> > > > From: Ralph Castain (rhc_at_[hidden])
> > > > Date: 2010-05-12 14:38:00
> > > >
> > > > Previous message: ananda.mudar_at_[hidden]: "[OMPI users] opal_cr_tmp_dir"
> > > > In reply to: ananda.mudar_at_[hidden]: "[OMPI users] opal_cr_tmp_dir"
> > > > It's a different MCA param: orte_tmpdir_base
> > > >
> > > > On May 12, 2010, at 12:33 PM,  wrote:
> > > >
> > > > > I am setting the MCA parameter "opal_cr_tmp_dir" to a directory
> > > > > other than /tmp while calling "mpirun", "ompi-restart", and
> > > > > "ompi-checkpoint" commands so that I don't fill up /tmp filesystem.
> > > > > But I see that openmpi-sessions* directory is still getting created
> > > > > under /tmp. How do I overcome this problem so that openmpi-sessions*
> > > > > directory also gets created under the same directory I have defined
> > > > > for "opal_cr_tmp_dir"?
> > > > >
> > > > > Is there a way to clean up these temporary files after their
> > > > > requirement is over?
> > > > >
> > > > > Thanks
> > > > > Ananda
> > > > > Please do not print this 

[OMPI users] ompi-restart fails with "found pid in use"

2010-05-14 Thread ananda.mudar
Hi



I am using Open MPI v1.3.4 with BLCR 0.8.2. I have been testing my
openmpi-based program on a 3-node cluster (each node is an Intel Nehalem
based dual quad-core) and I have successfully checkpointed and restarted
the program multiple times.



Recently I moved to a 15-node cluster with the same configuration, and I
started seeing a problem with ompi-restart.



ompi-checkpoint completes successfully and I terminate the program after
that. I have ensured that no MPI processes are left running before I
restart. When I restart using ompi-restart, a few of the MPI processes
fail to restart, with the error message "found pid 4185 in use; Restart
failed: Device or Resource busy" (with different pid numbers, of
course). What I found is that when an MPI process was restarted on a
different node than the one it was running on before termination, it
could not reuse its pid.



Unlike cr_restart (BLCR), ompi-restart doesn't have an option such as
"--no-restore-pid" to say the same pid should not be reused. Since
ompi-restart in turn calls cr_restart, I tried aliasing cr_restart to
"cr_restart --no-restore-pid". This actually made the "pid in use"
problem go away, and the process completes successfully. However, if I
call ompi-checkpoint on the restarted Open MPI job, both the job (all
MPI processes) and the checkpoint command hang forever. I guess this is
because ompi-restart tracks a different set of pids than the ones that
are actually running.
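
A script equivalent of that alias (a sketch; it assumes the real binary
lives at /usr/bin/cr_restart, and it carries the same caveat that the
runtime still tracks the old pids) would be placed ahead of the real
cr_restart in PATH:

  #!/bin/sh
  # cr_restart wrapper: always pass --no-restore-pid to the real binary
  # (path below is an assumption; adjust to where cr_restart is installed)
  exec /usr/bin/cr_restart --no-restore-pid "$@"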



Long story short, I am stuck with this problem as I cannot get the
original pids during restart.



I would really appreciate any other options you can share that I could
use to overcome this problem.



Thanks

Ananda




Re: [OMPI users] opal_cr_tmp_dir

2010-05-13 Thread ananda.mudar
Ralph

Defining these parameters in my environment also did not resolve the
problem. Whenever I restart my program, the temporary files are getting
stored in the default /tmp directory instead of the directory I had
defined.

Thanks

Ananda

=

Subject: Re: [OMPI users] opal_cr_tmp_dir
From: Ralph Castain (rhc_at_[hidden])
List-Post: users@lists.open-mpi.org
Date: 2010-05-12 19:48:16

*   Previous message: ananda.mudar_at_[hidden]: "Re: [OMPI users]
opal_cr_tmp_dir"

*   In reply to: ananda.mudar_at_[hidden]: "Re: [OMPI users]
opal_cr_tmp_dir"




Define them in your environment prior to executing any of those
commands.

On May 12, 2010, at 4:43 PM,  wrote:

> Ralph
>
> When you say manually, do you mean setting these parameters in the
command line while calling mpirun, ompi-restart, and ompi-checkpoint? Or
is there another way to set these parameters?
>
> Thanks
>
> Ananda
>
> ==
>
> Subject: Re: [OMPI users] opal_cr_tmp_dir
> From: Ralph Castain (rhc_at_[hidden])
> Date: 2010-05-12 18:09:17
>
> Previous message: ananda.mudar_at_[hidden]: "Re: [OMPI users]
opal_cr_tmp_dir"
> In reply to: ananda.mudar_at_[hidden]: "Re: [OMPI users]
opal_cr_tmp_dir"
> You shouldn't have to, but there may be a bug in the system. Try
manually setting both envars and see if it fixes the problem.
>
> On May 12, 2010, at 3:59 PM,  wrote:
>
> > Ralph
> >
> > I have these parameters set in ~/.openmpi/mca-params.conf file
> >
> > $ cat ~/.openmpi/mca-params.conf
> >
> > orte_tmpdir_base = /home/ananda/ORTE
> >
> > opal_cr_tmp_dir = /home/ananda/OPAL
> >
> > $
> >
> >
> >
> > Should I be setting OMPI_MCA_opal_cr_tmp_dir?
> >
> >
> >
> > FYI, I am using openmpi 1.3.4 with blcr 0.8.2
> >
> >
> > Thanks
> >
> > Ananda
> >
> > =
> >
> > Subject: Re: [OMPI users] opal_cr_tmp_dir
> > From: Ralph Castain (rhc_at_[hidden])
> > Date: 2010-05-12 16:47:16
> >
> > Previous message: Jeff Squyres: "Re: [OMPI users] getc in openmpi"
> > In reply to: ananda.mudar_at_[hidden]: "Re: [OMPI users]
opal_cr_tmp_dir"
> > ompi-restart just does a fork/exec of the mpirun, so it should get
the param if it is in your environ. How are you setting it? Have you
tried adding OMPI_MCA_opal_cr_tmp_dir= to your environment?
> >
> > On May 12, 2010, at 12:45 PM,  wrote:
> >
> > > Thanks Ralph.
> > >
> > > Another question. Even though I am setting opal_cr_tmp_dir to a
directory other than /tmp while calling ompi-restart command, this
setting is not getting passed to the mpirun command that gets generated
by ompi-restart. How do I overcome this constraint?
> > >
> > >
> > >
> > > Thanks
> > >
> > > Ananda
> > >
> > > ==
> > >
> > > Subject: Re: [OMPI users] opal_cr_tmp_dir
> > > From: Ralph Castain (rhc_at_[hidden])
> > > Date: 2010-05-12 14:38:00
> > >
> > > Previous message: ananda.mudar_at_[hidden]: "[OMPI users]
opal_cr_tmp_dir"
> > > In reply to: ananda.mudar_at_[hidden]: "[OMPI users]
opal_cr_tmp_dir"
> > > It's a different MCA param: orte_tmpdir_base
> > >
> > > On May 12, 2010, at 12:33 PM,  wrote:
> > >
> > > > I am setting the MCA parameter "opal_cr_tmp_dir" to a directory
other than /tmp while calling "mpirun", "ompi-restart", and
"ompi-checkpoint" commands so that I don't fill up /tmp filesystem. But
I see that openmpi-sessions* directory is still getting created under
/tmp. How do I overcome this problem so that openmpi-sessions* directory
also gets created under the same directory I have defined for
"opal_cr_tmp_dir"?
> > > >
> > > > Is there a way to clean up these temporary files after their
requirement is over?
> > > >
> > > > Thanks
> > > > Ananda
> > > >
> > > > ___
> > > > users mailing list
> > > > users_at_[hidden]
> > > > http://www.open-mpi.org/mailman/listinfo.cgi/users
> > >

Re: [OMPI users] opal_cr_tmp_dir

2010-05-12 Thread ananda.mudar
Ralph

When you say manually, do you mean setting these parameters in the
command line while calling mpirun, ompi-restart, and ompi-checkpoint? Or
is there another way to set these parameters?

Thanks

Ananda

==

Subject: Re: [OMPI users] opal_cr_tmp_dir
From: Ralph Castain (rhc_at_[hidden])
List-Post: users@lists.open-mpi.org
Date: 2010-05-12 18:09:17

*   Previous message: ananda.mudar_at_[hidden]: "Re: [OMPI users]
opal_cr_tmp_dir"

*   In reply to: ananda.mudar_at_[hidden]: "Re: [OMPI users]
opal_cr_tmp_dir"




You shouldn't have to, but there may be a bug in the system. Try
manually setting both envars and see if it fixes the problem.

On May 12, 2010, at 3:59 PM,  wrote:

> Ralph
>
> I have these parameters set in ~/.openmpi/mca-params.conf file
>
> $ cat ~/.openmpi/mca-params.conf
>
> orte_tmpdir_base = /home/ananda/ORTE
>
> opal_cr_tmp_dir = /home/ananda/OPAL
>
> $
>
>
>
> Should I be setting OMPI_MCA_opal_cr_tmp_dir?
>
>
>
> FYI, I am using openmpi 1.3.4 with blcr 0.8.2
>
>
> Thanks
>
> Ananda
>
> =
>
> Subject: Re: [OMPI users] opal_cr_tmp_dir
> From: Ralph Castain (rhc_at_[hidden])
> Date: 2010-05-12 16:47:16
>
> Previous message: Jeff Squyres: "Re: [OMPI users] getc in openmpi"
> In reply to: ananda.mudar_at_[hidden]: "Re: [OMPI users]
opal_cr_tmp_dir"
> ompi-restart just does a fork/exec of the mpirun, so it should get the
param if it is in your environ. How are you setting it? Have you tried
adding OMPI_MCA_opal_cr_tmp_dir= to your environment?
>
> On May 12, 2010, at 12:45 PM,  wrote:
>
> > Thanks Ralph.
> >
> > Another question. Even though I am setting opal_cr_tmp_dir to a
directory other than /tmp while calling ompi-restart command, this
setting is not getting passed to the mpirun command that gets generated
by ompi-restart. How do I overcome this constraint?
> >
> >
> >
> > Thanks
> >
> > Ananda
> >
> > ==
> >
> > Subject: Re: [OMPI users] opal_cr_tmp_dir
> > From: Ralph Castain (rhc_at_[hidden])
> > Date: 2010-05-12 14:38:00
> >
> > Previous message: ananda.mudar_at_[hidden]: "[OMPI users]
opal_cr_tmp_dir"
> > In reply to: ananda.mudar_at_[hidden]: "[OMPI users]
opal_cr_tmp_dir"
> > It's a different MCA param: orte_tmpdir_base
> >
> > On May 12, 2010, at 12:33 PM,  wrote:
> >
> > > I am setting the MCA parameter "opal_cr_tmp_dir" to a directory
other than /tmp while calling "mpirun", "ompi-restart", and
"ompi-checkpoint" commands so that I don't fill up /tmp filesystem. But
I see that openmpi-sessions* directory is still getting created under
/tmp. How do I overcome this problem so that openmpi-sessions* directory
also gets created under the same directory I have defined for
"opal_cr_tmp_dir"?
> > >
> > > Is there a way to clean up these temporary files after their
requirement is over?
> > >
> > > Thanks
> > > Ananda
> > >
> > > ___
> > > users mailing list
> > > users_at_[hidden]
> > > http://www.open-mpi.org/mailman/listinfo.cgi/users
> >
> >
> > ___
> > users mailing list
> > users_at_[hidden]
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
>
> Ananda B Mudar, PMP
> Senior Technical 


Re: [OMPI users] opal_cr_tmp_dir

2010-05-12 Thread ananda.mudar
Ralph

I have these parameters set in ~/.openmpi/mca-params.conf file

$ cat ~/.openmpi/mca-params.conf

orte_tmpdir_base = /home/ananda/ORTE

opal_cr_tmp_dir = /home/ananda/OPAL

$



Should I be setting OMPI_MCA_opal_cr_tmp_dir?



FYI, I am using openmpi 1.3.4 with blcr 0.8.2


Thanks

Ananda

=

Subject: Re: [OMPI users] opal_cr_tmp_dir
From: Ralph Castain (rhc_at_[hidden])
List-Post: users@lists.open-mpi.org
Date: 2010-05-12 16:47:16

*   Previous message: Jeff Squyres: "Re: [OMPI users] getc in
openmpi"

*   In reply to: ananda.mudar_at_[hidden]: "Re: [OMPI users]
opal_cr_tmp_dir"




ompi-restart just does a fork/exec of the mpirun, so it should get the
param if it is in your environ. How are you setting it? Have you tried
adding OMPI_MCA_opal_cr_tmp_dir= to your environment?

On May 12, 2010, at 12:45 PM,  wrote:

> Thanks Ralph.
>
> Another question. Even though I am setting opal_cr_tmp_dir to a
directory other than /tmp while calling ompi-restart command, this
setting is not getting passed to the mpirun command that gets generated
by ompi-restart. How do I overcome this constraint?
>
>
>
> Thanks
>
> Ananda
>
> ==
>
> Subject: Re: [OMPI users] opal_cr_tmp_dir
> From: Ralph Castain (rhc_at_[hidden])
> Date: 2010-05-12 14:38:00
>
> Previous message: ananda.mudar_at_[hidden]: "[OMPI users]
opal_cr_tmp_dir"
> In reply to: ananda.mudar_at_[hidden]: "[OMPI users] opal_cr_tmp_dir"
> It's a different MCA param: orte_tmpdir_base
>
> On May 12, 2010, at 12:33 PM,  wrote:
>
> > I am setting the MCA parameter "opal_cr_tmp_dir" to a directory
other than /tmp while calling "mpirun", "ompi-restart", and
"ompi-checkpoint" commands so that I don't fill up /tmp filesystem. But
I see that openmpi-sessions* directory is still getting created under
/tmp. How do I overcome this problem so that openmpi-sessions* directory
also gets created under the same directory I have defined for
"opal_cr_tmp_dir"?
> >
> > Is there a way to clean up these temporary files after their
requirement is over?
> >
> > Thanks
> > Ananda
> >
> > ___
> > users mailing list
> > users_at_[hidden]
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>





Ananda B Mudar, PMP

Senior Technical Architect

Wipro Technologies

Ph: 972 765 8093

ananda.mu...@wipro.com






Re: [OMPI users] opal_cr_tmp_dir

2010-05-12 Thread ananda.mudar
Thanks Ralph.

Another question. Even though I am setting opal_cr_tmp_dir to a
directory other than /tmp when calling the ompi-restart command, the
setting is not being passed to the mpirun command that ompi-restart
generates. How do I overcome this constraint?
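
In the meantime I am considering a small wrapper script, on the
assumption that an MCA parameter can also be supplied through an
OMPI_MCA_* environment variable that the forked mpirun would inherit
(untested sketch; the snapshot handle is only a placeholder):

#!/bin/sh
# export the parameter so the mpirun launched by ompi-restart inherits it
export OMPI_MCA_opal_cr_tmp_dir=/home/ananda/OPAL
exec ompi-restart ompi_global_snapshot_1234.ckpt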



Thanks

Ananda

==

Subject: Re: [OMPI users] opal_cr_tmp_dir
From: Ralph Castain (rhc_at_[hidden])
Date: 2010-05-12 14:38:00


It's a different MCA param: orte_tmpdir_base

On May 12, 2010, at 12:33 PM, ananda.mudar_at_[hidden] wrote:

> I am setting the MCA parameter "opal_cr_tmp_dir" to a directory other
> than /tmp when calling the "mpirun", "ompi-restart", and
> "ompi-checkpoint" commands so that I don't fill up the /tmp filesystem.
> But I see that the openmpi-sessions* directory is still getting created
> under /tmp. How do I overcome this problem so that the
> openmpi-sessions* directory is also created under the directory I have
> defined for "opal_cr_tmp_dir"?
>
> Is there a way to clean up these temporary files once they are no
> longer needed?
>
> Thanks
> Ananda




[OMPI users] opal_cr_tmp_dir

2010-05-12 Thread ananda.mudar
I am setting the MCA parameter "opal_cr_tmp_dir" to a directory other
than /tmp when calling the "mpirun", "ompi-restart", and
"ompi-checkpoint" commands so that I don't fill up the /tmp filesystem.
But I see that the openmpi-sessions* directory is still getting created
under /tmp. How do I overcome this problem so that the
openmpi-sessions* directory is also created under the directory I have
defined for "opal_cr_tmp_dir"?
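
For reference, this is roughly how I am passing the parameter (the
process count and program name below are only placeholders):

$ mpirun -am ft-enable-cr -mca opal_cr_tmp_dir /home/ananda/OPAL -np 16 ./my_mpi_app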



Is there a way to clean up these temporary files once they are no longer
needed?



Thanks

Ananda




[OMPI users] ompi-checkpoint fails sometimes

2010-05-11 Thread ananda.mudar
Hi



I am using open-mpi 1.3.4 with BLCR. Sometimes I run into a strange
problem with the ompi-checkpoint command. Even though I can see that all
of the MPI processes (equal to the -np argument) are running,
ompi-checkpoint still fails at times. I have always seen this failure
when the spawned MPI processes are not yet fully running, i.e., when
they are below 90% CPU utilization. How do I ensure that the MPI
processes are fully up and running before I issue ompi-checkpoint?
Dynamically detecting whether the processes have reached 90% CPU
utilization is not easy.
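
As a stopgap I am considering a simple retry loop around the checkpoint
(untested sketch; MPIRUN_PID stands for the PID of the mpirun process):

# retry the checkpoint a few times, pausing between attempts
for i in 1 2 3 4 5; do
    ompi-checkpoint $MPIRUN_PID && break
    sleep 30
done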



Are there any MCA parameters I can use to overcome this issue?



Thanks

Ananda




[OMPI users] Meaning and the significance of MCA parameter "opal_cr_use_thread"

2010-03-24 Thread ananda.mudar
The description of the MCA parameter "opal_cr_use_thread" is very short
at this URL: http://osl.iu.edu/research/ft/ompi-cr/api.php



Can someone explain the usefulness of enabling this parameter versus
disabling it? In other words, what are the pros and cons of disabling
it?



I found that this parameter gets enabled automatically when the openmpi
library is configured with the -ft-enable-threads option.
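
For context, this is how I believe the parameter can be toggled at run
time, assuming the usual 0/1 convention for boolean MCA parameters (the
process count and program name are only placeholders):

$ mpirun -am ft-enable-cr -mca opal_cr_use_thread 0 -np 16 ./my_mpi_app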



Thanks

Ananda




[OMPI users] mpirun with -am ft-enable-cr option runs slow if hyperthreading is disabled

2010-03-22 Thread ananda.mudar
Hi



If I run my compute-intensive openmpi based program using a regular
invocation of mpirun (i.e., mpirun -host <hostlist> -np <nprocs>), it
completes in a few seconds, but if I run the same program with the
"-am ft-enable-cr" option, it takes 10x as long to complete.



If I enable hyperthreading on my cluster nodes and then call mpirun with
the "-am ft-enable-cr" option, the program completes only a few seconds
slower than with the normal mpirun!



How can I improve the performance of mpirun with the "-am ft-enable-cr"
option when hyperthreading is disabled on my cluster nodes? Any pointers
will be really useful.



FYI, I am using the openmpi 1.3.4 library and BLCR 0.8.2. The cluster
nodes are Nehalem-based with 8 cores each.
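
One experiment I plan to try, on the guess that the checkpoint
coordination thread competes with my ranks for the 8 physical cores once
hyperthreading is off, is disabling that thread (sketch; host list,
process count, and program name are only placeholders):

$ mpirun -am ft-enable-cr -mca opal_cr_use_thread 0 -host <hostlist> -np <nprocs> ./my_mpi_app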



Thanks
Anand




[OMPI users] top command output shows huge CPU utilization when openmpi processes resume after the checkpoint

2010-03-21 Thread ananda.mudar
When I checkpoint my openmpi application using ompi-checkpoint, I see
that the top command suddenly shows some really huge numbers in the
"%CPU" field, such as 150%, 200%, etc. After some time, these numbers
come back to normal values under 100%. This happens exactly around the
time the checkpoint completes and the processes resume execution.
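
Since top reports a multi-threaded process as the sum of its threads, a
per-thread view may show where the extra utilization comes from (the PID
below is a placeholder for one of the MPI processes):

$ top -H -p <pid>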



Another behavior I have seen is that one MPI process starts to show a
different elapsed time than its peers. Is this because the checkpoint
was taken on behalf of that process?



For your reference, I am using open mpi 1.3.4 and BLCR 0.8.2 for
checkpointing.



Thanks
Anand




[OMPI users] mpirun with -am ft-enable-cr option takes longer time on certain configurations

2010-03-21 Thread ananda.mudar
I am observing a very strange performance issue with my openmpi program.



I have a compute-intensive openmpi based application that keeps its data
in memory, processes it, and then dumps it to a GPFS parallel file
system. The GPFS file system server is connected to a QDR InfiniBand
switch from Voltaire.



If my cluster is connected to a DDR InfiniBand switch, which in turn
connects to the file system server on the QDR switch, I can run my
application under checkpoint/restart control (with -am ft-enable-cr),
checkpoint it successfully with ompi-checkpoint, and the application
completes after only a few additional seconds.



If my cluster is connected to the same QDR switch that connects to the
file system server, my application takes close to 10x as long to
complete when run under checkpoint/restart control (with -am
ft-enable-cr). If I run the same application using a plain mpirun
command (i.e., without -am ft-enable-cr), it finishes within a minute.
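
Concretely, the two invocations I am comparing look roughly like this
(the process count and program name are only placeholders):

$ time mpirun -np 16 ./my_mpi_app
$ time mpirun -am ft-enable-cr -np 16 ./my_mpi_app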



I am using open mpi 1.3.4 and BLCR 0.8.2 for checkpointing.



Are there any specific MCA parameters that I should tune to address this
problem? Any other pointers will be really helpful.



Thanks

Anand

